
Search in the Catalogues and Directories

Hits 1 – 3 of 3

1
Unsupervised Learning for Handling Code-Mixed Data: A Case Study on POS Tagging of North-African Arabizi Dialect
In: EurNLP - First annual EurNLP ; https://hal.archives-ouvertes.fr/hal-02270527 ; EurNLP - First annual EurNLP, Oct 2019, London, United Kingdom (2019)
Abstract: International audience ; Pretrained language model representations are now ubiquitous in Natural Language Processing. In this work, we present some first results in adapting those models to out-of-domain textual data. Using Part-of-Speech (POS) tagging as our case study, we analyze the ability of BERT to model a complex North-African dialect (Arabizi).

BERT and Arabizi
We run our experiments on the released base multilingual version of BERT (Devlin et al. 2018), which was trained on the concatenated Wikipedias of 104 languages. BERT has never seen any Arabizi. Arabizi is visibly related to French in BERT's embedding space.

Summary
• Multilingual BERT can be used to build a decent Part-of-Speech tagger with a reasonable amount of annotated data
• Unsupervised adaptation improves performance in downstream POS tagging (+1 accuracy point)

Research questions
• Is BERT able to model out-of-domain languages such as Arabizi?
• Can we adapt BERT to Arabizi in an unsupervised way?

What is Arabizi?
Definitions
• Dialectal Arabic is a variety of Classical Arabic that differs from one region to another and is only spoken. Darija is the variety spoken in the Maghreb (Algeria, Tunisia, Morocco).
• Arabizi is the name given to dialectal Arabic transliterated into Latin script, mostly found online.
Key property: high variability
• No fixed spelling, morphological or syntactic norms
• Strong influence from foreign languages
• Code-switching between French and Darija

Unsupervised fine-tuning of BERT on Arabizi
We fine-tune BERT (MLM objective) on the 200k Arabizi sentences.

Collecting and filtering raw Arabizi data
We bootstrap a data set for Arabizi starting from 9000 sentences collected by Cotterell et al. (2014). Using keyword scraping, we collect 1 million UGC sentences comprising French, English and Arabizi. We filter 200k Arabizi sentences out of the raw corpus (94% F1 score) using our language identifier (cf. figure below).

Lexical normalization
We train a clustering lexical normalizer using edit and word2vec distances. This degrades downstream performance in POS tagging.

A new treebank
The first bottleneck in analyzing such a dialect is the lack of annotated resources. We developed a CoNLL-U treebank that includes Part-of-Speech tags, dependencies, and translations for 1500 sentences (originally posted on Facebook, in the Echorouk newspaper…).

Results
Model                                                   Accuracy
Baseline (udpipe)                                       73.7
Baseline + Normalization (udpipe)                       72.4
BERT + POS tuning                                       77.3
BERT + POS tuning + Normalization (udpipe)              69.9
BERT + Unsupervised Domain fine tuning + POS tuning     78.3
Final performance. Accuracy is reported on the test set, averaged over 5 runs.

[Figure 2: Validation accuracy while fine-tuning BERT on Arabizi data (200k sentences). x-axis: iterations (x1000); y-axis: masked language model accuracy; labels: French Wikipedia, Arabizi.]

Arabizi: vive mca w nchalah had l'3am championi
English: long live MCA and I hope that this year we will be champions
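The abstract mentions a clustering lexical normalizer based on edit and word2vec distances but does not describe the algorithm. The following is only a minimal sketch of how such a normalizer could look, not the authors' method: the greedy single-pass clustering, the weight `alpha`, the `threshold` value, the function names, and the two-dimensional toy vectors standing in for word2vec embeddings are all assumptions made for illustration.

```python
from math import sqrt


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cosine_distance(u, v) -> float:
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm


def combined_distance(w1, w2, vectors, alpha=0.5) -> float:
    """Weighted mix of length-normalized edit distance and embedding distance.

    alpha is a hypothetical mixing weight, not a value from the abstract.
    """
    lexical = edit_distance(w1, w2) / max(len(w1), len(w2))
    semantic = cosine_distance(vectors[w1], vectors[w2])
    return alpha * lexical + (1 - alpha) * semantic


def cluster_variants(words, vectors, threshold=0.4, alpha=0.5):
    """Greedy clustering: a word joins the first cluster whose
    representative (its first member) is within the threshold."""
    clusters = []
    for word in words:
        for cluster in clusters:
            if combined_distance(word, cluster[0], vectors, alpha) <= threshold:
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters


def normalization_map(clusters):
    """Map every spelling variant to its cluster representative."""
    return {word: cluster[0] for cluster in clusters for word in cluster}


# Toy vocabulary, assumed to be sorted by decreasing frequency so that the
# most frequent spelling becomes the representative. The vectors are made up.
words = ["inchallah", "nchalah", "nchallah", "championi"]
toy_vectors = {
    "inchallah": [0.9, 0.2],
    "nchalah": [1.0, 0.1],
    "nchallah": [1.0, 0.15],
    "championi": [0.1, 1.0],
}

if __name__ == "__main__":
    mapping = normalization_map(cluster_variants(words, toy_vectors))
    for variant, canonical in mapping.items():
        print(f"{variant} -> {canonical}")
```

In practice the vectors would come from a word2vec model trained on the raw corpus, and the threshold would need tuning. Note that the results reported in the abstract show normalization lowering POS tagging accuracy, so a step like this is not guaranteed to help downstream.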
Keyword: [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing
URL: https://hal.archives-ouvertes.fr/hal-02270527/file/poster-EURNLP.pdf
https://hal.archives-ouvertes.fr/hal-02270527/document
https://hal.archives-ouvertes.fr/hal-02270527
BASE
2
CamemBERT: a Tasty French Language Model
In: https://hal.inria.fr/hal-02445946 ; 2019 (2019)
BASE
3
Enhancing BERT for Lexical Normalization
In: The 5th Workshop on Noisy User-generated Text (W-NUT) ; https://hal.inria.fr/hal-02294316 ; The 5th Workshop on Noisy User-generated Text (W-NUT), Nov 2019, Hong Kong, China (2019)
BASE

© 2013 - 2024 Lin|gu|is|tik