
Search in the Catalogues and Directories

Hits 1 – 11 of 11

1
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
In: https://hal.inria.fr/hal-03177623 ; 2021 (2021)
Abstract: To appear in the proceedings of the AfricaNLP 2021 workshop. With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. However, to date there has been no systematic analysis of the quality of these publicly available datasets, or whether the datasets actually contain content in the languages they claim to represent. In this work, we manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), and audit the correctness of language codes in a sixth (JW300). We find that lower-resource corpora have systematic issues: at least 15 corpora are completely erroneous, and a significant fraction contains less than 50% sentences of acceptable quality. Similarly, we find 82 corpora that are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses. Inspired by our analysis, we recommend techniques to evaluate and improve multilingual corpora and discuss the risks that come with low-quality data releases.
Keyword: [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]
URL: https://hal.inria.fr/hal-03177623
BASE
2
MasakhaNER: Named entity recognition for African languages
In: EISSN: 2307-387X ; Transactions of the Association for Computational Linguistics ; https://hal.inria.fr/hal-03350962 ; Transactions of the Association for Computational Linguistics, The MIT Press, 2021, ⟨10.1162/tacl⟩ (2021)
BASE
3
Modelling Latent Translations for Cross-Lingual Transfer ...
BASE
4
Can Multilinguality benefit Non-autoregressive Machine Translation? ...
BASE
5
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets ...
BASE
6
Evaluating Multiway Multilingual NMT in the Turkic Languages ...
BASE
7
The Low-Resource Double Bind: An Empirical Study of Pruning for Low-Resource Machine Translation ...
BASE
8
Reinforcement Learning for Machine Translation: from Simulations to Real-World Applications ...
Kreutzer, Julia. - : Heidelberg University Library, 2020
BASE
9
Neural Machine Translation for Extremely Low-Resource African Languages: A Case Study on Bambara ...
BASE
10
Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages
BASE
11
Reinforcement Learning for Machine Translation: from Simulations to Real-World Applications
Kreutzer, Julia. - 2020
BASE
