Hits 21 – 40 of 258

21	How OCR Performance can Impact on the Automatic Extraction of Dictionary Content Structures
	Khemakhem, Mohamed; Galleron, Ioana; Williams, Geoffrey...
	In: 19th annual Conference and Members’ Meeting of the Text Encoding Initiative Consortium (TEI) -What is text, really? TEI and beyond ; https://hal.archives-ouvertes.fr/hal-02263276 ; 19th annual Conference and Members’ Meeting of the Text Encoding Initiative Consortium (TEI) -What is text, really? TEI and beyond, Sep 2019, Graz, Austria (2019)
	BASE

22	Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures
	Ortiz Suárez, Pedro Javier; Sagot, Benoît; Romary, Laurent
	In: 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7) ; https://hal.inria.fr/hal-02148693 ; 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7), Jul 2019, Cardiff, United Kingdom. ⟨10.14618/IDS-PUB-9021⟩ (2019)
	Abstract: Common Crawl is a considerably large, heterogeneous multilingual corpus comprised of crawled documents from the internet, surpassing 20TB of data and distributed as a set of more than 50 thousand plain text files, each containing many documents written in a wide variety of languages. Even though each document has a metadata block associated with it, this data lacks any information about the language in which each document is written, making it extremely difficult to use Common Crawl for monolingual applications. We propose a general, highly parallel, multithreaded pipeline to clean and classify Common Crawl by language; we specifically design it so that it runs efficiently on medium to low resource infrastructures where I/O speeds are the main constraint. We develop the pipeline so that it can be easily reapplied to any kind of heterogeneous corpus and so that it can be parameterised to a wide range of infrastructures. We also distribute a 6.3TB version of Common Crawl, filtered, classified by language, shuffled at line level in order to avoid copyright issues, and ready to be used for NLP applications.
	Keyword: [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]
	URL: https://hal.inria.fr/hal-02148693/file/Asynchronous_Pipeline_for_Processing_Huge_Corpora_on_Medium_to_Low_Resource_Infrastructures.pdf https://doi.org/10.14618/IDS-PUB-9021 https://hal.inria.fr/hal-02148693 https://hal.inria.fr/hal-02148693/document
	BASE

23	Nénufar: Modelling a Diachronic Collection of Dictionary Editions as a Computational Lexical Resource
	Bohbot, Hervé; Frontini, Francesca; Khan, Fahad...
	In: ELEX 2019: smart lexicography ; https://hal.inria.fr/hal-02272978 ; ELEX 2019: smart lexicography, Oct 2019, Sintra, Portugal (2019)
	BASE

24	LMF Reloaded
	Romary, Laurent; Khemakhem, Mohamed; Khan, Fahad...
	In: AsiaLex 2019: Past, Present and Future ; https://hal.inria.fr/hal-02118319 ; AsiaLex 2019: Past, Present and Future, Jun 2019, Istanbul, Turkey (2019)
	BASE

25	TEI Encoding of a Classical Mixtec Dictionary Using GROBID-Dictionaries
	Bowers, Jack; Khemakhem, Mohamed; Romary, Laurent
	In: ELEX 2019: Smart Lexicography ; https://hal.inria.fr/hal-02264033 ; ELEX 2019: Smart Lexicography, Oct 2019, Sintra, Portugal ; https://elex.link/elex2019/ (2019)
	BASE

26	CamemBERT: a Tasty French Language Model
	Martin, Louis; Muller, Benjamin; Ortiz Suárez, Pedro Javier...
	In: https://hal.inria.fr/hal-02445946 ; 2019 (2019)
	BASE

27	TEI and the Mixtepec-Mixtec corpus: data integration, annotation and normalization of heterogeneous data for an under-resourced language
	Bowers, Jack; Romary, Laurent
	In: 6th International Conference on Language Documentation and Conservation (ICLDC) ; https://hal.inria.fr/hal-02075475 ; 6th International Conference on Language Documentation and Conservation (ICLDC), Feb 2019, Honolulu, United States (2019)
	BASE

28	Preparing the Dictionnaire Universel for Automatic Enrichment
	Ortiz Suárez, Pedro Javier; Romary, Laurent; Sagot, Benoît
	In: 10th International Conference on Historical Lexicography and Lexicology (ICHLL) ; https://hal.inria.fr/hal-02131598 ; 10th International Conference on Historical Lexicography and Lexicology (ICHLL), Jun 2019, Leeuwarden, Netherlands ; https://easychair.org/smart-program/ICHLL-10/ (2019)
	BASE

29	Connecting the Humanities through Research Infrastructures
	Bassett, Sheena; Wessels, Leon; Krauwer, Steven...
	In: 4th Digital Humanities in the Nordic Countries (DHN 2019) ; https://hal.inria.fr/hal-02047512 ; 4th Digital Humanities in the Nordic Countries (DHN 2019), Mar 2019, Copenhagen, Denmark ; https://cst.dk/DHN2019/DHN2019.html (2019)
	BASE

30	The place of lexicography in (computer) science
	Romary, Laurent
	In: The Future of Academic Lexicography: Linguistic Knowledge Codification in the Era of Big Data and AI ; https://hal.inria.fr/hal-02358218 ; The Future of Academic Lexicography: Linguistic Knowledge Codification in the Era of Big Data and AI, Frieda Steurs; Dirk Geeraerts; Niels Schiller; Marian Klamer; Iztok Kosem, Nov 2019, Leiden, Netherlands ; https://www.lorentzcenter.nl/lc/web/2019/1177/program.php3?wsid=1177&venue=Oort (2019)
	BASE

31	Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures ...
	Ortiz Suárez, Pedro Javier; Sagot, Benoît; Romary, Laurent. - : Leibniz-Institut für Deutsche Sprache, 2019
	BASE

32	From disparate disciplines to unity in diversity. How the PARTHENOS project brings Humanities Research Infrastructures together ...
	Uiterwaal, Frank; Niccolucci, Franco; Bassett, Sheena. - : Zenodo, 2019
	BASE

33	From disparate disciplines to unity in diversity. How the PARTHENOS project brings Humanities Research Infrastructures together ...
	Uiterwaal, Frank; Niccolucci, Franco; Bassett, Sheena. - : Zenodo, 2019
	BASE

34	LMF Reloaded ...
	Romary, Laurent; Khemakhem, Mohamed; Khan, Fahad. - : arXiv, 2019
	BASE

35	LMF Reloaded ...
	Romary, Laurent; Khemakhem, Mohamed; Khan, Anas Fahad. - : Zenodo, 2019
	BASE

36	LMF Reloaded ...
	Romary, Laurent; Khemakhem, Mohamed; Khan, Anas Fahad. - : Zenodo, 2019
	BASE

37	Automatic TEI encoding of manuscripts catalogues with GROBID-Dictionaries ...
	Noyer, Lucie Rondeau Du; Gabay, Simon; Khemakhem, Mohamed. - : Zenodo, 2019
	BASE

38	TEI Lex-0: A Target Format for TEI-Encoded Dictionaries and Lexical Resources ...
	Romary, Laurent; Tasovac, Toma. - : Zenodo, 2019
	BASE

39	Automatic TEI encoding of manuscripts catalogues with GROBID-Dictionaries ...
	Noyer, Lucie Rondeau Du; Gabay, Simon; Khemakhem, Mohamed. - : Zenodo, 2019
	BASE

40	TEI Lex-0: A Target Format for TEI-Encoded Dictionaries and Lexical Resources ...
	Romary, Laurent; Tasovac, Toma. - : Zenodo, 2019
	BASE
