DE eng

Search in the Catalogues and Directories

Hits 1 – 10 of 10

1
Standard-based Lexical Models for Automatically Structured Dictionaries ; Modèles lexicaux standardisés pour les dictionnaires à structure automatique
Khemakhem, Mohamed. - : HAL CCSD, 2020
In: https://tel.archives-ouvertes.fr/tel-03153438 ; Computation and Language [cs.CL]. Université de Paris, 2020. English (2020)
BASE
Show details
2
How OCR Performance can Impact on the Automatic Extraction of Dictionary Content Structures
In: 19th annual Conference and Members’ Meeting of the Text Encoding Initiative Consortium (TEI) -What is text, really? TEI and beyond ; https://hal.archives-ouvertes.fr/hal-02263276 ; 19th annual Conference and Members’ Meeting of the Text Encoding Initiative Consortium (TEI) -What is text, really? TEI and beyond, Sep 2019, Graz, Austria (2019)
BASE
Show details
3
Nénufar: Modelling a Diachronic Collection of Dictionary Editions as a Computational Lexical Resource
In: ELEX 2019: smart lexicography ; https://hal.inria.fr/hal-02272978 ; ELEX 2019: smart lexicography, Oct 2019, Sintra, Portugal (2019)
BASE
Show details
4
LMF Reloaded
In: AsiaLex 2019: Past, Present and Future ; https://hal.inria.fr/hal-02118319 ; AsiaLex 2019: Past, Present and Future, Jun 2019, Istanbul, Turkey (2019)
BASE
Show details
5
TEI Encoding of a Classical Mixtec Dictionary Using GROBID- Dictionaries
In: ELEX 2019: Smart Lexicography ; https://hal.inria.fr/hal-02264033 ; ELEX 2019: Smart Lexicography, Oct 2019, Sintra, Portugal ; https://elex.link/elex2019/ (2019)
BASE
Show details
6
Enhancing Usability for Automatically Structuring Digitised Dictionaries
In: GLOBALEX workshop at LREC 2018 ; https://hal.archives-ouvertes.fr/hal-01708137 ; GLOBALEX workshop at LREC 2018, May 2018, Miyazaki, Japan (2018)
BASE
Show details
7
Retro-digitizing and Automatically Structuring a Large Bibliography Collection
In: European Association for Digital Humanities (EADH) Conference ; https://hal.archives-ouvertes.fr/hal-01941534 ; European Association for Digital Humanities (EADH) Conference, EADH, Dec 2018, Galway, Ireland (2018)
BASE
Show details
8
Automatically Encoding Encyclopedic-like Resources in TEI
In: The annual TEI Conference and Members Meeting ; https://hal.inria.fr/hal-01819505 ; The annual TEI Conference and Members Meeting, Sep 2018, Tokyo, Japan ; https://tei2018.dhii.asia/ (2018)
BASE
Show details
9
Presenting the Nénufar Project: a Diachronic Digital Edition of the Petit Larousse Illustré
In: GLOBALEX 2018 - Globalex workshop at LREC2018 ; https://hal.archives-ouvertes.fr/hal-01728328 ; GLOBALEX 2018 - Globalex workshop at LREC2018, May 2018, Miyazaki, Japan. pp.1-6 ; https://globalex.link/globalex2018/ (2018)
BASE
Show details
10
Automatic Extraction of TEI Structures in Digitized Lexical Resources using Conditional Random Fields
In: electronic lexicography, eLex 2017 ; https://hal.archives-ouvertes.fr/hal-01508868 ; electronic lexicography, eLex 2017, Sep 2017, Leiden, Netherlands (2017)
Abstract: International audience ; An important number of digitized lexical resources remain unexploited due to their unstructured content. Manually structuring such resources is a costly task given their multifold complexity. Our goal is to find an approach to automatically structure digitized dictionaries, independently from the language or the lexicographic school or style. In this paper we present a first version of GROBID-Dictionaries1, an open source machine learning system for lexical information extraction.Our approach is twofold: we perform a cascading structure extraction, while we select at each level specific features for training.We followed a ”divide to conquer” strategy to dismantle text constructs in a digitized dictionary, based on the observation of their layout. Main pages (see Figure 1) in almost any dictionary share three blocks: a header (green), a footer (blue) and a body (orange). The body is, in its turn, constituted by several entries (red). Each lexical entry can be further decomposed (see Figure 2) as: form (green), etymology (blue), sense (red) or/and related entry. The same logic could be applied further for each extracted block but in the scope of this paper we focus just on the first three levels.The cascading approach ensures a better understanding of the learning process’s output and consequently simplifies the feature selection process. Limited exclusive text blocks per level helps significantly in diagnosing the cause of prediction errors. It allows an early detection and replacement of irrelevant selected features that can bias a trained model. In such a segmentation, it becomes more straightforward to notice that, for instance, the token position in the page is very relevant to detect headers and footers and has almost no pertinence for capturing a sense in a lexical entry which is very often split on two pages.To implement our approach, we took up the available infrastructure from GROBID [7], a machine learning system for the extraction of bibliographic metadata. GROBID adopts the same cascading approach and uses Conditional Random Fields (CRF) [6] to label text sequences. The output of Grobid dictionary is planned to generate a TEI compliant encoding [2, 9] where the various segmentation levels are associated with an appropriate XML tessellation. Collaboration with COST ENeL are ongoing to ensure maximal compatibility with existing dictionary projects.Our experiments justify so far our choices, where models for the first two levels trained on two different dictionary samples have given a high precision and recall with a small amount of annotated data. Relying mainly on the text layout, we tried to diversify the selected features for each model, on the token and line levels. We are working on tuning features and annotating more data to maintain the good results with new samples and to improve the third segmentation level.While just few task specific attempts [1] have been using machine learning in this research direction, the landscape remains dominated by rule based techniquess [4, 3, 8] which are ad-hoc and costly, even impossible, to adapt for new lexical resources.
Keyword: [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]; [INFO.INFO-HC]Computer Science [cs]/Human-Computer Interaction [cs.HC]; [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing; [SHS.LANGUE]Humanities and Social Sciences/Linguistics; [STAT.ML]Statistics [stat]/Machine Learning [stat.ML]; automatic structuring; CRF; digitized dictionaries; machine learning; TEI
URL: https://hal.archives-ouvertes.fr/hal-01508868v2/document
https://hal.archives-ouvertes.fr/hal-01508868
https://hal.archives-ouvertes.fr/hal-01508868v2/file/eLex-2017-Template.pdf
BASE
Hide details

Catalogues
0
0
0
0
0
0
0
Bibliographies
0
0
0
0
0
0
0
0
0
Linked Open Data catalogues
0
Online resources
0
0
0
0
Open access documents
10
0
0
0
0
© 2013 - 2024 Lin|gu|is|tik | Imprint | Privacy Policy | Datenschutzeinstellungen ändern