1 |
Between History and Natural Language Processing: Study, Enrichment and Online Publication of French Parliamentary Debates of the Early Third Republic (1881-1899)
|
|
|
|
In: ParlaCLARIN III at LREC2022 - Workshop on Creating, Enriching and Using Parliamentary Corpora, Jun 2022, Marseille, France ; https://hal.archives-ouvertes.fr/hal-03623351 ; https://www.clarin.eu/ParlaCLARIN-III (2022)
|
|
Abstract:
We present the AGODA (Analyse sémantique et Graphes relationnels pour l'Ouverture des Débats à l'Assemblée nationale) project, which aims to create a platform for consulting and exploring digitised French parliamentary debates (1881-1940) available in the digital library of the National Library of France. This project brings together historians and NLP specialists: parliamentary debates are an essential source for French history of the contemporary period, but also for linguistics. The project therefore aims to produce a corpus of texts that can be easily exploited with computational methods and that complies with the TEI standard. Historical parliamentary debates are also an excellent case study for the development and application of tools for publishing and exploring large historical corpora. In this paper, we present the steps necessary to produce such a corpus. We detail the processing and publication chain of these documents, in particular the problems linked to the extraction of texts from digitised images. We also introduce the first analyses that we have carried out on this corpus with "bag-of-words" techniques that are not overly sensitive to OCR quality (namely topic modelling and word embedding).
|
|
Keywords:
[INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]; [INFO.INFO-CY]Computer Science [cs]/Computers and Society [cs.CY]; [INFO.INFO-IR]Computer Science [cs]/Information Retrieval [cs.IR]; [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing; [SHS.HIST]Humanities and Social Sciences/History; France; OCR; Parliamentary debates; Third Republic; Topic modelling; Word embedding; XML-TEI
|
|
URL: https://hal.archives-ouvertes.fr/hal-03623351/document https://hal.archives-ouvertes.fr/hal-03623351 https://hal.archives-ouvertes.fr/hal-03623351/file/puren_bourgeois_pellet_vernus_agoda2022.pdf
|
|
BASE
6 |
Terminological Methods in Lexicography: Conceptualising, Organising, and Encoding Terms in General Language Dictionaries
|
|
|
|
7 |
Giving Depth to TEI-Based Descriptions of Manuscripts: The Golden Gospel of Ham
|
|
|
|
In: Aethiopica; Vol. 24 (2021); 175–211 ; 2194-4024 ; 1430-1938 ; 10.15460/aethiopica.24.0 (2022)
|
|
8 |
Towards an Online Database of Ancient Dramatic Meters
|
|
|
|
In: FuturoClassico FCl; N. 7 (2021); 143-164 ; 2465-0951 (2022)
|
|
9 |
Understanding and reading XML ; Comprendre et lire le XML
|
|
|
|
In: https://halshs.archives-ouvertes.fr/halshs-03637142 ; École thématique. Comprendre et lire le XML, Bibliothèque du lab. CRISCO EA 4255, France. 2021, pp.72 ; Comprendre et lire le XML (2021)
|
|
10 |
XML and namespaces ; XML et espaces de nom
|
|
|
|
In: https://halshs.archives-ouvertes.fr/halshs-03637189 ; Doctorat. XML et espaces de nom, Bibliothèque du lab. CRISCO EA 4255, France. 2021, pp.44 ; XML et espaces de nom (2021)
|
|
11 |
Language Processing in Digital Editions of Russian 18th Century Texts ; Лингвистическая обработка цифровых изданий русских текстов XVIII века
|
|
|
|
In: Corpora 2021 International Conference, Saint-Petersburg State University, Jul 2021, Saint Petersburg, Russia ; https://halshs.archives-ouvertes.fr/halshs-03285725 ; https://events.spbu.ru/events/corpora-2021 (2021)
|
|
12 |
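The Base de français médiéval and the CAHIER consortium: ten years of exchanges and collaborations ; La Base de français médiéval et le consortium CAHIER : dix ans d'échanges et de collaborations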
|
|
|
|
In: 10 ans avec CAHIER. Des corpus d'auteurs pour les humanités à leur exploitation numérique, Jun 2021, Bordeaux, France ; https://halshs.archives-ouvertes.fr/halshs-03363517 ; https://cahier10.sciencesconf.org/344494 (2021)
|
|
13 |
Expanding the content model of annotationBlock
|
|
|
|
In: Next Gen TEI, 2021 - TEI Conference and Members’ Meeting, Oct 2021, Virtual, United States ; https://hal.archives-ouvertes.fr/hal-03380805 (2021)
|
|
20 |
Multilingual comparable corpora of parliamentary debates ParlaMint 2.1
|
|
|
|