2006 CoNLL Shared Task - Ten Languages
85.2 Mb
2006 CoNLL Shared Task - Ten Languages consists of dependency treebanks in ten languages used as part of the CoNLL 2006…
Turkish (tur); Bulgarian (bul); Dutch, Fl…
ELRA-W0086
Details
2006 CoNLL Shared Task - Ten Languages
Name
2006 CoNLL Shared Task - Ten Languages (ELRA-W0086)
URL
http://catalog.elra.info/product_info.php?products_id=1250
Description
2006 CoNLL Shared Task - Ten Languages consists of dependency treebanks in ten languages used as part of the CoNLL 2006 shared task on multi-lingual dependency parsing. The languages covered in this release are: Bulgarian, Danish, Dutch, German, Japanese, Portuguese, Slovene, Spanish, Swedish and Turkish. The source data in the treebanks in this release consists principally of various texts (e.g., textbooks, news, literature) annotated in dependency format.
Languages
Turkish (tur)
Bulgarian (bul)
Dutch, Flemish (dut)
German (ger)
Japanese (jpn)
Spanish, Castilian (spa)
Danish (dan)
Portuguese (por)
Swedish (swe)
Slovenian (slv)
Al-Hayat Arabic Corpus
1.1 Gb
The corpus contains articles extracted from the newspaper Al-Hayat, organised in 7 domains, for language engineering ap…
Arabic (ara)
ELRA-W0030
Amaryllis Corpus - Evaluation Package
505 Mb
AMARYLLIS was organised by the Institut de l'Information Scientifique et Technique (INIST) with the support of the Agen…
French (fre)
ELRA-W0029
Details
Amaryllis Corpus - Evaluation Package
Name
Amaryllis Corpus - Evaluation Package (ELRA-W0029)
URL
http://catalog.elra.info/product_info.php?products_id=626
Description
AMARYLLIS was organised by the Institut de l'Information Scientifique et Technique (INIST), with the support of the Agence francophone pour l'enseignement supérieur et la recherche (AUPELF-UREF) and the French Ministère de l'Education Nationale, de la Recherche et de la Technologie (MERT), to create document corpora, questions and answers within the framework of the Action de Recherche Concertée (ARC A1, renamed Amaryllis: Access to text information in French), with the aim of producing work comparable to the United States TREC project. All corpora are structured as SGML files in ISO Latin character encoding.
Languages
French (fre)
Amharic-English bilingual corpus
15 Mb
The Amharic-English bilingual corpus contains parallel text from legal and news domains in Amharic script, in translite…
English (eng); Amharic (amh)
ELRA-W0074
Details
Amharic-English bilingual corpus
Name
Amharic-English bilingual corpus (ELRA-W0074)
URL
http://catalog.elra.info/product_info.php?products_id=1215
Description
The Amharic-English bilingual corpus contains parallel text from legal and news domains in Amharic script, in transliterated form and in English. The corpus comprises 232,653 words in Amharic and 291,701 words in English.
Languages
English (eng)
Amharic (amh)
An-Nahar Newspaper Text Corpus
794 Mb
The An-Nahar Newspaper Text Corpus comprises articles in Arabic (Lebanon) from 1995 to 2000 (6 years) stored as HTML fi…
Arabic (ara)
ELRA-W0027
Details
An-Nahar Newspaper Text Corpus
Name
An-Nahar Newspaper Text Corpus (ELRA-W0027)
URL
http://catalog.elra.info/product_info.php?products_id=767
Description
The An-Nahar Newspaper Text Corpus comprises articles in Arabic (Lebanon) from 1995 to 2000 (6 years) stored as HTML files on CD-ROM media. Each year contains 45,000 articles and 24 million words.
Languages
Arabic (ara)
Arboretum treebank
26 Mb
The Arboretum treebank is a morphologically and syntactically annotated repository of Danish sentences. It consists of …
Danish (dan)
ELRA-W0084
Details
Arboretum treebank
Name
Arboretum treebank (ELRA-W0084)
URL
http://catalog.elra.info/product_info.php?products_id=1248
Description
The Arboretum treebank is a morphologically and syntactically annotated repository of Danish sentences. It consists of about 425,000 tokens and there are ca. 22,260 sentences/utterances containing 3 or more tokens. Arboretum provides named entity categories for all proper nouns. It also contains subclass categorisation for the pronoun and adverb word classes. The final version of the treebank consists of two independent versions, constituent trees and dependency trees, and is distributed in the following formats:
1. Native dependency format (Constraint Grammar format)
2. Dependency annotation converted to MALT XML format
3. Native constituent tree format (Cross-language VISL standard)
4. Constituent format converted to TIGER XML
Languages
Danish (dan)
ARCADE/ROMANSEVAL corpus
63 Mb
The corpus contains raw data from the JOC corpus developed in the MULTEXT project financed by the European Commission (…
English (eng); French (fre); Italian (ita…
ELRA-W0018
Details
ARCADE/ROMANSEVAL corpus
Name
ARCADE/ROMANSEVAL corpus (ELRA-W0018)
URL
http://catalog.elra.info/product_info.php?products_id=535
Description
The corpus contains raw data from the JOC corpus developed in the MULTEXT project financed by the European Commission (LRE 62-050), composed of 1 million words in English and four Romance languages: French, Italian, Spanish and Portuguese (Written Questions and Answers from the Official Journal of the European Commission). The annotation concerns all the contexts of 60 different test words (20 nouns, 20 adjectives, 20 verbs), i.e. ca. 3,700 contexts altogether. It comprises semantic tagging of all the occurrences of the test words in the JOC corpus for French and Italian, and word-level alignment of all the occurrences of the test words between French and English.
Languages
English (eng)
French (fre)
Italian (ita)
A "scientific" corpus of modern French ("La Recherche" magazine) - Complete version
23 Mb
Produced through funding from ELRA in the framework of the European Commission project LRsP&P…
French (fre)
ELRA-W0025-02
Details
A "scientific" corpus of modern French ("La Recherche" magazine) - Complete version
Name
A "scientific" corpus of modern French ("La Recherche" magazine) - Complete version (ELRA-W0025-02)
URL
http://catalog.elra.info/product_info.php?products_id=595
Description
Produced through funding from ELRA in the framework of the European Commission project LRsP&P (Language Resources Production & Packaging - LE4-8335), the corpus contains all articles published in La Recherche magazine in 1998, from issue 305 (January) to issue 315 (December), which amounts to 447,244 tokens and 30,238 types. Two versions are available: the raw data (XML format) and the complete version (XML and SGML formats).
Languages
French (fre)
Catalan Corpus of News Articles
645 Mb
The Catalan Corpus of News Articles comprises articles in Catalan from 1 January 1999 to 31 March 2007. These articles …
Catalan, Valencian (cat)
ELRA-W0047
Details
Catalan Corpus of News Articles
Name
Catalan Corpus of News Articles (ELRA-W0047)
URL
http://catalog.elra.info/product_info.php?products_id=990
Description
The Catalan Corpus of News Articles comprises articles in Catalan from 1 January 1999 to 31 March 2007. These articles are grouped by trimester, with no chronological ordering within each group.
Languages
Catalan, Valencian (cat)
Catalan-Spanish Parallel Corpus
686 Mb
This corpus contains more than 100 million words drawn from 10 years of bilingual articles from El Periódico de Ca…
Spanish, Castilian (spa); Catalan, Valenc…
ELRA-W0053
Details
Catalan-Spanish Parallel Corpus
Name
Catalan-Spanish Parallel Corpus (ELRA-W0053)
URL
http://catalog.elra.info/product_info.php?products_id=1122
Description
This corpus contains more than 100 million words drawn from 10 years of bilingual articles from El Periódico de Catalunya. The data are aligned at sentence level and stored in text files, one sentence per line. The data are provided in plain text, with no encoding whatsoever.
Languages
Spanish, Castilian (spa)
Catalan, Valencian (cat)
CINTIL-DeepBank
213 Mb
The CINTIL-DeepBank (Branco et al., 2010) is a corpus of sentences annotated with their full-fledged deep grammatical r…
Portuguese (por)
ELRA-W0062
Details
CINTIL-DeepBank
Name
CINTIL-DeepBank (ELRA-W0062)
URL
http://catalog.elra.info/product_info.php?products_id=1181
Description
The CINTIL-DeepBank (Branco et al., 2010) is a corpus of sentences annotated with their full-fledged deep grammatical representations, composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens), and novels (399 sentences; 3,082 tokens). In addition, there are 779 sentences (5,654 tokens) used for regression testing of the computational grammar that supported the annotation of the corpus.
Languages
Portuguese (por)
CINTIL-DependencyBank
1.4 Mb
The CINTIL-DependencyBank (Silva and Branco, 2012) is a corpus of sentences annotated with their syntactic dependency g…
Portuguese (por)
ELRA-W0061
Details
CINTIL-DependencyBank
Name
CINTIL-DependencyBank (ELRA-W0061)
URL
http://catalog.elra.info/product_info.php?products_id=1180
Description
The CINTIL-DependencyBank (Silva and Branco, 2012) is a corpus of sentences annotated with their syntactic dependency graphs and grammatical function tags composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens), novels (399 sentences; 3,082 tokens). In addition, there are 779 sentences (5,654 tokens) that are used for regression testing of the computational grammar that supported the annotation of the corpus.
Languages
Portuguese (por)
CINTIL-PropBank
3.6 Mb
The CINTIL-PropBank is a corpus of sentences annotated with their constituency structure and semantic role tags, compos…
Portuguese (por)
ELRA-W0056
Details
CINTIL-PropBank
Name
CINTIL-PropBank (ELRA-W0056)
URL
http://catalog.elra.info/product_info.php?products_id=1176
Description
The CINTIL-PropBank is a corpus of sentences annotated with their constituency structure and semantic role tags, composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens), and novels (399 sentences; 3,082 tokens). In addition, there are 779 sentences (5,654 tokens) used for regression testing of the computational grammar that supported the annotation of the corpus.
Languages
Portuguese (por)
CINTIL-TreeBank
3.1 Mb
The CINTIL-TreeBank is a corpus of syntactic constituency trees of Portuguese texts composed of 10,039 sentences and 11…
Portuguese (por)
ELRA-W0055
Details
CINTIL-TreeBank
Name
CINTIL-TreeBank (ELRA-W0055)
URL
http://catalog.elra.info/product_info.php?products_id=1174
Description
The CINTIL-TreeBank is a corpus of syntactic constituency trees of Portuguese texts composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens), novels (399 sentences; 3,082 tokens). In addition, there are 779 sentences (5,654 tokens) that are used for regression testing of the computational grammar that supported the annotation of the corpus.
Languages
Portuguese (por)
Corpus of Contemporaneous Spanish Novels
4.8 Mb
This corpus consists of 11 novels written in Castilian Spanish by Inmaculada Ferrer-Vidal Turull, a contemporaneous aut…
Spanish, Castilian (spa)
ELRA-W0041
Details
Corpus of Contemporaneous Spanish Novels
Name
Corpus of Contemporaneous Spanish Novels (ELRA-W0041)
URL
http://catalog.elra.info/product_info.php?products_id=847
Description
This corpus consists of 11 novels written in Castilian Spanish by Inmaculada Ferrer-Vidal Turull, a contemporary author.
Languages
Spanish, Castilian (spa)
CRATER 2 Corpus
359 Mb
The CRATER 2 parallel corpus is an extension of the CRATER corpus, available in the catalogue under reference W0003. It…
English (eng); French (fre); Spanish, Cas…
ELRA-W0033
Details
CRATER 2 Corpus
Name
CRATER 2 Corpus (ELRA-W0033)
URL
http://catalog.elra.info/product_info.php?products_id=636
Description
The CRATER 2 parallel corpus is an extension of the CRATER corpus, available in the catalogue under reference W0003. It consists of 1,500,000 tokens for English and French and of 1,000,000 tokens for Spanish, with morphosyntactical annotations.
CRATER 2 (ref. ELRA-W0033) includes CRATER (ref. ELRA-W0003)
Languages
English (eng)
French (fre)
Spanish, Castilian (spa)
deL1L2IM corpus
2.8 Mb
The deL1L2IM corpus is composed of 72 dialogues, each of them having a duration of 20 to 45 minutes. The whole corpus c…
German (ger)
ELRA-W0083
Details
deL1L2IM corpus
Name
deL1L2IM corpus (ELRA-W0083)
URL
http://catalog.elra.info/product_info.php?products_id=1243
Description
The deL1L2IM corpus is composed of 72 dialogues, each lasting 20 to 45 minutes. The whole corpus contains ca. 52,000 words and 4,800 messages and has a file size of 0.5 Mb. Nine pairs of participants, i.e. nine learners and four native speakers, were required, with 8 dialogues per pair. The interactions have undergone linguistic analysis in which annotation is performed only on repair/correction sequences (incomplete learner error annotation). The corpus is delivered in one written text file (in XML format, customized under TEI P5).
Languages
German (ger)
Dutch PAROLE Distributable Corpus
70 Mb
This Dutch corpus is a 3-million-word selection built according to the specifications of the PAROLE project. Over 250,…
Dutch, Flemish (dut)
ELRA-W0019
Details
Dutch PAROLE Distributable Corpus
Name
Dutch PAROLE Distributable Corpus (ELRA-W0019)
URL
http://catalog.elra.info/product_info.php?products_id=543
Description
This Dutch corpus is a 3-million-word selection built according to the specifications of the PAROLE project. Over 250,000 words of corpus texts (with TEI markup suppressed) have been PoS-tagged automatically. A total of 59,798 running words have been manually corrected and checked.
Languages
Dutch, Flemish (dut)
ECI-ELSNET Italian & German tagged sub-corpus
3 Mb
The data is extracted from the ECI corpus (the German Frankfurter Rundschau part) and the Italian corpus of ILC/CNR. It…
German (ger); Italian (ita)
ELRA-W0005
Details
ECI-ELSNET Italian & German tagged sub-corpus
Name
ECI-ELSNET Italian & German tagged sub-corpus (ELRA-W0005)
URL
http://catalog.elra.info/product_info.php?products_id=86
Description
The data is extracted from the ECI corpus (the German Frankfurter Rundschau part) and the Italian corpus of ILC/CNR. It contains the following domains: Economy (17,000 words), Politics (14,000 words), Culture (18,000 words), Sports (9,000 words), Local Events (8,500 words).
Languages
German (ger)
Italian (ita)
ECI/MCI (European Corpus Initiative/Multilingual Corpus I)
655 Mb
Over 98 million words, covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese, M…
Turkish (tur); Albanian (alb); Bulgarian …
ELRA-W0004
Details
ECI/MCI (European Corpus Initiative/Multilingual Corpus I)
Name
ECI/MCI (European Corpus Initiative/Multilingual Corpus I) (ELRA-W0004)
URL
http://catalog.elra.info/product_info.php?products_id=85
Description
Over 98 million words, covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese, Malay and more.
Languages
Turkish (tur)
Albanian (alb)
Bulgarian (bul)
Chinese (chi)
Czech (cze)
Dutch, Flemish (dut)
English (eng)
Estonian (est)
French (fre)
Gaelic, Scottish Gaelic (gla)
German (ger)
Greek, Modern (1453-) (gre)
Italian (ita)
Japanese (jpn)
Latin (lat)
Lithuanian (lit)
Malay (may)
Spanish, Castilian (spa)
Serbian (scc)
Danish (dan)
Russian (rus)
Norwegian (nor)
Uzbek (uzb)
Portuguese (por)
Swedish (swe)
English-Nepali Parallel Corpus
47 Mb
This corpus consists of a collection of national development texts in English and Nepali. A small set of data is aligne…
English (eng); Nepali (nep)
ELRA-W0077
Details
English-Nepali Parallel Corpus
Name
English-Nepali Parallel Corpus (ELRA-W0077)
URL
http://catalog.elra.info/product_info.php?products_id=1217
Description
This corpus consists of a collection of national development texts in English and Nepali. A small set of data is aligned at the sentence level (27,060 English words; 21,756 Nepali words), and a larger set of texts at the document level (617,340 English words; 596,571 Nepali words). An additional set of monolingual data in Nepali is also provided (386,879 words in Nepali).
Languages
English (eng)
Nepali (nep)
English-Persian parallel Corpus
40 Mb
Please refer to ELRA-W0118 for the latest version of this corpus. This version consists of about 3,500,000 English and …
English (eng); Persian (per)
ELRA-W0051
Details
English-Persian parallel Corpus
Name
English-Persian parallel Corpus (ELRA-W0051)
URL
http://catalog.elra.info/product_info.php?products_id=1111
Description
Please refer to ELRA-W0118 for the latest version of this corpus. This version consists of about 3,500,000 English and Persian (Farsi) words aligned at sentence level (about 100,000 sentences). The files are encoded in Unicode. The corpus was originally created with SQL Server, but it is distributed as an Access file.
Languages
English (eng)
Persian (per)
EUROPARL Corpus Parallel Corpora: Portuguese-English
2.3 Gb
The Portuguese-English subpart of the EUROPARL Corpus was extracted from the proceedings of the European Parliament. It…
English (eng); Portuguese (por)
ELRA-W0090
Details
EUROPARL Corpus Parallel Corpora: Portuguese-English
Name
EUROPARL Corpus Parallel Corpora: Portuguese-English (ELRA-W0090)
URL
http://catalog.elra.info/product_info.php?products_id=1257
Description
The Portuguese-English subpart of the EUROPARL Corpus was extracted from the proceedings of the European Parliament. It contains approximately 58,324,562 tokens of European Portuguese (L1) and 49,216,896 tokens of English (translation). It is composed of one text file for the English corpus and two files for the Portuguese version: a text file and an annotated file, containing a PoS tag and a lemma for each token.
Languages
English (eng)
Portuguese (por)
GeFRePaC - German French Reciprocal Parallel Corpus
1.3 Gb
GeFRePac was produced in the framework of the LRsP&P project. It cont…
French (fre); German (ger)
ELRA-W0031
Details
GeFRePaC - German French Reciprocal Parallel Corpus
Name
GeFRePaC - German French Reciprocal Parallel Corpus (ELRA-W0031)
URL
http://catalog.elra.info/product_info.php?products_id=633
Description
GeFRePac was produced in the framework of the LRsP&P project. It contains 30 million words (15 million for each language) for the purpose of developing, enhancing and improving translation aids.
Languages
French (fre)
German (ger)
ICE-GB (British English component of the International Corpus of English)
97 Mb
British component of the International Corpus of English (ICE), ICE-GB consists of a million words (83,394 parse trees,…
English (eng)
ELRA-W0021
Details
ICE-GB (British English component of the International Corpus of English)
Name
ICE-GB (British English component of the International Corpus of English) (ELRA-W0021)
URL
http://catalog.elra.info/product_info.php?products_id=762
Description
British component of the International Corpus of English (ICE), ICE-GB consists of a million words (83,394 parse trees, including 59,640 in the spoken part of the corpus) extracted from 200 written and 300 spoken English texts. It is fully grammatically annotated and has been fully checked. ICE-GB is distributed with the retrieval software ICECUP (the International Corpus of English Corpus Utility Program).
Languages
English (eng)
ILSP/ELEFTHEROTYPIA Corpus (Greek corpus)
27 Mb
This corpus contains approximately 3 million words from the daily newspaper ELEFTHEROTYPIA, classified and annotated ac…
Greek, Modern (1453-) (gre)
ELRA-W0022
Details
ILSP/ELEFTHEROTYPIA Corpus (Greek corpus)
Name
ILSP/ELEFTHEROTYPIA Corpus (Greek corpus) (ELRA-W0022)
URL
http://catalog.elra.info/product_info.php?products_id=763
Description
This corpus contains approximately 3 million words from the daily newspaper ELEFTHEROTYPIA, classified and annotated according to the common core PAROLE encoding standard. The corpus is distributed as SGML files. A subset of the corpus (250,000 words) is morpho-syntactically tagged; all the words are also lemmatised and checked.
Languages
Greek, Modern (1453-) (gre)
Italian Syntactic-Semantic Treebank (ISST)
90 Mb
ISST comprises 89,941 tokens for the financial-domain part and 215,606 tokens for the general part. It is formatted in …
Italian (ita)
ELRA-W0044
Details
Italian Syntactic-Semantic Treebank (ISST)
Name
Italian Syntactic-Semantic Treebank (ISST) (ELRA-W0044)
URL
http://catalog.elra.info/product_info.php?products_id=887
Description
ISST comprises 89,941 tokens for the financial-domain part and 215,606 tokens for the general part. It is formatted in XML. This treebank has a five-level structure covering orthographic, morpho-syntactic, syntactic, semantic and lexico-semantic levels of linguistic description. Syntactic annotation is distributed over two different levels: the constituent structure level and the functional relations level. The fifth level deals with lexico-semantic annotation, which is carried out in terms of sense tagging of lexical heads (nouns, verbs and adjectives) augmented with other types of semantic information; ItalWordNet (see ELRA-M0018) is the reference lexical resource used for the sense tagging task. Both syntactic and lexico-semantic annotations refer to the morpho-syntactically annotated text, which in turn is linked to the orthographic file with the text and mark-up of macrotextual organisation (e.g. titles, subtitles, summary, body of article, paragraphs).
Languages
Italian (ita)
Karl May Korpus (KMK)
77 Mb
Karl-May-Korpus is a German monolingual corpus, available in an SGML-tagged ASCII text format. It contains the works of…
German (ger)
ELRA-W0016
Details
Karl May Korpus (KMK)
Name
Karl May Korpus (KMK) (ELRA-W0016)
URL
http://catalog.elra.info/product_info.php?products_id=450
Description
Karl-May-Korpus is a German monolingual corpus, available in an SGML-tagged ASCII text format. It contains the works of the German author Karl May and consists of around 1.6 million words (divided into 9 sub-corpora of about 180,000 words each).
Languages
German (ger)
Khresmoi manually annotated reference corpus
1.3 Gb
This corpus is a collection of Khresmoi English web documents annotated with key entities (such as disease, drug). The …
English (eng)
ELRA-W0081
Details
Khresmoi manually annotated reference corpus
Name
Khresmoi manually annotated reference corpus (ELRA-W0081)
URL
http://catalog.elra.info/product_info.php?products_id=1237
Description
This corpus is a collection of Khresmoi English web documents annotated with key entities (such as disease, drug). The corpus is divided into two parts:
1. The initial corpus: 625 documents from the Genetics Home Reference data set, automatically annotated with anatomical locations and diseases, and manually corrected by 3-4 annotators. Size of documents: between 26 and 8,306 tokens each.
2. The main corpus: 6,950 English documents from the Khresmoi crawl and 5,518 English Wikipedia pages, automatically annotated through the GATE Platform for Anatomy, Disease, Drug and Investigation. Size of documents: between 200 and 2,000 tokens each.
The corpus uses the GATE XML format.
Languages
English (eng)
"Le Monde Diplomatique" Arabic tagged corpus
59 Mb
This corpus contains 102,960 vowelised, lemmatised and tagged words (58 texts from Le Monde Diplomatique Arabic, see al…
Arabic (ara)
ELRA-W0049
Details
"Le Monde Diplomatique" Arabic tagged corpus
Name
"Le Monde Diplomatique" Arabic tagged corpus (ELRA-W0049)
URL
http://catalog.elra.info/product_info.php?products_id=1096
Description
This corpus contains 102,960 vowelised, lemmatised and tagged words (58 texts from Le Monde Diplomatique Arabic, see also ELRA-W0036-04). Three files are associated with each text: the raw text in Arabic, the vowelised text in Arabic, and one XML file containing the morphological annotation of the text.
Languages
Arabic (ara)
"Le Monde Diplomatique" Text corpus in Arabic
57 Mb
Electronic archiving of "Le Monde Diplomatique" articles in Arabic from 2000. The corpus is available in HTML. Each HTM…
Arabic (ara)
ELRA-W0036-04
Details
"Le Monde Diplomatique" Text corpus in Arabic
Name
"Le Monde Diplomatique" Text corpus in Arabic (ELRA-W0036-04)
URL
http://catalog.elra.info/product_info.php?products_id=717
Description
Electronic archiving of "Le Monde Diplomatique" articles in Arabic from 2000. The corpus is available in HTML. Each HTML file contains one article.
Languages
Arabic (ara)
"Le Monde Diplomatique" Text corpus in English
28 Mb
Electronic archiving of "Le Monde Diplomatique" articles in English from 1999. The corpus is available in HTML. Each HT…
English (eng)
ELRA-W0036-03
Details
"Le Monde Diplomatique" Text corpus in English
Name
"Le Monde Diplomatique" Text corpus in English (ELRA-W0036-03)
URL
http://catalog.elra.info/product_info.php?products_id=8
Description
Electronic archiving of "Le Monde Diplomatique" articles in English from 1999. The corpus is available in HTML. Each HTML file contains one article.
Languages
English (eng)
"Le Monde Diplomatique" Text corpus in French - archives 1980-1998
233 Mb
Electronic archiving of "Le Monde Diplomatique" articles in French from 1980 to 1998. The corpus is available in HTML. …
French (fre)
ELRA-W0036-01
Details
"Le Monde Diplomatique" Text corpus in French - archives 1980-1998
Name
"Le Monde Diplomatique" Text corpus in French - archives 1980-1998 (ELRA-W0036-01)
URL
http://catalog.elra.info/product_info.php?products_id=7
Description
Electronic archiving of "Le Monde Diplomatique" articles in French from 1980 to 1998. The corpus is available in HTML. Each HTML file contains one article.
Languages
French (fre)
"Le Monde Diplomatique" Text corpus in French - archives from 1999
90 Mb
Electronic archiving of "Le Monde Diplomatique" articles in French from 1999. The corpus is available in HTML. Each HTM…
French (fre)
ELRA-W0036-02
Details
"Le Monde Diplomatique" Text corpus in French - archives from 1999
Name
"Le Monde Diplomatique" Text corpus in French - archives from 1999 (ELRA-W0036-02)
URL
http://catalog.elra.info/product_info.php?products_id=9
Description
Electronic archiving of "Le Monde Diplomatique" articles in French from 1999. The corpus is available in HTML. Each HTML file contains one article.
Languages
French (fre)
LT Corpus
43 Mb
The LT Corpus is composed of 70 fiction texts by renowned Portuguese authors. The corpus contains 1,781,083 tokens. T…
Portuguese (por)
ELRA-W0059
Details
LT Corpus
Name
LT Corpus (ELRA-W0059)
URL
http://catalog.elra.info/product_info.php?products_id=1178
Description
The LT Corpus is composed of 70 fiction texts by renowned Portuguese authors. The corpus contains 1,781,083 tokens. The texts date from before 1940. The corpus is delivered in one file, in two different formats. The txt version has one sentence per line, an identification number for each text and no further annotation. The cqpweb file has one token per line, followed by its PoS tag and lemma, and is annotated for NP chunks.
Languages
Portuguese (por)
MLCC Multilingual and Parallel Corpora
915 Mb
The first set contains articles from 6 European newspapers: Het Financieele Dagblad (Dutch, 8.5 million words), The Fin…
Dutch, Flemish (dut); English (eng); Fren…
ELRA-W0023
Details
MLCC Multilingual and Parallel Corpora
Name
MLCC Multilingual and Parallel Corpora (ELRA-W0023)
URL
http://catalog.elra.info/product_info.php?products_id=764
Description
The first set contains articles from 6 European newspapers: Het Financieele Dagblad (Dutch, 8.5 million words), The Financial Times (English, 30 million words), Le Monde (French, 10 million words), Handelsblatt (German, 33 million words), Il Sole 24 Ore (Italian, 1.88 million words), Expansion (Spanish, 10 million words).
The second set consists of a parallel corpus of translated data in the nine European official languages (1992-1994) divided into 2 sub-corpora: written questions (10.2 million words) and parliamentary debates (5 to 8 million words per language).
Languages
Dutch, Flemish (dut)
English (eng)
French (fre)
German (ger)
Italian (ita)
Spanish, Castilian (spa)
Modern French Corpus including Anaphors Tagging
13 Mb
This modern French corpus contains over 1 million words with a tagging of the anaphors, and covers many different aspect…
French (fre)
ELRA-W0032
Details
Modern French Corpus including Anaphors Tagging
Name
Modern French Corpus including Anaphors Tagging (ELRA-W0032)
URL
http://catalog.elra.info/product_info.php?products_id=634
Description
This modern French corpus contains over 1 million words with a tagging of the anaphors, and covers many different aspects of the French language (scientific and human sciences articles, extracts from newspapers and magazines, legal texts, etc.). The annotation scheme was defined in XML.
Languages
French (fre)
Monolingual Greek corpus
5.1 Mb
Corpus of 1 million words consisting of articles written in 1996 from the Greek daily newspaper ELEFTHEROTYPIA.
Greek, Modern (1453-) (gre)
ELRA-W0014
MTP Annotated German corpus - tagged version
35 Mb
A 500,000-word German corpus of SGML-formatted texts from two German newspapers, the Frankfurter Allgemeine Zeitung an…
German (ger)
ELRA-W0008-02
Details
MTP Annotated German corpus - tagged version
Name
MTP Annotated German corpus - tagged version (ELRA-W0008-02)
URL
http://catalog.elra.info/product_info.php?products_id=480
Description
A 500,000-word German corpus of SGML-formatted texts from two German newspapers, the Frankfurter Allgemeine Zeitung and Die Zeit, for the years 1990 to 1992.
Languages
German (ger)
MTP Annotated German corpus - untagged version
283 Mb
A 500,000-word German corpus of SGML-formatted texts from two German newspapers, the Frankfurter Allgemeine Zeitung an…
German (ger)
ELRA-W0008-01
Details
MTP Annotated German corpus - untagged version
Name
MTP Annotated German corpus - untagged version (ELRA-W0008-01)
URL
http://catalog.elra.info/product_info.php?products_id=47
Description
A 500,000-word German corpus of SGML-formatted texts from two German newspapers, the Frankfurter Allgemeine Zeitung and Die Zeit, for the years 1990 to 1992.
Languages
German (ger)
MULTEXT JOC Corpus
114 Mb
This CD-ROM contains a part of the corpus developed in the MULTEXT project financed by the European Commission (LRE 62-…
English (eng); French (fre); German (ger)…
ELRA-W0017
Details
MULTEXT JOC Corpus
Name
MULTEXT JOC Corpus (ELRA-W0017)
URL
http://catalog.elra.info/product_info.php?products_id=534
Description
This CD-ROM contains a part of the corpus developed in the MULTEXT project financed by the European Commission (LRE 62-050). This part contains raw, tagged and aligned data from the Written Questions and Answers of the Official Journal of the European Community. The corpus contains ca. 5 million words in English, French, German, Italian and Spanish (ca. 1 million words per language). About 800,000 words were grammatically tagged and manually checked for English, French, Italian and Spanish, i.e. roughly 200,000 words per language. The same subset for French, German, Italian and Spanish was aligned to English at the sentence level.
Languages
English (eng)
French (fre)
German (ger)
Italian (ita)
Spanish, Castilian (spa)
Multilingual Corpus
9.9 Mb
Multilingual parallel corpus produced by Kaist Korterm containing 60 000 expressions in Korean, Chinese and English.
Chinese (chi); English (eng); Korean (kor…
ELRA-W0035
NE3L named entities Arabic corpus
3 Mb
The Arabic corpus contains 103,363 words coming from articles extracted from Le Monde Diplomatique newspaper, and publi…
Arabic (ara)
ELRA-W0078
Name
NE3L named entities Arabic corpus (ELRA-W0078)
URL
http://catalog.elra.info/product_info.php?products_id=1226
Description
The Arabic corpus contains 103,363 words coming from articles extracted from Le Monde Diplomatique newspaper, and published in 2004. 2 named entity categories were taken into account: Time and Amount.
Languages
Arabic (ara)
NE3L named entities Chinese corpus
4.8 Mb
The Chinese corpus contains 79,302 words coming from articles extracted from Le Monde Diplomatique newspaper, and publi…
Chinese (chi)
ELRA-W0079
Name
NE3L named entities Chinese corpus (ELRA-W0079)
URL
http://catalog.elra.info/product_info.php?products_id=1227
Description
The Chinese corpus contains 79,302 words coming from articles extracted from Le Monde Diplomatique newspaper, and published in 2001. 3 named entity categories were taken into account: Person, Place and Organisation.
Languages
Chinese (chi)
NE3L named entities Russian corpus
2.7 Mb
The Russian corpus contains 75,784 words coming from articles extracted from Izvestia newspaper, and published in 1995.…
Russian (rus)
ELRA-W0080
Name
NE3L named entities Russian corpus (ELRA-W0080)
URL
http://catalog.elra.info/product_info.php?products_id=1228
Description
The Russian corpus contains 75,784 words coming from articles extracted from Izvestia newspaper, and published in 1995. 2 named entity categories were taken into account: Time and Amount.
Languages
Russian (rus)
NEMLAR Written Corpus
136 Mb
The NEMLAR Written Corpus consists of about 500,000 words of Arabic text from 13 different categories. The corpus is pr…
Arabic (ara)
ELRA-W0042
Name
NEMLAR Written Corpus (ELRA-W0042)
URL
http://catalog.elra.info/product_info.php?products_id=873
Description
The NEMLAR Written Corpus consists of about 500,000 words of Arabic text from 13 different categories. The corpus is provided in 4 different versions: raw text, fully vowelized text, text with Arabic lexical analysis, text with Arabic POS-tags.
Languages
Arabic (ara)
Nepali Monolingual written corpus
683 Mb
The Nepali Monolingual written corpus comprises the core corpus (core sample) and the general corpus. The core sample (…
Nepali (nep)
ELRA-W0076
Name
Nepali Monolingual written corpus (ELRA-W0076)
URL
http://catalog.elra.info/product_info.php?products_id=1216
Description
The Nepali Monolingual written corpus comprises the core corpus (core sample) and the general corpus. The core sample (CS) represents the collection of Nepali written texts from 15 different genres with 2000 words each published between 1990 and 1992. It is based on FLOB/FROWN corpora and contains 802,000 words. The general corpus (GC) consists of written texts collected opportunistically from a wide range of sources such as websites, newspapers, books, publishers and authors. It contains 1,400,000 words.
Languages
Nepali (nep)
NPChunks
412 Kb
NPChunks is a training corpus containing approximately 1,000 sentences, with a total of 24,243 tokens, selected randoml…
Portuguese (por)
ELRA-W0089
Name
NPChunks (ELRA-W0089)
URL
http://catalog.elra.info/product_info.php?products_id=1256
Description
NPChunks is a training corpus containing approximately 1,000 sentences, with a total of 24,243 tokens, selected randomly from the written part of the CINTIL corpus. The corpus is PoS-annotated at token level, including punctuation. Noun Phrases were annotated with specific tags. It was automatically PoS-tagged with MBT tagger, and lemmatized with MBLEM, following the annotation scheme of the Corpus of Reference of Contemporary Portuguese.
Languages
Portuguese (por)
PANACEA English-French and English-Greek parallel corpus acquired for Environment domain
11 Mb
This package consists of an English-French and English-Greek sentence-aligned parallel corpus from the Environment doma…
English (eng); French (fre)
…
ELRA-W0057
Name
PANACEA English-French and English-Greek parallel corpus acquired for Environment domain (ELRA-W0057)
URL
http://catalog.elra.info/product_info.php?products_id=1182
Description
This package consists of an English-French and English-Greek sentence-aligned parallel corpus from the Environment domain automatically acquired from the web during 2010 and 2011. It was acquired in the framework of the PANACEA project. Data and language pairs are split into training, test and development test sets.
Languages
English (eng)
French (fre)
PANACEA English-French and English-Greek parallel corpus acquired for Labour Legislation domain
16 Mb
This package consists of an English-French and English-Greek sentence-aligned parallel corpus from the Labour Legislati…
English (eng); Greek, Modern (1453-) (gre…
ELRA-W0058
Name
PANACEA English-French and English-Greek parallel corpus acquired for Labour Legislation domain (ELRA-W0058)
URL
http://catalog.elra.info/product_info.php?products_id=1183
Description
This package consists of an English-French and English-Greek sentence-aligned parallel corpus from the Labour Legislation domain automatically acquired from the web during 2010 and 2011. It was acquired in the framework of the PANACEA project. Data and language pairs are split into training, test and development test sets.
Languages
English (eng)
Greek, Modern (1453-) (gre)
PANACEA Environment English monolingual corpus
2.7 Gb
This corpus consists of documents that were acquired from the web, were automatically detected to be in the English lan…
English (eng)
ELRA-W0063
Name
PANACEA Environment English monolingual corpus (ELRA-W0063)
URL
http://catalog.elra.info/product_info.php?products_id=1184
Description
This corpus consists of documents that were acquired from the web, were automatically detected to be in the English language and were automatically classified as relevant to the Environment domain. It was constructed in the summer of 2011. It contains 50,541,538 tokens, divided into a total of 28,071 documents that were crawled from 3,121 web sites.
Languages
English (eng)
PANACEA Environment French monolingual corpus
2.1 Gb
This corpus consists of documents that were acquired from the web, were automatically detected to be in the French lang…
French (fre)
ELRA-W0065
Name
PANACEA Environment French monolingual corpus (ELRA-W0065)
URL
http://catalog.elra.info/product_info.php?products_id=1186
Description
This corpus consists of documents that were acquired from the web, were automatically detected to be in the French language and were automatically classified as relevant to the Environment domain. It was constructed in the summer of 2011. It contains 47,364,125 tokens, divided into a total of 23,514 documents that were crawled from 1,969 web sites.
Languages
French (fre)
PANACEA Environment Greek monolingual corpus
2 Gb
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Greek langu…
Greek, Modern (1453-) (gre)
…
ELRA-W0067
Name
PANACEA Environment Greek monolingual corpus (ELRA-W0067)
URL
http://catalog.elra.info/product_info.php?products_id=1188
Description
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Greek language and were automatically classified as relevant to the Environment domain. It was constructed in the summer of 2011. It contains 27,958,530 tokens, divided into a total of 16,073 documents that were crawled from 1,063 web sites.
Languages
Greek, Modern (1453-) (gre)
PANACEA Environment Italian monolingual corpus
1.8 Gb
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Italian lan…
Italian (ita)
ELRA-W0069
Name
PANACEA Environment Italian monolingual corpus (ELRA-W0069)
URL
http://catalog.elra.info/product_info.php?products_id=1190
Description
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Italian language and were automatically classified as relevant to the Environment domain. It was constructed in the summer of 2011. It contains 40,044,852 tokens, divided into a total of 16,159 documents that were crawled from 1,211 web sites.
Languages
Italian (ita)
PANACEA Environment Spanish monolingual corpus
2.3 Gb
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Spanish lan…
Spanish, Castilian (spa)
ELRA-W0071
Name
PANACEA Environment Spanish monolingual corpus (ELRA-W0071)
URL
http://catalog.elra.info/product_info.php?products_id=1192
Description
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Spanish language and were automatically classified as relevant to the Environment domain. It was constructed in the summer of 2011. It contains 46,225,624 tokens, divided into a total of 26,009 documents that were crawled from 2,053 web sites.
Languages
Spanish, Castilian (spa)
PANACEA Labour English monolingual corpus
1.6 Gb
This corpus consists of documents that were acquired from the web, were automatically detected to be in the English lan…
English (eng)
ELRA-W0064
Name
PANACEA Labour English monolingual corpus (ELRA-W0064)
URL
http://catalog.elra.info/product_info.php?products_id=1185
Description
This corpus consists of documents that were acquired from the web, were automatically detected to be in the English language and were automatically classified as relevant to the Labour Legislation domain. It was constructed in the summer of 2011. It contains 46,431,351 tokens, divided into a total of 15,197 documents that were crawled from 1,558 web sites.
Languages
English (eng)
PANACEA Labour French monolingual corpus
2.5 Gb
This corpus consists of documents that were acquired from the web, were automatically detected to be in the French lang…
French (fre)
ELRA-W0066
Name
PANACEA Labour French monolingual corpus (ELRA-W0066)
URL
http://catalog.elra.info/product_info.php?products_id=1187
Description
This corpus consists of documents that were acquired from the web, were automatically detected to be in the French language and were automatically classified as relevant to the Labour Legislation domain. It was constructed in the summer of 2011. It contains 56,440,425 tokens, divided into a total of 26,675 documents that were crawled from 1,391 web sites.
Languages
French (fre)
PANACEA Labour Greek monolingual corpus
1.4 Gb
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Greek langu…
Greek, Modern (1453-) (gre)
…
ELRA-W0068
Name
PANACEA Labour Greek monolingual corpus (ELRA-W0068)
URL
http://catalog.elra.info/product_info.php?products_id=1189
Description
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Greek language and were automatically classified as relevant to the Labour Legislation domain. It was constructed in the summer of 2011. It contains 21,077,196 tokens, divided into a total of 7,124 documents that were crawled from 598 web sites.
Languages
Greek, Modern (1453-) (gre)
PANACEA Labour Italian monolingual corpus
2.4 Gb
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Italian lan…
Italian (ita)
ELRA-W0070
Name
PANACEA Labour Italian monolingual corpus (ELRA-W0070)
URL
http://catalog.elra.info/product_info.php?products_id=1191
Description
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Italian language and were automatically classified as relevant to the Labour Legislation domain. It was constructed in the summer of 2011. It contains 70,563,320 tokens, divided into a total of 12,706 documents that were crawled from 864 web sites.
Languages
Italian (ita)
PANACEA Labour Spanish monolingual corpus
1.9 Gb
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Spanish lan…
Spanish, Castilian (spa)
ELRA-W0072
Name
PANACEA Labour Spanish monolingual corpus (ELRA-W0072)
URL
http://catalog.elra.info/product_info.php?products_id=1193
Description
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Spanish language and were automatically classified as relevant to the Labour Legislation domain. It was constructed in the summer of 2011. It contains 53,922,118 tokens, divided into a total of 13,188 documents that were crawled from 1,015 web sites.
Languages
Spanish, Castilian (spa)
PAROLE French Corpus
349 Mb
The PAROLE French corpus contains the following data:
Miscellaneous: Data provided by ELRA (CRATER, MLCC Multilingual …
French (fre)
ELRA-W0020
Name
PAROLE French Corpus (ELRA-W0020)
URL
http://catalog.elra.info/product_info.php?products_id=565
Description
The PAROLE French corpus contains the following data:
Miscellaneous: Data provided by ELRA (CRATER, MLCC Multilingual and Parallel Corpora) 2 025 964 words
Books: CNRS Editions 3 267 409 words
Periodicals: CNRS Info, Hermès 942 963 words
Newspapers: Le Monde, provided by ELRA 13 856 763 words
Total 20 093 099 words
Languages
French (fre)
PAROLE Irish Distributable Corpus
25 Mb
This corpus consists of over 8 million words. The text is marked up in accordance with the PAROLE encoding standard. All…
Irish (gle)
ELRA-W0026
Name
PAROLE Irish Distributable Corpus (ELRA-W0026)
URL
http://catalog.elra.info/product_info.php?products_id=597
Description
This corpus consists of over 8 million words. The text is marked up in accordance with the PAROLE encoding standard. All the files are in SGML format with a detailed header and the body of the text tagged to paragraph level. A subset of the corpus is morpho-syntactically tagged. Included in this distribution are approximately 3,000 manually checked words.
Languages
Irish (gle)
PAROLE Italian Corpus
44 Mb
The PAROLE Italian Corpus comprises 3,135,651 words collected from four different domains: newspapers (2,179,800 words)…
Italian (ita)
ELRA-W0043
Name
PAROLE Italian Corpus (ELRA-W0043)
URL
http://catalog.elra.info/product_info.php?products_id=886
Description
The PAROLE Italian Corpus comprises 3,135,651 words collected from four different domains: newspapers (2,179,800 words), periodicals (143,810 words), books (564,964 words), miscellaneous (247,077 words). About 250,000 words were morphosyntactically annotated and lemmatized.
Languages
Italian (ita)
PAROLE Portuguese Corpus - complete version
57 Mb
The parole Portuguese corpus contains approximately 3 million running words of European Portuguese distributed by Mediu…
Portuguese (por)
ELRA-W0024-01
Name
PAROLE Portuguese Corpus - complete version (ELRA-W0024-01)
URL
http://catalog.elra.info/product_info.php?products_id=765
Description
The PAROLE Portuguese corpus contains approximately 3 million running words of European Portuguese distributed by medium (Newspaper, Book, Periodical, Miscellaneous).
The corpus was classified and encoded according to the common core PAROLE encoding standard. The file format of this corpus is SGML.
Also available is a subcorpus of about 250,000 morpho-syntactically tagged words. Disambiguation was manually checked.
Languages
Portuguese (por)
Persian 1984 corpus (Multext-East framework)
5.9 Mb
This corpus contains the Persian (Farsi) translation of a part of the novel 1984 (G. Orwell) annotated in the Multext-E…
Persian (per)
ELRA-W0054
Name
Persian 1984 corpus (Multext-East framework) (ELRA-W0054)
URL
http://catalog.elra.info/product_info.php?products_id=1124
Description
This corpus contains the Persian (Farsi) translation of a part of the novel 1984 (G. Orwell) annotated in the Multext-East framework (Multilingual Text Tools and Corpora for Eastern and Central European Languages). The corpus contains approximately 100,000 words (6,604 sentences, 13,247 lemmas), with extensive headers and markup for document structure, sentences, and various sub-sentence annotations in the XML-format following the TEI guidelines. Annotation includes POS (part-of-speech) and lemmas.
Languages
Persian (per)
PRESS 65
6.3 Mb
Over 1 million running words taken from Swedish newspapers from 1965.
Swedish (swe)
ELRA-W0010
PTPARL Corpus
25 Mb
The PTPARL Corpus contains 1,076 texts consisting of adapted transcriptions of the Portuguese Parliament sessions. The …
Portuguese (por)
ELRA-W0060
Name
PTPARL Corpus (ELRA-W0060)
URL
http://catalog.elra.info/product_info.php?products_id=1179
Description
The PTPARL Corpus contains 1,076 texts consisting of adapted transcriptions of the Portuguese Parliament sessions. The corpus contains 1,000,441 tokens. The corpus is delivered in one file, in two different formats. The txt version has one sentence per line, an identification number for each text and no further annotation. The cqpweb file is one token per line, followed by pos tag and lemma, and is annotated for NP chunks.
Languages
Portuguese (por)
Quaero Old Press Extended Named Entity corpus
6.8 Gb
This corpus consists of the manual annotation of 76 newspaper issues published in 1890-1891 and provided by the French …
French (fre)
ELRA-W0073
Name
Quaero Old Press Extended Named Entity corpus (ELRA-W0073)
URL
http://catalog.elra.info/product_info.php?products_id=1194
Description
This corpus consists of the manual annotation of 76 newspaper issues published in 1890-1891 and provided by the French National Library (Bibliothèque Nationale de France). Three different titles are used (Le Temps, La Croix and Le Figaro) for a total of 295 pages. The corpus is fully manually annotated according to the Quaero extended and structured named entity definition.
Languages
French (fre)
Qualified POS Tagged Corpus
66 Mb
Monolingual corpus in a .txt format, produced by KAIST KORTERM, containing 1020000 eojeols (Korean terms) in Korean. Th…
Korean (kor)
ELRA-W0034
Name
Qualified POS Tagged Corpus (ELRA-W0034)
URL
http://catalog.elra.info/product_info.php?products_id=654
Description
Monolingual corpus in a .txt format, produced by KAIST KORTERM, containing 1020000 eojeols (Korean terms) in Korean. This corpus is morphologically analyzed, POS tagged, and rectified 3 times by specialists.
Languages
Korean (kor)
ROCO Romanian journalistic corpus
729 Mb
ROCO is a Romanian journalistic corpus containing approximately 7.1 million tokens, the number of types being 231,626. …
Romanian (rum)
ELRA-W0085
Name
ROCO Romanian journalistic corpus (ELRA-W0085)
URL
http://catalog.elra.info/product_info.php?products_id=1249
Description
ROCO is a Romanian journalistic corpus containing approximately 7.1 million tokens, the number of types being 231,626. It is rich in proper names, numerals and named entities. The corpus has been lemmatized and PoS annotated following the Multext-East morphosyntactic specifications, and it is XML encoded.
Languages
Romanian (rum)
ROMBAC - Romanian balanced corpus
1.1 Gb
ROMBAC is a Romanian corpus containing equal shares of texts from 5 different genres: journalism, legalese, fiction, me…
Romanian (rum)
ELRA-W0088
Name
ROMBAC - Romanian balanced corpus (ELRA-W0088)
URL
http://catalog.elra.info/product_info.php?products_id=1253
Description
ROMBAC is a Romanian corpus containing equal shares of texts from 5 different genres: journalism, legalese, fiction, medicine and biographical data for Romanian literary personalities. The entire corpus counts around 41,000,000 words, including punctuation. The corpus is annotated at paragraph, sentence, constituent group and word levels, and it provides morpho-syntactic information (MSD). It is XML encoded.
Languages
Romanian (rum)
Tagged text in French (MEMODATA) with rules of morphological disambiguation
3.1 Gb
More than 170 books (classical novels, legal texts...) are tagged with rules of morphological disambiguation. A tagged …
French (fre)
ELRA-W0012
Name
Tagged text in French (MEMODATA) with rules of morphological disambiguation (ELRA-W0012)
URL
http://catalog.elra.info/product_info.php?products_id=50
Description
More than 170 books (classical novels, legal texts...) are tagged with rules of morphological disambiguation. A tagged corpus of 50 books is available for research. It includes works by several 19th-century authors (Balzac, Hugo, Stendhal).
See also W0011.
Languages
French (fre)
Tagged text in French (MEMODATA) with typographic tags
247 Mb
Over 170 (tagged) French books (classical novels, legal texts) with typographic tags. Another tagged corpus of 50 books…
French (fre)
ELRA-W0011
Name
Tagged text in French (MEMODATA) with typographic tags (ELRA-W0011)
URL
http://catalog.elra.info/product_info.php?products_id=49
Description
Over 170 (tagged) French books (classical novels, legal texts) with typographic tags. Another tagged corpus of 50 books is available for research only. The books are by 19th-century authors.
See also W0012.
Languages
French (fre)
Text corpus of "Le Monde" (1987-2012)
3.9 Gb
Corpus from "Le Monde" newspaper. Each year contains some 10 Mbytes of data per month (circa 120 Mbytes per year). Data…
French (fre)
ELRA-W0015
Name
Text corpus of "Le Monde" (1987-2012) (ELRA-W0015)
URL
http://catalog.elra.info/product_info.php?products_id=438
Description
Corpus from "Le Monde" newspaper. Each year contains some 10 Mbytes of data per month (circa 120 Mbytes per year). Data ranging from 1987 until 2012 are available (total 1,199,143 articles).
Languages
French (fre)
The CINTIL Corpus International Corpus of Portuguese
20 Mb
CINTIL-Corpus Internacional do Português is a linguistically interpreted written and spoken corpus of European Portugue…
Portuguese (por)
ELRA-W0050
Name
The CINTIL Corpus International Corpus of Portuguese (ELRA-W0050)
URL
http://catalog.elra.info/product_info.php?products_id=1102
Description
CINTIL-Corpus Internacional do Português is a linguistically interpreted written and spoken corpus of European Portuguese. It is composed of one million annotated tokens, each of which was verified by human expert annotators. The annotation comprises information on part-of-speech, open class lemma and inflection, multi-word expressions pertaining to the class of adverbs and to the closed POS classes, and multi-word proper names (for named entity recognition). The corpus is developed over raw textual materials of several types, of which 30% are spoken materials.
Languages
Portuguese (por)
The EMILLE/CIIL Corpus
1.5 Gb
The EMILLE/CIIL Corpus consists of monolingual corpora containing approximately 92,799,000 words for 14 South Asian lan…
Urdu (urd); Telugu (tel); Tamil (tam); Si…
ELRA-W0037
Name
The EMILLE/CIIL Corpus (ELRA-W0037)
URL
http://catalog.elra.info/product_info.php?products_id=696
Description
The EMILLE/CIIL Corpus consists of monolingual corpora containing approximately 92,799,000 words for 14 South Asian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu), including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu, and a parallel corpus of 200,000 words in English with translations in Hindi, Bengali, Punjabi, Gujarati and Urdu. Annotations include Urdu monolingual and parallel corpora automatically annotated for parts-of-speech, and 20 written Hindi corpus files annotated to show the nature of demonstrative use. All other components are annotated at the sentence level. The corpus is marked up using CES-compliant SGML and encoded using Unicode.
This database is available for research use by academic organisations only. For use by commercial organisations, a subset of the EMILLE/CIIL Corpus is available under the reference ELRA-W0038 (The EMILLE Lancaster Corpus).
Languages
Urdu (urd)
Telugu (tel)
Tamil (tam)
Sinhalese (sin)
Panjabi, Punjabi (pan)
Oriya (ori)
Marathi (mar)
Malayalam (mal)
Kashmiri (kas)
Kannada (kan)
Hindi (hin)
Gujarati (guj)
Bengali (ben)
Assamese (asm)
The Lancaster Corpus of Mandarin Chinese (LCMC)
45 Mb
The Lancaster Corpus of Mandarin Chinese (LCMC) sampled 15 written text categories including news, literary texts, acad…
Chinese (chi)
ELRA-W0039
Name
The Lancaster Corpus of Mandarin Chinese (LCMC) (ELRA-W0039)
URL
http://catalog.elra.info/product_info.php?products_id=715
Description
The Lancaster Corpus of Mandarin Chinese (LCMC) sampled 15 written text categories, including news, literary texts, academic prose and official documents, published in P. R. China in the early 1990s, for a total of approximately 1 million words. The same sampling frame and period as FLOB/FROWN were used in LCMC. The corpus is encoded in Unicode (UTF-8) and marked up in XML.
Languages
Chinese (chi)
TRAD Pashto-English News Articles Parallel corpus
602 Kb
This is a parallel corpus, which contains 10,000 Pashto words translated into English by two different translators. The…
English (eng); Pushto (pus)
…
ELRA-W0097
Name
TRAD Pashto-English News Articles Parallel corpus (ELRA-W0097)
URL
http://catalog.elra.info/product_info.php?products_id=1271
Description
This is a parallel corpus, which contains 10,000 Pashto words translated into English by two different translators. The source texts have been collected from the following news websites: Azadiradio, Mashaal and Voice of America Pashto.
Languages
English (eng)
Pushto (pus)
TRAD Pashto-English Parallel corpus of transcribed Broadcast News Speech - Test data
575 Kb
This is a parallel corpus, which contains 10,000 Pashto words translated into English. The source texts come from 3 bro…
English (eng); Pushto (pus)
…
ELRA-W0095
Name
TRAD Pashto-English Parallel corpus of transcribed Broadcast News Speech - Test data (ELRA-W0095)
URL
http://catalog.elra.info/product_info.php?products_id=1269
Description
This is a parallel corpus, which contains 10,000 Pashto words translated into English. The source texts come from 3 broadcast news transcriptions of the TRAD Pashto Broadcast News Speech Corpus (ELRA-S0381).
Languages
English (eng)
Pushto (pus)
TRAD Pashto-French News Articles Parallel corpus
970 Kb
This is a parallel corpus, which contains 10,000 Pashto words translated into French by two different translators. The …
French (fre); Pushto (pus)
…
ELRA-W0096
Name
TRAD Pashto-French News Articles Parallel corpus (ELRA-W0096)
URL
http://catalog.elra.info/product_info.php?products_id=1270
Description
This is a parallel corpus, which contains 10,000 Pashto words translated into French by two different translators. The source texts have been collected from the following news websites: Azadiradio, Mashaal and Voice of America Pashto.
Languages
French (fre)
Pushto (pus)
TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Test data
29 Mb
This is a parallel corpus, which contains 10,000 Pashto words translated into French by two different translators. The …
French (fre); Pushto (pus)
…
ELRA-W0094
Name
TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Test data (ELRA-W0094)
URL
http://catalog.elra.info/product_info.php?products_id=1268
Description
This is a parallel corpus, which contains 10,000 Pashto words translated into French by two different translators. The source texts come from 3 broadcast news transcriptions of the TRAD Pashto Broadcast News Speech Corpus (ELRA-S0381).
Languages
French (fre)
Pushto (pus)
TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Training data
473 Mb
This corpus consists of the transcription of 106 hours of recordings in Pashto from the TRAD Pashto Broadcast News Spee…
French (fre); Pushto (pus)
…
ELRA-W0093
Name
TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Training data (ELRA-W0093)
URL
http://catalog.elra.info/product_info.php?products_id=1267
Description
This corpus consists of the transcription of 106 hours of recordings in Pashto from the TRAD Pashto Broadcast News Speech Corpus (ELRA-S0381) translated into French. It contains about 832,000 source words and 747,000 target words.
Languages
French (fre)
Pushto (pus)
TRAD Pashto Monolingual text Corpus
2.2 Gb
This is a monolingual text corpus in Pashto. The corpus contains about 112,000,000 tokens collected from 46 different b…
Pushto (pus)
ELRA-W0092
TSNLP (Test Suites for NLP Testing)
4.5 Mb
Test Suites for Natural Language Processing.
4,000 test items (sentences or fragments of sentences) in English, French…
English (eng); French (fre); German (ger)…
ELRA-W0013
Name
TSNLP (Test Suites for NLP Testing) (ELRA-W0013)
URL
http://catalog.elra.info/product_info.php?products_id=51
Description
Test Suites for Natural Language Processing.
4,000 test items (sentences or fragments of sentences) in English, French & German, useful for NL system evaluation.
Languages
English (eng)
French (fre)
German (ger)
Venice Italian Treebank (VIT)
149 Mb
The VIT, Venice Italian Treebank contains about 272,000 words distributed over six different domains: bureaucratic, pol…
Italian (ita)
ELRA-W0040
Name
Venice Italian Treebank (VIT) (ELRA-W0040)
URL
http://catalog.elra.info/product_info.php?products_id=831
Description
The VIT, Venice Italian Treebank contains about 272,000 words distributed over six different domains: bureaucratic, political, economic and financial, literary, scientific, and news. In addition, some 60,000 tokens of spoken dialogues in different Italian varieties were annotated.
The annotation follows general X-bar criteria with 29 constituency labels and 102 PoS tags. VIT is also made available in a broad annotation version with 10 constituency labels and 22 PoS tags for machine learning purposes. The format is plain text with square bracketing. However, a UPenn style version which is readable by the open source query language CorpusSearch is also provided.
Languages
Italian (ita)
Wolverhampton Business English Corpus
118 Mb
Produced by the Computational Linguistics Group at University of Wolverhampton through a funding from ELRA in the frame…
English (eng)
ELRA-W0028
Name
Wolverhampton Business English Corpus (ELRA-W0028)
URL
http://catalog.elra.info/product_info.php?products_id=627
Description
Produced by the Computational Linguistics Group at the University of Wolverhampton with funding from ELRA in the framework of the European Commission project LRsP&P (Language Resources Production & Packaging - LE4-8335), the Business English Corpus consists of 10,186,259 words collected from 23 different Web sites related to business.
Languages
English (eng)