List of Corpora

Name Size Description Language ELRA Details Your selection
"Le Monde Diplomatique" Arabic tagged corpus 59 Mb This corpus contains 102,960 vowelised, lemmatised and tagged words (58 texts from Le Monde Diplomatique Arabic, see al… Arabic (ara) ELRA-W0049 Details
"Le Monde Diplomatique" Arabic tagged corpus
Name "Le Monde Diplomatique" Arabic tagged corpus (ELRA-W0049)
URL http://catalog.elra.info/product_info.php?products_id=1096
Description This corpus contains 102,960 vowelised, lemmatised and tagged words (58 texts from Le Monde Diplomatique Arabic, see also ELRA-W0036-04). Each text comes with 3 files: the raw text in Arabic, the vowelised text in Arabic, and one XML file containing the morphological annotation of the text.
Languages Arabic (ara)
"Le Monde Diplomatique" Text corpus in Arabic 57 Mb Electronic archiving of "Le Monde Diplomatique" articles in Arabic from 2000. The corpus is available in HTML. Each HTM… Arabic (ara) ELRA-W0036-04 Details
"Le Monde Diplomatique" Text corpus in Arabic
Name "Le Monde Diplomatique" Text corpus in Arabic (ELRA-W0036-04)
URL http://catalog.elra.info/product_info.php?products_id=717
Description Electronic archiving of "Le Monde Diplomatique" articles in Arabic from 2000. The corpus is available in HTML. Each HTML file contains one article.
Languages Arabic (ara)
"Le Monde Diplomatique" Text corpus in English 28 Mb Electronic archiving of "Le Monde Diplomatique" articles in English from 1999. The corpus is available in HTML. Each HT… English (eng) ELRA-W0036-03 Details
"Le Monde Diplomatique" Text corpus in English
Name "Le Monde Diplomatique" Text corpus in English (ELRA-W0036-03)
URL http://catalog.elra.info/product_info.php?products_id=8
Description Electronic archiving of "Le Monde Diplomatique" articles in English from 1999. The corpus is available in HTML. Each HTML file contains one article.
Languages English (eng)
"Le Monde Diplomatique" Text corpus in French - archives 1980-1998 233 Mb Electronic archiving of "Le Monde Diplomatique" articles in French from 1980 to 1998. The corpus is available in HTML. … French (fre) ELRA-W0036-01 Details
"Le Monde Diplomatique" Text corpus in French - archives 1980-1998
Name "Le Monde Diplomatique" Text corpus in French - archives 1980-1998 (ELRA-W0036-01)
URL http://catalog.elra.info/product_info.php?products_id=7
Description Electronic archiving of "Le Monde Diplomatique" articles in French from 1980 to 1998. The corpus is available in HTML. Each HTML file contains one article.
Languages French (fre)
"Le Monde Diplomatique" Text corpus in French - archives from 1999 90 Mb Electronic archiving of "Le Monde Diplomatique" articles in French from 1999. The corpus is available in HTML. Each HTM… French (fre) ELRA-W0036-02 Details
"Le Monde Diplomatique" Text corpus in French - archives from 1999
Name "Le Monde Diplomatique" Text corpus in French - archives from 1999 (ELRA-W0036-02)
URL http://catalog.elra.info/product_info.php?products_id=9
Description Electronic archiving of "Le Monde Diplomatique" articles in French from 1999. The corpus is available in HTML. Each HTML file contains one article.
Languages French (fre)
2006 CoNLL Shared Task - Ten Languages 85.2 Mb 2006 CoNLL Shared Task - Ten Languages consists of dependency treebanks in ten languages used as part of the CoNLL 2006… Turkish (tur); Bulgarian (bul); Dutch, Fl… ELRA-W0086 Details
2006 CoNLL Shared Task - Ten Languages
Name 2006 CoNLL Shared Task - Ten Languages (ELRA-W0086)
URL http://catalog.elra.info/product_info.php?products_id=1250
Description 2006 CoNLL Shared Task - Ten Languages consists of dependency treebanks in ten languages used as part of the CoNLL 2006 shared task on multi-lingual dependency parsing. The languages covered in this release are: Bulgarian, Danish, Dutch, German, Japanese, Portuguese, Slovene, Spanish, Swedish and Turkish. The source data in the treebanks in this release consists principally of various texts (e.g., textbooks, news, literature) annotated in dependency format.
Languages
  • Turkish (tur)
  • Bulgarian (bul)
  • Dutch, Flemish (dut)
  • German (ger)
  • Japanese (jpn)
  • Spanish, Castilian (spa)
  • Danish (dan)
  • Portuguese (por)
  • Swedish (swe)
  • Slovenian (slv)
A "scientific" corpus of modern French ("La Recherche" magazine) - Complete version 23 Mb Produced through funding from ELRA in the framework of the European Commission project LRsP&P… French (fre) ELRA-W0025-02 Details
A "scientific" corpus of modern French ("La Recherche" magazine) - Complete version
Name A "scientific" corpus of modern French ("La Recherche" magazine) - Complete version (ELRA-W0025-02)
URL http://catalog.elra.info/product_info.php?products_id=595
Description Produced through funding from ELRA in the framework of the European Commission project LRsP&P (Language Resources Production & Packaging - LE4-8335), the corpus contains all articles published in La Recherche magazine in 1998, from issue 305 (January) to issue 315 (December), which amounts to 447,244 tokens and 30,238 types. Two versions are available: the raw data (XML format) and the complete version (XML and SGML formats).
Languages French (fre)
ARCADE/ROMANSEVAL corpus 63 Mb The corpus contains raw data from the JOC corpus developed in the MULTEXT project financed by the European Commission (… English (eng); French (fre); Italian (ita… ELRA-W0018 Details
ARCADE/ROMANSEVAL corpus
Name ARCADE/ROMANSEVAL corpus (ELRA-W0018)
URL http://catalog.elra.info/product_info.php?products_id=535
Description The corpus contains raw data from the JOC corpus developed in the MULTEXT project financed by the European Commission (LRE 62-050), composed of 1 million words in English and four Romance languages: French, Italian, Spanish and Portuguese (Written Questions and Answers from the Official Journal of the European Commission). The annotation covers all the contexts of 60 different test words (20 nouns, 20 adjectives, 20 verbs), i.e. ca. 3700 contexts altogether. It comprises semantic tagging of all the occurrences of the test words in the JOC corpus for French and Italian, and word-level alignment of all the occurrences of the test words between French and English.
Languages
  • English (eng)
  • French (fre)
  • Italian (ita)
Al-Hayat Arabic Corpus 1.1 Gb The corpus contains articles extracted from the newspaper Al-Hayat, organised in 7 domains, for language engineering ap… Arabic (ara) ELRA-W0030 Details
Al-Hayat Arabic Corpus
Name Al-Hayat Arabic Corpus (ELRA-W0030)
URL http://catalog.elra.info/product_info.php?products_id=632
Description The corpus contains articles extracted from the newspaper Al-Hayat, organised in 7 domains, for the development of language engineering applications.
Languages Arabic (ara)
Amaryllis Corpus - Evaluation Package 505 Mb AMARYLLIS was organised by the Institut de l'Information Scientifique et Technique (INIST) with the support of the Agen… French (fre) ELRA-W0029 Details
Amaryllis Corpus - Evaluation Package
Name Amaryllis Corpus - Evaluation Package (ELRA-W0029)
URL http://catalog.elra.info/product_info.php?products_id=626
Description AMARYLLIS was organised by the Institut de l'Information Scientifique et Technique (INIST) with the support of the Agence francophone pour l'enseignement supérieur et la recherche (AUPELF-UREF) and the French Ministère de l'Education Nationale, de la Recherche et de la Technologie (MERT) to create document corpora, questions and answers, in the framework of the Action de Recherche Concertée (ARC A1, renamed Amaryllis - Access to text information in French), in order to carry out work similar to the United States TREC project. All corpora are structured as SGML files with ISO Latin character encoding.
Languages French (fre)
Amharic-English bilingual corpus 15 Mb The Amharic-English bilingual corpus contains parallel text from legal and news domains in Amharic script, in translite… English (eng); Amharic (amh) … ELRA-W0074 Details
Amharic-English bilingual corpus
Name Amharic-English bilingual corpus (ELRA-W0074)
URL http://catalog.elra.info/product_info.php?products_id=1215
Description The Amharic-English bilingual corpus contains parallel text from legal and news domains in Amharic script, in transliterated form and in English. The corpus contains 232,653 words in Amharic and 291,701 in English.
Languages
  • English (eng)
  • Amharic (amh)
An-Nahar Newspaper Text Corpus 794 Mb The An-Nahar Newspaper Text Corpus comprises articles in Arabic (Lebanon) from 1995 to 2000 (6 years) stored as HTML fi… Arabic (ara) ELRA-W0027 Details
An-Nahar Newspaper Text Corpus
Name An-Nahar Newspaper Text Corpus (ELRA-W0027)
URL http://catalog.elra.info/product_info.php?products_id=767
Description The An-Nahar Newspaper Text Corpus comprises articles in Arabic (Lebanon) from 1995 to 2000 (6 years) stored as HTML files on CD-ROM media. Each year contains 45,000 articles and 24 million words.
Languages Arabic (ara)
Arboretum treebank 26 Mb The Arboretum treebank is a morphologically and syntactically annotated repository of Danish sentences. It consists of … Danish (dan) ELRA-W0084 Details
Arboretum treebank
Name Arboretum treebank (ELRA-W0084)
URL http://catalog.elra.info/product_info.php?products_id=1248
Description The Arboretum treebank is a morphologically and syntactically annotated repository of Danish sentences. It consists of about 425,000 tokens and there are ca. 22,260 sentences/utterances containing 3 or more tokens. Arboretum provides named entity categories for all proper nouns. It also contains subclass categorisation for the pronoun and adverb word classes. The final version of the treebank consists of two independent versions, constituent trees and dependency trees, and is distributed in the following formats: 1. Native dependency format (Constraint Grammar format); 2. Dependency annotation converted to MALT XML format; 3. Native constituent tree format (Cross-language VISL standard); 4. Constituent format converted to TIGER XML.
Languages Danish (dan)
CINTIL-DeepBank 213 Mb The CINTIL-DeepBank (Branco et al., 2010) is a corpus of sentences annotated with their full-fledged deep grammatical r… Portuguese (por) ELRA-W0062 Details
CINTIL-DeepBank
Name CINTIL-DeepBank (ELRA-W0062)
URL http://catalog.elra.info/product_info.php?products_id=1181
Description The CINTIL-DeepBank (Branco et al., 2010) is a corpus of sentences annotated with their full-fledged deep grammatical representations, composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens), and novels (399 sentences; 3,082 tokens). In addition, there are 779 sentences (5,654 tokens) used for regression testing of the computational grammar that supported the annotation of the corpus.
Languages Portuguese (por)
CINTIL-DependencyBank 1.4 Mb The CINTIL-DependencyBank (Silva and Branco, 2012) is a corpus of sentences annotated with their syntactic dependency g… Portuguese (por) ELRA-W0061 Details
CINTIL-DependencyBank
Name CINTIL-DependencyBank (ELRA-W0061)
URL http://catalog.elra.info/product_info.php?products_id=1180
Description The CINTIL-DependencyBank (Silva and Branco, 2012) is a corpus of sentences annotated with their syntactic dependency graphs and grammatical function tags composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens), novels (399 sentences; 3,082 tokens). In addition, there are 779 sentences (5,654 tokens) that are used for regression testing of the computational grammar that supported the annotation of the corpus.
Languages Portuguese (por)
CINTIL-PropBank 3.6 Mb The CINTIL-PropBank is a corpus of sentences annotated with their constituency structure and semantic role tags, compos… Portuguese (por) ELRA-W0056 Details
CINTIL-PropBank
Name CINTIL-PropBank (ELRA-W0056)
URL http://catalog.elra.info/product_info.php?products_id=1176
Description The CINTIL-PropBank is a corpus of sentences annotated with their constituency structure and semantic role tags, composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens), and novels (399 sentences; 3,082 tokens). In addition, there are 779 sentences (5,654 tokens) used for regression testing of the computational grammar that supported the annotation of the corpus.
Languages Portuguese (por)
CINTIL-TreeBank 3.1 Mb The CINTIL-TreeBank is a corpus of syntactic constituency trees of Portuguese texts composed of 10,039 sentences and 11… Portuguese (por) ELRA-W0055 Details
CINTIL-TreeBank
Name CINTIL-TreeBank (ELRA-W0055)
URL http://catalog.elra.info/product_info.php?products_id=1174
Description The CINTIL-TreeBank is a corpus of syntactic constituency trees of Portuguese texts composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens), novels (399 sentences; 3,082 tokens). In addition, there are 779 sentences (5,654 tokens) that are used for regression testing of the computational grammar that supported the annotation of the corpus.
Languages Portuguese (por)
CRATER 2 Corpus 359 Mb The CRATER 2 parallel corpus is an extension of the CRATER corpus, available in the catalogue under reference W0003. It… English (eng); French (fre); Spanish, Cas… ELRA-W0033 Details
CRATER 2 Corpus
Name CRATER 2 Corpus (ELRA-W0033)
URL http://catalog.elra.info/product_info.php?products_id=636
Description The CRATER 2 parallel corpus is an extension of the CRATER corpus, available in the catalogue under reference W0003. It consists of 1,500,000 tokens each for English and French and 1,000,000 tokens for Spanish, with morphosyntactic annotations. CRATER 2 (ref. ELRA-W0033) includes CRATER (ref. ELRA-W0003).
Languages
  • English (eng)
  • French (fre)
  • Spanish, Castilian (spa)
Catalan Corpus of News Articles 645 Mb The Catalan Corpus of News Articles comprises articles in Catalan from 1 January 1999 to 31 March 2007. These articles … Catalan, Valencian (cat) ELRA-W0047 Details
Catalan Corpus of News Articles
Name Catalan Corpus of News Articles (ELRA-W0047)
URL http://catalog.elra.info/product_info.php?products_id=990
Description The Catalan Corpus of News Articles comprises articles in Catalan from 1 January 1999 to 31 March 2007. These articles are grouped by three-month period, with no chronological ordering within each period.
Languages Catalan, Valencian (cat)
Catalan-Spanish Parallel Corpus 686 Mb This corpus contains more than 100 million words and it contains 10 years of bilingual articles from El Periódico de Ca… Spanish, Castilian (spa); Catalan, Valenc… ELRA-W0053 Details
Catalan-Spanish Parallel Corpus
Name Catalan-Spanish Parallel Corpus (ELRA-W0053)
URL http://catalog.elra.info/product_info.php?products_id=1122
Description This corpus contains more than 100 million words from 10 years of bilingual articles published in El Periódico de Catalunya. The data are aligned at sentence level and stored in text files, one sentence per line. The data are provided in plain text, with no encoding whatsoever.
Languages
  • Spanish, Castilian (spa)
  • Catalan, Valencian (cat)
Corpus of Contemporaneous Spanish Novels 4.8 Mb This corpus consists of 11 novels written in Castilian Spanish by Inmaculada Ferrer-Vidal Turull, a contemporaneous aut… Spanish, Castilian (spa) ELRA-W0041 Details
Corpus of Contemporaneous Spanish Novels
Name Corpus of Contemporaneous Spanish Novels (ELRA-W0041)
URL http://catalog.elra.info/product_info.php?products_id=847
Description This corpus consists of 11 novels written in Castilian Spanish by Inmaculada Ferrer-Vidal Turull, a contemporary author.
Languages Spanish, Castilian (spa)
Dutch PAROLE Distributable Corpus 70 Mb This Dutch corpus is a 3 million words selection built according to the specifications of the PAROLE project. Over 250,… Dutch, Flemish (dut) ELRA-W0019 Details
Dutch PAROLE Distributable Corpus
Name Dutch PAROLE Distributable Corpus (ELRA-W0019)
URL http://catalog.elra.info/product_info.php?products_id=543
Description This Dutch corpus is a 3-million-word selection built according to the specifications of the PAROLE project. Over 250,000 words of corpus texts (with TEI markup suppressed) have been PoS-tagged automatically. A total of 59,798 running words have been manually corrected and checked.
Languages Dutch, Flemish (dut)
ECI-ELSNET Italian & German tagged sub-corpus 3 Mb The data is extracted from the ECI corpus (the German Frankfurter Rundschau part) and the Italian corpus of ILC/CNR. It… German (ger); Italian (ita) … ELRA-W0005 Details
ECI-ELSNET Italian & German tagged sub-corpus
Name ECI-ELSNET Italian & German tagged sub-corpus (ELRA-W0005)
URL http://catalog.elra.info/product_info.php?products_id=86
Description The data is extracted from the ECI corpus (the German Frankfurter Rundschau part) and the Italian corpus of ILC/CNR. It contains the following domains: Economy (17,000 words), Politics (14,000 words), Culture (18,000 words), Sports (9,000 words), Local Events (8,500 words).
Languages
  • German (ger)
  • Italian (ita)
ECI/MCI (European Corpus Initiative/Multilingual Corpus I) 655 Mb Over 98 million words, covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese, M… Turkish (tur); Albanian (alb); Bulgarian … ELRA-W0004 Details
ECI/MCI (European Corpus Initiative/Multilingual Corpus I)
Name ECI/MCI (European Corpus Initiative/Multilingual Corpus I) (ELRA-W0004)
URL http://catalog.elra.info/product_info.php?products_id=85
Description Over 98 million words, covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese, Malay and more.
Languages
  • Turkish (tur)
  • Albanian (alb)
  • Bulgarian (bul)
  • Chinese (chi)
  • Czech (cze)
  • Dutch, Flemish (dut)
  • English (eng)
  • Estonian (est)
  • French (fre)
  • Gaelic, Scottish Gaelic (gla)
  • German (ger)
  • Greek, Modern (1453-) (gre)
  • Italian (ita)
  • Japanese (jpn)
  • Latin (lat)
  • Lithuanian (lit)
  • Malay (may)
  • Spanish, Castilian (spa)
  • Serbian (scc)
  • Danish (dan)
  • Russian (rus)
  • Norwegian (nor)
  • Uzbek (uzb)
  • Portuguese (por)
  • Swedish (swe)
EUROPARL Corpus Parallel Corpora: Portuguese-English 2.3 Gb The Portuguese-English subpart of the EUROPARL Corpus was extracted from the proceedings of the European Parliament. It… English (eng); Portuguese (por) … ELRA-W0090 Details
EUROPARL Corpus Parallel Corpora: Portuguese-English
Name EUROPARL Corpus Parallel Corpora: Portuguese-English (ELRA-W0090)
URL http://catalog.elra.info/product_info.php?products_id=1257
Description The Portuguese-English subpart of the EUROPARL Corpus was extracted from the proceedings of the European Parliament. It contains approximately 58,324,562 tokens of European Portuguese (L1) and 49,216,896 tokens of English (translation). It is composed of one text file for the English corpus and two files for the Portuguese version: a text file and an annotated file, containing a PoS tag and a lemma for each token.
Languages
  • English (eng)
  • Portuguese (por)
English-Nepali Parallel Corpus 47 Mb This corpus consists of a collection of national development texts in English and Nepali. A small set of data is aligne… English (eng); Nepali (nep) … ELRA-W0077 Details
English-Nepali Parallel Corpus
Name English-Nepali Parallel Corpus (ELRA-W0077)
URL http://catalog.elra.info/product_info.php?products_id=1217
Description This corpus consists of a collection of national development texts in English and Nepali. A small set of data is aligned at the sentence level (27,060 English words; 21,756 Nepali words), and a larger set of texts at the document level (617,340 English words; 596,571 Nepali words). An additional set of monolingual data in Nepali is also provided (386,879 words in Nepali).
Languages
  • English (eng)
  • Nepali (nep)
English-Persian parallel Corpus 40 Mb Please refer to ELRA-W0118 for the latest version of this corpus. This version consists of about 3,500,000 English and … English (eng); Persian (per) … ELRA-W0051 Details
English-Persian parallel Corpus
Name English-Persian parallel Corpus (ELRA-W0051)
URL http://catalog.elra.info/product_info.php?products_id=1111
Description Please refer to ELRA-W0118 for the latest version of this corpus. This version consists of about 3,500,000 English and Persian (Farsi) words aligned at sentence level (about 100,000 sentences). The files are in Unicode. The corpus was originally created with SQL Server, but it is provided as an Access file.
Languages
  • English (eng)
  • Persian (per)
GeFRePaC - German French Reciprocal Parallel Corpus 1.3 Gb GeFRePaC was produced in the framework of the LRsP&P project. It cont… French (fre); German (ger) … ELRA-W0031 Details
GeFRePaC - German French Reciprocal Parallel Corpus
Name GeFRePaC - German French Reciprocal Parallel Corpus (ELRA-W0031)
URL http://catalog.elra.info/product_info.php?products_id=633
Description GeFRePaC was produced in the framework of the LRsP&P project. It contains 30 million words (15 million for each language) for the purpose of developing, enhancing and improving translation aids.
Languages
  • French (fre)
  • German (ger)
ICE-GB (British English component of the International Corpus of English) 97 Mb British component of the International Corpus of English (ICE), ICE-GB consists of a million words (83,394 parse trees,… English (eng) ELRA-W0021 Details
ICE-GB (British English component of the International Corpus of English)
Name ICE-GB (British English component of the International Corpus of English) (ELRA-W0021)
URL http://catalog.elra.info/product_info.php?products_id=762
Description British component of the International Corpus of English (ICE), ICE-GB consists of a million words (83,394 parse trees, including 59,640 in the spoken part of the corpus) extracted from 200 written and 300 spoken English texts. It is fully grammatically annotated and has been fully checked. ICE-GB is distributed with the retrieval software ICECUP (the International Corpus of English Corpus Utility Program).
Languages English (eng)
ILSP/ELEFTHEROTYPIA Corpus (Greek corpus) 27 Mb This corpus contains approximately 3 million words from the daily newspaper ELEFTHEROTYPIA, classified and annotated ac… Greek, Modern (1453-) (gre) … ELRA-W0022 Details
ILSP/ELEFTHEROTYPIA Corpus (Greek corpus)
Name ILSP/ELEFTHEROTYPIA Corpus (Greek corpus) (ELRA-W0022)
URL http://catalog.elra.info/product_info.php?products_id=763
Description This corpus contains approximately 3 million words from the daily newspaper ELEFTHEROTYPIA, classified and annotated according to the common core PAROLE encoding standard. The corpus is distributed as SGML files. A subset of the corpus (250,000 words) is morpho-syntactically tagged; all the words are also lemmatised and checked.
Languages Greek, Modern (1453-) (gre)
Italian Syntactic-Semantic Treebank (ISST) 90 Mb ISST comprises 89,941 tokens for the financial-domain part and 215,606 tokens for the general part. It is formatted in … Italian (ita) ELRA-W0044 Details
Italian Syntactic-Semantic Treebank (ISST)
Name Italian Syntactic-Semantic Treebank (ISST) (ELRA-W0044)
URL http://catalog.elra.info/product_info.php?products_id=887
Description ISST comprises 89,941 tokens for the financial-domain part and 215,606 tokens for the general part. It is formatted in XML. This Treebank has a five-level structure covering orthographic, morpho-syntactic, syntactic, semantic and lexico-semantic levels of linguistic description. Syntactic annotation is distributed over two different levels: the constituent structure level and the functional relations level. The fifth level deals with lexico-semantic annotation, which is carried out in terms of sense tagging of lexical heads (nouns, verbs and adjectives) augmented with other types of semantic information: ItalWordNet (see ELRA-M0018) is the reference lexical resource used for the sense tagging task. Both syntactic and lexico-semantic annotations refer to the morpho-syntactically annotated text, which in turn is linked to the orthographic file with the text and mark-up of macrotextual organisation (e.g. titles, subtitles, summary, body of article, paragraphs).
Languages Italian (ita)
Karl May Korpus (KMK) 77 Mb Karl-May-Korpus is a German monolingual corpus, available in an SGML-tagged ASCII text format. It contains the works of… German (ger) ELRA-W0016 Details
Karl May Korpus (KMK)
Name Karl May Korpus (KMK) (ELRA-W0016)
URL http://catalog.elra.info/product_info.php?products_id=450
Description Karl-May-Korpus is a German monolingual corpus, available in an SGML-tagged ASCII text format. It contains the works of the German author Karl May and consists of around 1.6 million words (divided into 9 sub-corpora of about 180,000 words each).
Languages German (ger)
Khresmoi manually annotated reference corpus 1.3 Gb This corpus is a collection of Khresmoi English web documents annotated with key entities (such as disease, drug). The … English (eng) ELRA-W0081 Details
Khresmoi manually annotated reference corpus
Name Khresmoi manually annotated reference corpus (ELRA-W0081)
URL http://catalog.elra.info/product_info.php?products_id=1237
Description This corpus is a collection of Khresmoi English web documents annotated with key entities (such as disease, drug). The corpus is divided into two parts: 1. The initial corpus: 625 documents from the Genetics Home Reference data set, automatically annotated with anatomical locations and diseases, and manually corrected by 3-4 annotators. Size of documents: between 26 and 8,306 tokens each. 2. The main corpus: 6,950 English documents from the Khresmoi crawl and 5,518 English Wikipedia pages, automatically annotated through the GATE Platform for Anatomy, Disease, Drug and Investigation. Size of documents: between 200 and 2,000 tokens each. The corpus uses the GATE XML format.
Languages English (eng)
LT Corpus 43 Mb The LT Corpus is composed of 70 fiction texts from Portuguese renowned authors. The corpus contains 1,781,083 tokens. T… Portuguese (por) ELRA-W0059 Details
LT Corpus
Name LT Corpus (ELRA-W0059)
URL http://catalog.elra.info/product_info.php?products_id=1178
Description The LT Corpus is composed of 70 fiction texts by renowned Portuguese authors. The corpus contains 1,781,083 tokens. The texts date from before 1940. The corpus is delivered in one file, in two different formats. The txt version has one sentence per line, an identification number for each text and no further annotation. The cqpweb file has one token per line, followed by its PoS tag and lemma, and is annotated for NP chunks.
Languages Portuguese (por)
MLCC Multilingual and Parallel Corpora 915 Mb The first set contains articles from 6 European newspapers: Het Financieele Dagblad (Dutch, 8.5 million words), The Fin… Dutch, Flemish (dut); English (eng); Fren… ELRA-W0023 Details
MLCC Multilingual and Parallel Corpora
Name MLCC Multilingual and Parallel Corpora (ELRA-W0023)
URL http://catalog.elra.info/product_info.php?products_id=764
Description The first set contains articles from 6 European newspapers: Het Financieele Dagblad (Dutch, 8.5 million words), The Financial Times (English, 30 million words), Le Monde (French, 10 million words), Handelsblatt (German, 33 million words), Il Sole 24 Ore (Italian, 1.88 million words), Expansion (Spanish, 10 million words). The second set consists of a parallel corpus of translated data in the nine official European languages (1992-1994) divided into 2 sub-corpora: written questions (10.2 million words) and parliamentary debates (5 to 8 million words per language).
Languages
  • Dutch, Flemish (dut)
  • English (eng)
  • French (fre)
  • German (ger)
  • Italian (ita)
  • Spanish, Castilian (spa)
MTP Annotated German corpus - tagged version 35 Mb A 500,000 German words corpus of SGML-formatted texts from two German newspapers, the Frankfurter Allgemeine Zeitung an… German (ger) ELRA-W0008-02 Details
MTP Annotated German corpus - tagged version
Name MTP Annotated German corpus - tagged version (ELRA-W0008-02)
URL http://catalog.elra.info/product_info.php?products_id=480
Description A 500,000-word German corpus of SGML-formatted texts from two German newspapers, the Frankfurter Allgemeine Zeitung and Die Zeit, for the years 1990 to 1992.
Languages German (ger)
MTP Annotated German corpus - untagged version 283 Mb A 500,000 German words corpus of SGML-formatted texts from two German newspapers, the Frankfurter Allgemeine Zeitung an… German (ger) ELRA-W0008-01 Details
MTP Annotated German corpus - untagged version
Name MTP Annotated German corpus - untagged version (ELRA-W0008-01)
URL http://catalog.elra.info/product_info.php?products_id=47
Description A 500,000-word German corpus of SGML-formatted texts from two German newspapers, the Frankfurter Allgemeine Zeitung and Die Zeit, for the years 1990 to 1992.
Languages German (ger)
MULTEXT JOC Corpus 114 Mb This CD-ROM contains a part of the corpus developed in the MULTEXT project financed by the European Commission (LRE 62-… English (eng); French (fre); German (ger)… ELRA-W0017 Details
MULTEXT JOC Corpus
Name MULTEXT JOC Corpus (ELRA-W0017)
URL http://catalog.elra.info/product_info.php?products_id=534
Description This CD-ROM contains a part of the corpus developed in the MULTEXT project financed by the European Commission (LRE 62-050). This part contains raw, tagged and aligned data from the Written Questions and Answers of the Official Journal of the European Community. The corpus contains ca. 5 million words in English, French, German, Italian and Spanish (ca. 1 million words per language). About 800,000 words were grammatically tagged and manually checked for English, French, Italian and Spanish, i.e. roughly 200,000 words per language. The same subset for French, German, Italian and Spanish was aligned to English at the sentence level.
Languages
  • English (eng)
  • French (fre)
  • German (ger)
  • Italian (ita)
  • Spanish, Castilian (spa)
Modern French Corpus including Anaphors Tagging 13 Mb This modern French corpus contains over 1 million words with a tagging of the anaphors, and cover many different aspect… French (fre) ELRA-W0032 Details
Modern French Corpus including Anaphors Tagging
Name Modern French Corpus including Anaphors Tagging (ELRA-W0032)
URL http://catalog.elra.info/product_info.php?products_id=634
Description This modern French corpus contains over 1 million words with anaphora tagging, and covers many different aspects of the French language (scientific and human sciences articles, extracts from newspapers and magazines, legal texts, etc.). The annotation scheme was defined in XML.
Languages French (fre)
Monolingual Greek corpus 5.1 Mb Corpus of 1 million words consisting of articles written in 1996 from the Greek daily newspaper ELEFTHEROTYPIA. Greek, Modern (1453-) (gre) … ELRA-W0014 Details
Monolingual Greek corpus
Name Monolingual Greek corpus (ELRA-W0014)
URL http://catalog.elra.info/product_info.php?products_id=716
Description Corpus of 1 million words consisting of articles written in 1996 from the Greek daily newspaper ELEFTHEROTYPIA.
Languages Greek, Modern (1453-) (gre)
Multilingual Corpus 9.9 Mb Multilingual parallel corpus produced by Kaist Korterm containing 60 000 expressions in Korean, Chinese and English. Chinese (chi); English (eng); Korean (kor… ELRA-W0035 Details
Multilingual Corpus
Name Multilingual Corpus (ELRA-W0035)
URL http://catalog.elra.info/product_info.php?products_id=655
Description Multilingual parallel corpus produced by KAIST KORTERM containing 60 000 expressions in Korean, Chinese and English.
Languages
  • Chinese (chi)
  • English (eng)
  • Korean (kor)
NE3L named entities Arabic corpus 3 Mb The Arabic corpus contains 103,363 words coming from articles extracted from Le Monde Diplomatique newspaper, and publi… Arabic (ara) ELRA-W0078 Details
NE3L named entities Arabic corpus
Name NE3L named entities Arabic corpus (ELRA-W0078)
URL http://catalog.elra.info/product_info.php?products_id=1226
Description The Arabic corpus contains 103,363 words coming from articles extracted from Le Monde Diplomatique newspaper, and published in 2004. 2 named entity categories were taken into account: Time and Amount.
Languages Arabic (ara)
NE3L named entities Chinese corpus 4.8 Mb The Chinese corpus contains 79,302 words coming from articles extracted from Le Monde Diplomatique newspaper, and publi… Chinese (chi) ELRA-W0079 Details
NE3L named entities Chinese corpus
Name NE3L named entities Chinese corpus (ELRA-W0079)
URL http://catalog.elra.info/product_info.php?products_id=1227
Description The Chinese corpus contains 79,302 words coming from articles extracted from Le Monde Diplomatique newspaper, and published in 2001. 3 named entity categories were taken into account: Person, Place and Organisation.
Languages Chinese (chi)
NE3L named entities Russian corpus 2.7 Mb The Russian corpus contains 75,784 words coming from articles extracted from Izvestia newspaper, and published in 1995.… Russian (rus) ELRA-W0080 Details
NE3L named entities Russian corpus
Name NE3L named entities Russian corpus (ELRA-W0080)
URL http://catalog.elra.info/product_info.php?products_id=1228
Description The Russian corpus contains 75,784 words coming from articles extracted from Izvestia newspaper, and published in 1995. 2 named entity categories were taken into account: Time and Amount.
Languages Russian (rus)
NEMLAR Written Corpus 136 Mb The NEMLAR Written Corpus consists of about 500,000 words of Arabic text from 13 different categories. The corpus is pr… Arabic (ara) ELRA-W0042 Details
NEMLAR Written Corpus
Name NEMLAR Written Corpus (ELRA-W0042)
URL http://catalog.elra.info/product_info.php?products_id=873
Description The NEMLAR Written Corpus consists of about 500,000 words of Arabic text from 13 different categories. The corpus is provided in 4 different versions: raw text, fully vowelized text, text with Arabic lexical analysis, text with Arabic POS-tags.
Languages Arabic (ara)
NPChunks 412 Kb NPChunks is a training corpus containing approximately 1,000 sentences, with a total of 24,243 tokens, selected randoml… Portuguese (por) ELRA-W0089 Details
NPChunks
Name NPChunks (ELRA-W0089)
URL http://catalog.elra.info/product_info.php?products_id=1256
Description NPChunks is a training corpus containing approximately 1,000 sentences, with a total of 24,243 tokens, selected randomly from the written part of the CINTIL corpus. The corpus is PoS-annotated at token level, including punctuation. Noun Phrases were annotated with specific tags. It was automatically PoS-tagged with MBT tagger, and lemmatized with MBLEM, following the annotation scheme of the Corpus of Reference of Contemporary Portuguese.
Languages Portuguese (por)
Nepali Monolingual written corpus 683 Mb The Nepali Monolingual written corpus comprises the core corpus (core sample) and the general corpus. The core sample (… Nepali (nep) ELRA-W0076 Details
Nepali Monolingual written corpus
Name Nepali Monolingual written corpus (ELRA-W0076)
URL http://catalog.elra.info/product_info.php?products_id=1216
Description The Nepali Monolingual written corpus comprises the core corpus (core sample) and the general corpus. The core sample (CS) represents the collection of Nepali written texts from 15 different genres with 2000 words each published between 1990 and 1992. It is based on FLOB/FROWN corpora and contains 802,000 words. The general corpus (GC) consists of written texts collected opportunistically from a wide range of sources such as websites, newspapers, books, publishers and authors. It contains 1,400,000 words.
Languages Nepali (nep)
PANACEA English-French and English-Greek parallel corpus acquired for Environment domain 11 Mb This package consists of an English-French and English-Greek sentence-aligned parallel corpus from the Environment doma… English (eng); French (fre) … ELRA-W0057 Details
PANACEA English-French and English-Greek parallel corpus acquired for Environment domain
Name PANACEA English-French and English-Greek parallel corpus acquired for Environment domain (ELRA-W0057)
URL http://catalog.elra.info/product_info.php?products_id=1182
Description This package consists of an English-French and English-Greek sentence-aligned parallel corpus from the Environment domain automatically acquired from the web during 2010 and 2011. It was acquired in the framework of the PANACEA project. Data and language pairs are split into training, test and development test sets.
Languages
  • English (eng)
  • French (fre)
PANACEA English-French and English-Greek parallel corpus acquired for Labour Legislation domain 16 Mb This package consists of an English-French and English-Greek sentence-aligned parallel corpus from the Labour Legislati… English (eng); Greek, Modern (1453-) (gre… ELRA-W0058 Details
PANACEA English-French and English-Greek parallel corpus acquired for Labour Legislation domain
Name PANACEA English-French and English-Greek parallel corpus acquired for Labour Legislation domain (ELRA-W0058)
URL http://catalog.elra.info/product_info.php?products_id=1183
Description This package consists of an English-French and English-Greek sentence-aligned parallel corpus from the Labour Legislation domain automatically acquired from the web during 2010 and 2011. It was acquired in the framework of the PANACEA project. Data and language pairs are split into training, test and development test sets.
Languages
  • English (eng)
  • Greek, Modern (1453-) (gre)
PANACEA Environment English monolingual corpus 2.7 Gb This corpus consists of documents that were acquired from the web, were automatically detected to be in the English lan… English (eng) ELRA-W0063 Details
PANACEA Environment English monolingual corpus
Name PANACEA Environment English monolingual corpus (ELRA-W0063)
URL http://catalog.elra.info/product_info.php?products_id=1184
Description This corpus consists of documents that were acquired from the web, were automatically detected to be in the English language and were automatically classified as relevant to the Environment domain. It was constructed in the summer of 2011. It contains 50,541,538 tokens, divided into a total of 28,071 documents that were crawled from 3,121 web sites.
Languages English (eng)
PANACEA Environment French monolingual corpus 2.1 Gb This corpus consists of documents that were acquired from the web, were automatically detected to be in the French lang… French (fre) ELRA-W0065 Details
PANACEA Environment French monolingual corpus
Name PANACEA Environment French monolingual corpus (ELRA-W0065)
URL http://catalog.elra.info/product_info.php?products_id=1186
Description This corpus consists of documents that were acquired from the web, were automatically detected to be in the French language and were automatically classified as relevant to the Environment domain. It was constructed in the summer of 2011. It contains 47,364,125 tokens, divided into a total of 23,514 documents that were crawled from 1,969 web sites.
Languages French (fre)
PANACEA Environment Greek monolingual corpus 2 Gb This corpus consists of documents that were acquired from the web, were automatically detected to be in the Greek langu… Greek, Modern (1453-) (gre) … ELRA-W0067 Details
PANACEA Environment Greek monolingual corpus
Name PANACEA Environment Greek monolingual corpus (ELRA-W0067)
URL http://catalog.elra.info/product_info.php?products_id=1188
Description This corpus consists of documents that were acquired from the web, were automatically detected to be in the Greek language and were automatically classified as relevant to the Environment domain. It was constructed in the summer of 2011. It contains 27,958,530 tokens, divided into a total of 16,073 documents that were crawled from 1,063 web sites.
Languages Greek, Modern (1453-) (gre)
PANACEA Environment Italian monolingual corpus 1.8 Gb This corpus consists of documents that were acquired from the web, were automatically detected to be in the Italian lan… Italian (ita) ELRA-W0069 Details
PANACEA Environment Italian monolingual corpus
Name PANACEA Environment Italian monolingual corpus (ELRA-W0069)
URL http://catalog.elra.info/product_info.php?products_id=1190
Description This corpus consists of documents that were acquired from the web, were automatically detected to be in the Italian language and were automatically classified as relevant to the Environment domain. It was constructed in the summer of 2011. It contains 40,044,852 tokens, divided into a total of 16,159 documents that were crawled from 1,211 web sites.
Languages Italian (ita)
PANACEA Environment Spanish monolingual corpus 2.3 Gb This corpus consists of documents that were acquired from the web, were automatically detected to be in the Spanish lan… Spanish, Castilian (spa) ELRA-W0071 Details
PANACEA Environment Spanish monolingual corpus
Name PANACEA Environment Spanish monolingual corpus (ELRA-W0071)
URL http://catalog.elra.info/product_info.php?products_id=1192
Description This corpus consists of documents that were acquired from the web, were automatically detected to be in the Spanish language and were automatically classified as relevant to the Environment domain. It was constructed in the summer of 2011. It contains 46,225,624 tokens, divided into a total of 26,009 documents that were crawled from 2,053 web sites.
Languages Spanish, Castilian (spa)
PANACEA Labour English monolingual corpus 1.6 Gb This corpus consists of documents that were acquired from the web, were automatically detected to be in the English lan… English (eng) ELRA-W0064 Details
PANACEA Labour English monolingual corpus
Name PANACEA Labour English monolingual corpus (ELRA-W0064)
URL http://catalog.elra.info/product_info.php?products_id=1185
Description This corpus consists of documents that were acquired from the web, were automatically detected to be in the English language and were automatically classified as relevant to the Labour Legislation domain. It was constructed in the summer of 2011. It contains 46,431,351 tokens, divided into a total of 15,197 documents that were crawled from 1,558 web sites.
Languages English (eng)
PANACEA Labour French monolingual corpus 2.5 Gb This corpus consists of documents that were acquired from the web, were automatically detected to be in the French lang… French (fre) ELRA-W0066 Details
PANACEA Labour French monolingual corpus
Name PANACEA Labour French monolingual corpus (ELRA-W0066)
URL http://catalog.elra.info/product_info.php?products_id=1187
Description This corpus consists of documents that were acquired from the web, were automatically detected to be in the French language and were automatically classified as relevant to the Labour Legislation domain. It was constructed in the summer of 2011. It contains 56,440,425 tokens, divided into a total of 26,675 documents that were crawled from 1,391 web sites.
Languages French (fre)
PANACEA Labour Greek monolingual corpus 1.4 Gb This corpus consists of documents that were acquired from the web, were automatically detected to be in the Greek langu… Greek, Modern (1453-) (gre) … ELRA-W0068 Details
PANACEA Labour Greek monolingual corpus
Name PANACEA Labour Greek monolingual corpus (ELRA-W0068)
URL http://catalog.elra.info/product_info.php?products_id=1189
Description This corpus consists of documents that were acquired from the web, were automatically detected to be in the Greek language and were automatically classified as relevant to the Labour Legislation domain. It was constructed in the summer of 2011. It contains 21,077,196 tokens, divided into a total of 7,124 documents that were crawled from 598 web sites.
Languages Greek, Modern (1453-) (gre)
PANACEA Labour Italian monolingual corpus 2.4 Gb This corpus consists of documents that were acquired from the web, were automatically detected to be in the Italian lan… Italian (ita) ELRA-W0070 Details
PANACEA Labour Italian monolingual corpus
Name PANACEA Labour Italian monolingual corpus (ELRA-W0070)
URL http://catalog.elra.info/product_info.php?products_id=1191
Description This corpus consists of documents that were acquired from the web, were automatically detected to be in the Italian language and were automatically classified as relevant to the Labour Legislation domain. It was constructed in the summer of 2011. It contains 70,563,320 tokens, divided into a total of 12,706 documents that were crawled from 864 web sites.
Languages Italian (ita)
PANACEA Labour Spanish monolingual corpus 1.9 Gb This corpus consists of documents that were acquired from the web, were automatically detected to be in the Spanish lan… Spanish, Castilian (spa) ELRA-W0072 Details
PANACEA Labour Spanish monolingual corpus
Name PANACEA Labour Spanish monolingual corpus (ELRA-W0072)
URL http://catalog.elra.info/product_info.php?products_id=1193
Description This corpus consists of documents that were acquired from the web, were automatically detected to be in the Spanish language and were automatically classified as relevant to the Labour Legislation domain. It was constructed in the summer of 2011. It contains 53,922,118 tokens, divided into a total of 13,188 documents that were crawled from 1,015 web sites.
Languages Spanish, Castilian (spa)
PAROLE French Corpus 349 Mb The PAROLE French corpus contains the following data: Miscellaneous: Data provided by ELRA (CRATER, MLCC Multilingual … French (fre) ELRA-W0020 Details
PAROLE French Corpus
Name PAROLE French Corpus (ELRA-W0020)
URL http://catalog.elra.info/product_info.php?products_id=565
Description The PAROLE French corpus contains the following data: Miscellaneous: data provided by ELRA (CRATER, MLCC Multilingual and Parallel Corpora), 2 025 964 words; Books: CNRS Editions, 3 267 409 words; Periodicals: CNRS Info, Hermès, 942 963 words; Newspapers: Le Monde, provided by ELRA, 13 856 763 words; Total: 20 093 099 words.
Languages French (fre)
PAROLE Irish Distributable Corpus 25 Mb This corpus consists of over 8 million words. The text is marked up in accordance with the PAROLE encoding standard. All… Irish (gle) ELRA-W0026 Details
PAROLE Irish Distributable Corpus
Name PAROLE Irish Distributable Corpus (ELRA-W0026)
URL http://catalog.elra.info/product_info.php?products_id=597
Description This corpus consists of over 8 million words. The text is marked up in accordance with the PAROLE encoding standard. All the files are in SGML format with a detailed header and the body of the text tagged to paragraph level. A subset of the corpus is morpho-syntactically tagged. Included in this distribution are approximately 3,000 manually checked words.
Languages Irish (gle)
PAROLE Italian Corpus 44 Mb The PAROLE Italian Corpus comprises 3,135,651 words collected from four different domains: newspapers (2,179,800 words)… Italian (ita) ELRA-W0043 Details
PAROLE Italian Corpus
Name PAROLE Italian Corpus (ELRA-W0043)
URL http://catalog.elra.info/product_info.php?products_id=886
Description The PAROLE Italian Corpus comprises 3,135,651 words collected from four different domains: newspapers (2,179,800 words), periodicals (143,810 words), books (564,964 words), miscellaneous (247,077 words). About 250,000 words were morphosyntactically annotated and lemmatized.
Languages Italian (ita)
PAROLE Portuguese Corpus - complete version 57 Mb The PAROLE Portuguese corpus contains approximately 3 million running words of European Portuguese distributed by Mediu… Portuguese (por) ELRA-W0024-01 Details
PAROLE Portuguese Corpus - complete version
Name PAROLE Portuguese Corpus - complete version (ELRA-W0024-01)
URL http://catalog.elra.info/product_info.php?products_id=765
Description The PAROLE Portuguese corpus contains approximately 3 million running words of European Portuguese distributed by Medium (Newspaper, Book, Periodical, Miscellaneous). The corpus was classified and encoded according to the common core PAROLE encoding standard. The file format of this corpus is SGML. Also available is a subcorpus of about 250,000 morpho-syntactically tagged words. Disambiguation was manually checked.
Languages Portuguese (por)
PRESS 65 6.3 Mb Over 1 million running words taken from Swedish newspapers from year 65. Swedish (swe) ELRA-W0010 Details
PRESS 65
Name PRESS 65 (ELRA-W0010)
URL http://catalog.elra.info/product_info.php?products_id=48
Description Over 1 million running words taken from Swedish newspapers from year 65.
Languages Swedish (swe)
PTPARL Corpus 25 Mb The PTPARL Corpus contains 1,076 texts consisting of adapted transcriptions of the Portuguese Parliament sessions. The … Portuguese (por) ELRA-W0060 Details
PTPARL Corpus
Name PTPARL Corpus (ELRA-W0060)
URL http://catalog.elra.info/product_info.php?products_id=1179
Description The PTPARL Corpus contains 1,076 texts consisting of adapted transcriptions of the Portuguese Parliament sessions. The corpus contains 1,000,441 tokens. The corpus is delivered in one file, in two different formats. The txt version has one sentence per line, an identification number for each text and no further annotation. The cqpweb file has one token per line, followed by PoS tag and lemma, and is annotated for NP chunks.
Languages Portuguese (por)
Persian 1984 corpus (Multext-East framework) 5.9 Mb This corpus contains the Persian (Farsi) translation of a part of the novel 1984 (G. Orwell) annotated in the Multext-E… Persian (per) ELRA-W0054 Details
Persian 1984 corpus (Multext-East framework)
Name Persian 1984 corpus (Multext-East framework) (ELRA-W0054)
URL http://catalog.elra.info/product_info.php?products_id=1124
Description This corpus contains the Persian (Farsi) translation of a part of the novel 1984 (G. Orwell) annotated in the Multext-East framework (Multilingual Text Tools and Corpora for Eastern and Central European Languages). The corpus contains approximately 100,000 words (6,604 sentences, 13,247 lemmas), with extensive headers and markup for document structure, sentences, and various sub-sentence annotations in the XML-format following the TEI guidelines. Annotation includes POS (part-of-speech) and lemmas.
Languages Persian (per)
Quaero Old Press Extended Named Entity corpus 6.8 Gb This corpus consists of the manual annotation of 76 newspaper issues published in 1890-1891 and provided by the French … French (fre) ELRA-W0073 Details
Quaero Old Press Extended Named Entity corpus
Name Quaero Old Press Extended Named Entity corpus (ELRA-W0073)
URL http://catalog.elra.info/product_info.php?products_id=1194
Description This corpus consists of the manual annotation of 76 newspaper issues published in 1890-1891 and provided by the French National Library (Bibliothèque Nationale de France). Three different titles are used (Le Temps, La Croix and Le Figaro) for a total of 295 pages. The corpus is fully manually annotated according to the Quaero extended and structured named entity definition.
Languages French (fre)
Qualified POS Tagged Corpus 66 Mb Monolingual corpus in a .txt format, produced by KAIST KORTERM, containing 1,020,000 eojeols (Korean terms) in Korean. Th… Korean (kor) ELRA-W0034 Details
Qualified POS Tagged Corpus
Name Qualified POS Tagged Corpus (ELRA-W0034)
URL http://catalog.elra.info/product_info.php?products_id=654
Description Monolingual corpus in a .txt format, produced by KAIST KORTERM, containing 1,020,000 eojeols (Korean terms) in Korean. This corpus is morphologically analyzed, POS tagged, and rectified 3 times by specialists.
Languages Korean (kor)
ROCO Romanian journalistic corpus 729 Mb ROCO is a Romanian journalistic corpus containing approximately 7.1 million tokens, the number of types being 231,626. … Romanian (rum) ELRA-W0085 Details
ROCO Romanian journalistic corpus
Name ROCO Romanian journalistic corpus (ELRA-W0085)
URL http://catalog.elra.info/product_info.php?products_id=1249
Description ROCO is a Romanian journalistic corpus containing approximately 7.1 million tokens, the number of types being 231,626. It is rich in proper names, numerals and named entities. The corpus has been lemmatized and PoS annotated following the Multext-East morphosyntactic specifications, and it is XML encoded.
Languages Romanian (rum)
ROMBAC - Romanian balanced corpus 1.1 Gb ROMBAC is a Romanian corpus containing equal shares of texts from 5 different genres: journalism, legalese, fiction, me… Romanian (rum) ELRA-W0088 Details
ROMBAC - Romanian balanced corpus
Name ROMBAC - Romanian balanced corpus (ELRA-W0088)
URL http://catalog.elra.info/product_info.php?products_id=1253
Description ROMBAC is a Romanian corpus containing equal shares of texts from 5 different genres: journalism, legalese, fiction, medicine and biographical data for Romanian literary personalities. The entire corpus counts around 41,000,000 words, including punctuation. The corpus is annotated at paragraph, sentence, constituent group and word levels, and it provides morpho-syntactic information (MSD). It is XML encoded.
Languages Romanian (rum)
TRAD Pashto Monolingual text Corpus 2.2 Gb This is a monolingual text corpus in Pashto. The corpus contains about 112,000,000 tokens collected from 46 different b… Pushto (pus) ELRA-W0092 Details
TRAD Pashto Monolingual text Corpus
Name TRAD Pashto Monolingual text Corpus (ELRA-W0092)
URL http://catalog.elra.info/product_info.php?products_id=1266
Description This is a monolingual text corpus in Pashto. The corpus contains about 112,000,000 tokens collected from 46 different blogs and websites.
Languages Pushto (pus)
TRAD Pashto-English News Articles Parallel corpus 602 Kb This is a parallel corpus, which contains 10,000 Pashto words translated into English by two different translators. The… English (eng); Pushto (pus) … ELRA-W0097 Details
TRAD Pashto-English News Articles Parallel corpus
Name TRAD Pashto-English News Articles Parallel corpus (ELRA-W0097)
URL http://catalog.elra.info/product_info.php?products_id=1271
Description This is a parallel corpus, which contains 10,000 Pashto words translated into English by two different translators. The source texts have been collected from the following news websites: Azadiradio, Mashaal and Voice of America Pashto.
Languages
  • English (eng)
  • Pushto (pus)
TRAD Pashto-English Parallel corpus of transcribed Broadcast News Speech - Test data 575 Kb This is a parallel corpus, which contains 10,000 Pashto words translated into English. The source texts come from 3 bro… English (eng); Pushto (pus) … ELRA-W0095 Details
TRAD Pashto-English Parallel corpus of transcribed Broadcast News Speech - Test data
Name TRAD Pashto-English Parallel corpus of transcribed Broadcast News Speech - Test data (ELRA-W0095)
URL http://catalog.elra.info/product_info.php?products_id=1269
Description This is a parallel corpus, which contains 10,000 Pashto words translated into English. The source texts come from 3 broadcast news transcriptions of the TRAD Pashto Broadcast News Speech Corpus (ELRA-S0381).
Languages
  • English (eng)
  • Pushto (pus)
TRAD Pashto-French News Articles Parallel corpus 970 Kb This is a parallel corpus, which contains 10,000 Pashto words translated into French by two different translators. The … French (fre); Pushto (pus) … ELRA-W0096 Details
TRAD Pashto-French News Articles Parallel corpus
Name TRAD Pashto-French News Articles Parallel corpus (ELRA-W0096)
URL http://catalog.elra.info/product_info.php?products_id=1270
Description This is a parallel corpus, which contains 10,000 Pashto words translated into French by two different translators. The source texts have been collected from the following news websites: Azadiradio, Mashaal and Voice of America Pashto.
Languages
  • French (fre)
  • Pushto (pus)
TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Test data 29 Mb This is a parallel corpus, which contains 10,000 Pashto words translated into French by two different translators. The … French (fre); Pushto (pus) … ELRA-W0094 Details
TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Test data
Name TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Test data (ELRA-W0094)
URL http://catalog.elra.info/product_info.php?products_id=1268
Description This is a parallel corpus, which contains 10,000 Pashto words translated into French by two different translators. The source texts come from 3 broadcast news transcriptions of the TRAD Pashto Broadcast News Speech Corpus (ELRA-S0381).
Languages
  • French (fre)
  • Pushto (pus)
TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Training data 473 Mb This corpus consists of the transcription of 106 hours of recordings in Pashto from the TRAD Pashto Broadcast News Spee… French (fre); Pushto (pus) … ELRA-W0093 Details
TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Training data
Name TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech - Training data (ELRA-W0093)
URL http://catalog.elra.info/product_info.php?products_id=1267
Description This corpus consists of the transcription of 106 hours of recordings in Pashto from the TRAD Pashto Broadcast News Speech Corpus (ELRA-S0381) translated into French. It contains about 832,000 source words and 747,000 target words.
Languages
  • French (fre)
  • Pushto (pus)
TSNLP (Test Suites for NLP Testing) 4.5 Mb Test Suites for Natural Language Processing. 4,000 test items (sentences or fragments of sentences) in English, French… English (eng); French (fre); German (ger)… ELRA-W0013 Details
TSNLP (Test Suites for NLP Testing)
Name TSNLP (Test Suites for NLP Testing) (ELRA-W0013)
URL http://catalog.elra.info/product_info.php?products_id=51
Description Test Suites for Natural Language Processing. 4,000 test items (sentences or fragments of sentences) in English, French & German, useful for NL system evaluation.
Languages
  • English (eng)
  • French (fre)
  • German (ger)
Tagged text in French (MEMODATA) with rules of morphological disambiguation 3.1 Gb More than 170 books (classical novels, legal texts...) are tagged with rules of morphological disambiguation. A tagged … French (fre) ELRA-W0012 Details
Tagged text in French (MEMODATA) with rules of morphological disambiguation
Name Tagged text in French (MEMODATA) with rules of morphological disambiguation (ELRA-W0012)
URL http://catalog.elra.info/product_info.php?products_id=50
Description More than 170 books (classical novels, legal texts...) are tagged with rules of morphological disambiguation. A tagged corpus of 50 books is available for research. It comprises works by several authors of the 19th century (Balzac, Hugo, Stendhal). See also W0011.
Languages French (fre)
Tagged text in French (MEMODATA) with typographic tags 247 Mb Over 170 (tagged) French books (classical novels, legal texts) with typographic tags. Another tagged corpus of 50 books… French (fre) ELRA-W0011 Details
Tagged text in French (MEMODATA) with typographic tags
Name Tagged text in French (MEMODATA) with typographic tags (ELRA-W0011)
URL http://catalog.elra.info/product_info.php?products_id=49
Description Over 170 (tagged) French books (classical novels, legal texts) with typographic tags. Another tagged corpus of 50 books is available for research only. The books are by authors of the 19th century. See also W0012.
Languages French (fre)
Text corpus of "Le Monde" (1987-2012) 3.9 Gb Corpus from "Le Monde" newspaper. Each year contains some 10 Mbytes of data per month (circa 120 Mbytes per year). Data… French (fre) ELRA-W0015 Details
Text corpus of "Le Monde" (1987-2012)
Name Text corpus of "Le Monde" (1987-2012) (ELRA-W0015)
URL http://catalog.elra.info/product_info.php?products_id=438
Description Corpus from "Le Monde" newspaper. Each year contains some 10 Mbytes of data per month (circa 120 Mbytes per year). Data ranging from 1987 until 2012 are available (total 1,199,143 articles).
Languages French (fre)
The CINTIL Corpus International Corpus of Portuguese 20 Mb CINTIL-Corpus Internacional do Português is a linguistically interpreted written and spoken corpus of European Portugue… Portuguese (por) ELRA-W0050 Details
The CINTIL Corpus International Corpus of Portuguese
Name The CINTIL Corpus International Corpus of Portuguese (ELRA-W0050)
URL http://catalog.elra.info/product_info.php?products_id=1102
Description CINTIL-Corpus Internacional do Português is a linguistically interpreted written and spoken corpus of European Portuguese. It is composed of one million annotated tokens, each one of which verified by human expert annotators. The annotation comprises information on part-of-speech, open class lemma and inflection, multi-word expressions pertaining to the class of adverbs and to the closed POS classes, and multi-word proper names (for named entity recognition). The corpus is developed over raw textual materials of several types, of which 30% are spoken materials.
Languages Portuguese (por)
The EMILLE/CIIL Corpus 1.5 Gb The EMILLE/CIIL Corpus consists of monolingual corpora containing approximately 92,799,000 words for 14 South Asian lan… Urdu (urd); Telugu (tel); Tamil (tam); Si… ELRA-W0037 Details
The EMILLE/CIIL Corpus
Name The EMILLE/CIIL Corpus (ELRA-W0037)
URL http://catalog.elra.info/product_info.php?products_id=696
Description The EMILLE/CIIL Corpus consists of monolingual corpora containing approximately 92,799,000 words for 14 South Asian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu), including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu, and a parallel corpus of 200,000 words in English with translations in Hindi, Bengali, Punjabi, Gujarati and Urdu. Annotations include Urdu monolingual and parallel corpora automatically annotated for parts-of-speech, and 20 written Hindi corpus files annotated to show the nature of demonstrative use. All other components are annotated at the sentence level. The corpus is marked up using CES-compliant SGML and encoded using Unicode. This database is available for research use by academic organisations only. For use by commercial organisations, a subset of the EMILLE/CIIL Corpus is available under the reference ELRA-W0038 (The EMILLE Lancaster Corpus).
Languages
  • Urdu (urd)
  • Telugu (tel)
  • Tamil (tam)
  • Sinhalese (sin)
  • Panjabi, Punjabi (pan)
  • Oriya (ori)
  • Marathi (mar)
  • Malayalam (mal)
  • Kashmiri (kas)
  • Kannada (kan)
  • Hindi (hin)
  • Gujarati (guj)
  • Bengali (ben)
  • Assamese (asm)
The Lancaster Corpus of Mandarin Chinese (LCMC) 45 Mb The Lancaster Corpus of Mandarin Chinese (LCMC) sampled 15 written text categories including news, literary texts, acad… Chinese (chi) ELRA-W0039 Details
The Lancaster Corpus of Mandarin Chinese (LCMC)
Name The Lancaster Corpus of Mandarin Chinese (LCMC) (ELRA-W0039)
URL http://catalog.elra.info/product_info.php?products_id=715
Description The Lancaster Corpus of Mandarin Chinese (LCMC) sampled 15 written text categories, including news, literary texts, academic prose and official documents, published in P. R. China in the early 1990s, for a total of approximately 1 million words. The same sampling frame and period as FLOB/FROWN were used in LCMC. The corpus is encoded in Unicode (UTF-8) and marked up in XML.
Languages Chinese (chi)
Venice Italian Treebank (VIT) 149 Mb The VIT, Venice Italian Treebank contains about 272,000 words distributed over six different domains: bureaucratic, pol… Italian (ita) ELRA-W0040 Details
Venice Italian Treebank (VIT)
Name Venice Italian Treebank (VIT) (ELRA-W0040)
URL http://catalog.elra.info/product_info.php?products_id=831
Description The VIT (Venice Italian Treebank) contains about 272,000 words distributed over six different domains: bureaucratic, political, economic and financial, literary, scientific, and news. In addition, some 60,000 tokens of spoken dialogues in different Italian varieties were annotated. The annotation follows general X-bar criteria with 29 constituency labels and 102 PoS tags. VIT is also made available in a broad annotation version with 10 constituency labels and 22 PoS tags for machine learning purposes. The format is plain text with square bracketing. A UPenn-style version, readable by the open source query language CorpusSearch, is also provided.
Languages Italian (ita)
Wolverhampton Business English Corpus 118 Mb Produced by the Computational Linguistics Group at University of Wolverhampton through funding from ELRA in the frame… English (eng) ELRA-W0028 Details
Wolverhampton Business English Corpus
Name Wolverhampton Business English Corpus (ELRA-W0028)
URL http://catalog.elra.info/product_info.php?products_id=627
Description Produced by the Computational Linguistics Group at University of Wolverhampton through funding from ELRA in the framework of the European Commission project LRsP&P (Language Resources Production & Packaging - LE4-8335), the Business English Corpus consists of 10,186,259 words collected from 23 different Web sites related to business.
Languages English (eng)
deL1L2IM corpus 2.8 Mb The deL1L2IM corpus is composed of 72 dialogues, each of them having a duration of 20 to 45 minutes. The whole corpus c… German (ger) ELRA-W0083 Details
deL1L2IM corpus
Name deL1L2IM corpus (ELRA-W0083)
URL http://catalog.elra.info/product_info.php?products_id=1243
Description The deL1L2IM corpus is composed of 72 dialogues, each lasting 20 to 45 minutes. The whole corpus contains ca. 52,000 words and 4,800 messages and has a file size of 0.5 Mb. Nine pairs of participants, i.e. nine learners and four native speakers, were required, with 8 dialogues per pair. The interactions have undergone linguistic analysis, with annotation performed only on repair/correction sequences (incomplete learner error annotation). The corpus is delivered in one written text file (in XML format, customized under TEI P5).
Languages German (ger)