
Search in the Catalogues and Directories

Hits 1 – 20 of 36

1
Abstracts from the KAS corpus KAS-Abs 2.0
Žagar, Aleš; Kavaš, Matic; Robnik-Šikonja, Marko. - : Faculty of Electrical Engineering and Computer Science, University of Maribor, 2022. : Faculty of Computer and Information Science, University of Ljubljana, 2022
BASE
2
Corpus of academic Slovene KAS 2.0
Žagar, Aleš; Kavaš, Matic; Robnik-Šikonja, Marko; Erjavec, Tomaž; Fišer, Darja; Ljubešić, Nikola; Ferme, Marko; Borovič, Mladen; Boškovič, Borko; Ojsteršek, Milan; Hrovat, Goran. - : Faculty of Electrical Engineering and Computer Science, University of Maribor, 2022. : Faculty of Computer and Information Science, University of Ljubljana, 2022
Abstract: The KAS corpus of Slovene academic writing consists of almost 65,000 BSc/BA, 16,000 MSc/MA and 1,600 PhD theses (82 thousand texts, 5 million pages or 1.5 billion tokens) written between 2000 and 2018 and gathered from the digital libraries of Slovene higher education institutions via the Slovene Open Science portal (http://openscience.si/). The theses come with substantial metadata, and each thesis in the corpus contains only its textual body, i.e. without front and back matter. The body is divided into chapters, then into pages, these into paragraphs, and then into sentences. The sentence tokens are tagged with morphosyntactic descriptions (detailed part-of-speech tags) and the words are lemmatised. Compared with the previous version 1.0, the KAS corpus of Slovene academic writing 2.0 is cleaner and is segmented into chapters. The metadata also contains more information about the research fields of each work. Both versions consist of the same number of BSc/BA, MSc/MA, and PhD theses; however, the processing was done from scratch for 2.0, so the number of e.g. pages and tokens is different. Note also that the new version does not contain links to the PNG pictures of individual pages, nor does it contain annotated terms, both of which are present in version 1.0. Unlike 1.0, it is also not mounted on the CLARIN.SI concordancers. The new version is distributed in the canonical TEI encoding, as JSON, and as plain text files. In the TEI format, chapter names are denoted with the tag. Each entry in the JSON files has a string ID and a list containing the names of chapters as its first element and their texts as its second element. Chapters without text are represented as an empty string. The plain text files contain only text bodies without segmentation information. References: Žagar, A., Kavaš, M., & Robnik Šikonja, M. (2021). Corpus KAS 2.0: cleaner and with new datasets. In Information Society - IS 2021: Proceedings of the 24th International Multiconference.
https://doi.org/10.5281/zenodo.5562228
Keyword: academic writing; BSc/BA theses; MSc/MA theses; PhD theses; TEI
URL: http://hdl.handle.net/11356/1448
BASE
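The JSON layout described in the abstract above (a string ID paired with a list whose first element holds the chapter names and whose second holds the chapter texts) could be read roughly as follows. This is only a minimal sketch: it assumes each file is a JSON object keyed by ID and that names and texts are parallel lists, neither of which the abstract states explicitly. The ID `kas-0001`, the chapter names, and the helper `iter_chapters` are hypothetical.

```python
import json

# Hypothetical sample mimicking the layout described in the abstract:
# each entry maps a string ID to [chapter_names, chapter_texts].
# The ID and the chapter contents are made up for illustration.
sample = """
{
  "kas-0001": [
    ["Uvod", "Metode", "Zaključek"],
    ["Besedilo uvoda.", "", "Besedilo zaključka."]
  ]
}
"""


def iter_chapters(entries):
    """Yield (thesis_id, chapter_name, chapter_text) triples."""
    for thesis_id, (names, texts) in entries.items():
        # Assumption: names and texts are parallel lists of equal length.
        for name, text in zip(names, texts):
            yield thesis_id, name, text


entries = json.loads(sample)
chapters = list(iter_chapters(entries))

# Chapters without text are stored as empty strings, so they can be skipped.
non_empty = [chapter for chapter in chapters if chapter[2] != ""]
print(len(chapters), len(non_empty))
```

If the actual files are structured differently (for example, a top-level list of entries instead of an object), only the loop in `iter_chapters` would need to change.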
3
Summarization datasets from the KAS corpus KAS-Sum 1.0
Žagar, Aleš; Kavaš, Matic; Robnik-Šikonja, Marko. - : Faculty of Electrical Engineering and Computer Science, University of Maribor, 2022. : Faculty of Computer and Information Science, University of Ljubljana, 2022
BASE
4
Machine Translation datasets from the KAS corpus KAS-MT 1.0
Žagar, Aleš; Kavaš, Matic; Robnik-Šikonja, Marko. - : Faculty of Electrical Engineering and Computer Science, University of Maribor, 2022. : Faculty of Computer and Information Science, University of Ljubljana, 2022
BASE
5
The ParlaMint corpora of parliamentary proceedings
BASE
6
The ParlaMint corpora of parliamentary proceedings
In: Lang Resour Eval (2022)
BASE
7
Offensive language dataset of Croatian, English and Slovenian comments FRENK 1.0
Ljubešić, Nikola; Fišer, Darja; Erjavec, Tomaž. - : Jožef Stefan Institute, 2021
BASE
8
Offensive language dataset of Croatian, English and Slovenian comments FRENK 1.1
Ljubešić, Nikola; Fišer, Darja; Erjavec, Tomaž. - : Jožef Stefan Institute, 2021
BASE
9
Abstracts from the KAS corpus KAS-Abs 1.0
Erjavec, Tomaž; Fišer, Darja; Ljubešić, Nikola. - : Jožef Stefan Institute, 2021. : Faculty of Electrical Engineering and Computer Science, University of Maribor, 2021
BASE
10
English-Slovene term candidates KAS-biterm 1.0
Erjavec, Tomaž; Ljubešić, Nikola; Fišer, Darja. - : Jožef Stefan Institute, 2020
BASE
11
Corpus of academic Slovene KAS 1.0
Erjavec, Tomaž; Fišer, Darja; Ljubešić, Nikola. - : Jožef Stefan Institute, 2019. : Faculty of Electrical Engineering and Computer Science, University of Maribor, 2019
BASE
12
CMC training corpus Janes-Tag 2.1
Erjavec, Tomaž; Fišer, Darja; Čibej, Jaka. - : Jožef Stefan Institute, 2019
BASE
13
Corpus of Academic Slovene (PhD theses) KAS-dr 1.0
Erjavec, Tomaž; Fišer, Darja; Ljubešić, Nikola. - : Jožef Stefan Institute, 2019. : Faculty of Electrical Engineering and Computer Science, University of Maribor, 2019
BASE
14
Corpus of Academic Slovene (MSc/MA theses) KAS-mag 1.0
Erjavec, Tomaž; Fišer, Darja; Ljubešić, Nikola. - : Jožef Stefan Institute, 2019. : Faculty of Electrical Engineering and Computer Science, University of Maribor, 2019
BASE
15
Corpus of Academic Slovene (BSc/BA theses) KAS-dipl 1.0
Erjavec, Tomaž; Fišer, Darja; Ljubešić, Nikola. - : Jožef Stefan Institute, 2019. : Faculty of Electrical Engineering and Computer Science, University of Maribor, 2019
BASE
16
Dictionary of Twitterese Janes-Dict 1.0
Gantar, Polona; Škrjanec, Iza; Fišer, Darja. - : Faculty of Arts, University of Ljubljana, 2018
BASE
17
Dataset and baseline model of moderated content FRENK-MMC-RTV 1.0
Ljubešić, Nikola; Erjavec, Tomaž; Fišer, Darja. - : Jožef Stefan Institute, 2018
BASE
18
Bilingual terminology extraction dataset KAS-biterm 1.0
Erjavec, Tomaž; Fišer, Darja; Ljubešić, Nikola. - : Jožef Stefan Institute, 2018
BASE
19
Terminology identification dataset KAS-term 1.0
Erjavec, Tomaž; Fišer, Darja; Ljubešić, Nikola. - : Jožef Stefan Institute, 2018
BASE
20
Dataset and baseline model of moderated content FRENK-STYRIA-24sata 1.0
Ljubešić, Nikola; Erjavec, Tomaž; Fišer, Darja. - : Jožef Stefan Institute, 2018
BASE

© 2013 - 2024 Lin|gu|is|tik