Search in the Catalogues and Directories

Hits 1 – 19 of 19

1
Sentence Boundary Extraction from Scientific Literature of Electric Double Layer Capacitor Domain: Tools and Techniques
In: Applied Sciences; Volume 12; Issue 3; Pages: 1352 (2022)
BASE
2
Free Software Tools for Computational Linguistics: An Overview ...
Đurić D., Miloš. - : Zenodo, 2021
BASE
3
Free Software Tools for Computational Linguistics: An Overview ...
Đurić D., Miloš. - : Zenodo, 2021
BASE
4
Free Software Tools for Computational Linguistics: An Overview ...
Đurić D., Miloš. - : Zenodo, 2020
BASE
5
Free Software Tools for Computational Linguistics: An Overview ...
Đurić D., Miloš. - : Zenodo, 2020
BASE
6
A Framework for the Eurasian Latin Archive using CLTK and NLTK ...
BASE
7
A Framework for the Eurasian Latin Archive using CLTK and NLTK ...
BASE
8
Survey on Sentiment Analysis Using Machine Learning ...
BASE
9
Survey on Sentiment Analysis Using Machine Learning ...
BASE
10
Evaluating morphosyntactic differences in narrative re-tell tasks between bilingual children with and without language impairment using computational methods
BASE
11
SNET : a statistical normalisation method for Twitter
BASE
12
SNET : a statistical normalisation method for Twitter
Sosamphan, Phavanh. - : Unitec Institute of Technology, 2016
BASE
13
SNET : a statistical normalisation method for Twitter
Abstract: One of the major problems in the era of big data is how to ‘clean’ the vast amount of data on the Internet, particularly data from the micro-blogging website Twitter. Twitter enables people to connect with friends, colleagues, or even people they have never met. Twitter, one of the world’s biggest social media networks, has around 316 million users and 500 million tweets posted per day (Twitter, 2016). Undoubtedly, social media networks create huge opportunities for businesses to build relationships with customers, gain more insight into them, and deliver more value to them. Despite these advantages, the comments posted on Twitter (called tweets) may not be useful if they contain irrelevant or incomprehensible information, which makes them difficult to analyse. Tweets are commonly written in ‘ill-forms’, such as abbreviations, repeated characters, and misspelled words. These ‘noisy tweets’ pose a text normalisation challenge: selecting the proper methods to detect them and convert them into the most accurate English sentences. Several existing text cleaning techniques have been proposed to solve these issues; however, they have limitations and still do not achieve good results overall. In this research, our aim is to propose the SNET, a statistical normalisation method for cleaning noisy tweets at character level (tweets which contain abbreviations, repeated letters, and misspelled words) that combines different techniques to achieve more accurate and clean data. To clean noisy tweets, existing techniques are evaluated in order to find the best combination of techniques that solves all of these problems with high accuracy. This research proposes that abbreviations are converted to their standard form using an abbreviation dictionary lookup, while repeated characters are normalised using the Natural Language Toolkit (NLTK) platform and a dictionary-based approach.
Besides the NLTK, the edit distance algorithm is also used to correct misspellings, while the “Enchant” dictionary can be used to access the spell checking library. Furthermore, existing models, such as a spell corrector, can be deployed for conversion purposes, while a text cleanser is advanced as superior for comparing the SNET with a baseline model. In experiments on a Twitter sample dataset, our results show that the SNET achieves 88% in the Bilingual Evaluation Understudy (BLEU) score and 7% in the word error rate (WER) score, both of which are better than the baseline model. Devising such a method to clean tweets can make a great contribution through its adoption in brand sentiment analysis or opinion mining, political analysis, and other applications seeking to make sound predictions.
Keyword: 080109 Pattern Recognition and Data Mining; 150502 Marketing Communications; abbreviations; big data; data normalisation; micro-blogs; Natural Language Toolkit (NLTK); noisy tweets; normalisation; social media; spell checkers; text cleansers; tweets; Twitter
URL: https://hdl.handle.net/10652/3508
BASE
14
Computational Linguistic Analysis of Earthquake Collections
BASE
15
Automatic Text Analysis Using Drupal
In: Computer Engineering (2013)
BASE
16
Multidisciplinary Instruction with the Natural Language Toolkit
In: TEACH CL-08 Third Workshop on Issues in Teaching Computational Linguistics (2008)
BASE
17
Managing Fieldwork Data with Toolbox and the Natural Language Toolkit
Robinson, Stuart; Aumann, Greg; Bird, Steven. - : University of Hawai'i Press, 2007
BASE
18
Managing Fieldwork Data with Toolbox and the Natural Language Toolkit
In: Language Documentation & Conservation, Vol 1, Iss 1 (2007)
BASE
19
Desenvolupament d'un assistent de redacció per l'anglès com a llengua estrangera
BASE
