1 |
Machine translation of user-generated content
|
|
Lohar, Pintu. Dublin City University, School of Computing, 2020; Dublin City University, ADAPT, 2020
|
|
In: Lohar, Pintu (2020) Machine translation of user-generated content. PhD thesis, Dublin City University. (2020)
|
|
Abstract:
Social media has evolved considerably over the last few years. With the spread of social media and online forums, individual users all over the world actively generate online content in different languages. Sharing online content has become much easier with the advent of popular websites such as Twitter and Facebook. Such content is referred to as ‘User-Generated Content’ (UGC). Examples of UGC include user reviews, customer feedback, and tweets. In general, UGC is informal and noisy in terms of linguistic norms. Such noise does not create significant problems for humans trying to understand the content, but it can pose challenges for several natural language processing applications such as parsing, sentiment analysis, and machine translation (MT). An additional challenge for MT is the sparseness of bilingual (translated) parallel UGC corpora. In this research, we explore the general issues in MT of UGC and set some research goals based on our findings. One of our main goals is to exploit comparable corpora in order to extract parallel or semantically similar sentences. To accomplish this task, we design a document alignment system that extracts semantically similar bilingual document pairs from bilingual comparable corpora. We then apply strategies to extract parallel or semantically similar sentences from comparable corpora by transforming the document alignment system into a sentence alignment system. We seek to improve the quality of parallel data extraction for UGC translation and combine the extracted data with the existing human-translated resources. Another objective of this research is to demonstrate the usefulness of MT-based sentiment analysis. However, when using openly available systems such as Google Translate, the translation process may alter the sentiment in the target language. To cope with this phenomenon, we instead build fine-grained sentiment translation models that focus on sentiment preservation in the target language during translation.
|
|
Keywords:
Computational linguistics; Information retrieval; Machine learning; Machine translating; Translating and interpreting; User-Generated Content
|
|
URL: http://doras.dcu.ie/24988/
|
|
2 |
FooTweets: a bilingual parallel corpus of World Cup tweets
|
|
|
|
In: Sluyter-Gäthje, Henny, Lohar, Pintu, Afli, Haithem orcid:0000-0002-7449-4707 and Way, Andy orcid:0000-0001-5736-5930 (2018) FooTweets: a bilingual parallel corpus of World Cup tweets. In: LREC 2018 - 11th International Conference on Language Resources and Evaluation, 7-12 May 2018, Miyazaki, Japan. ISBN 979-10-95546-00-9 (2018)
|
|
3 |
Balancing translation quality and sentiment preservation
|
|
|
|
In: Lohar, Pintu, Afli, Haithem orcid:0000-0002-7449-4707 and Way, Andy orcid:0000-0001-5736-5930 (2018) Balancing translation quality and sentiment preservation. In: AMTA 2018, 17-21 Mar 2018, Boston, MA, USA. (2018)
|
|
4 |
Sentiment translation for low resourced languages: experiments on Irish general election Tweets
|
|
|
|
In: Afli, Haithem orcid:0000-0002-7449-4707, Maguire, Sorcha and Way, Andy orcid:0000-0001-5736-5930 (2017) Sentiment translation for low resourced languages: experiments on Irish general election Tweets. In: 18th International Conference on Computational Linguistics and Intelligent Text Processing, 17-21 Apr 2017, Budapest, Hungary. (2017)
|
|
5 |
MultiNews: a web collection of an aligned multimodal and multilingual corpus
|
|
|
|
In: Afli, Haithem orcid:0000-0002-7449-4707, Lohar, Pintu and Way, Andy orcid:0000-0001-5736-5930 (2017) MultiNews: a web collection of an aligned multimodal and multilingual corpus. In: Workshop on Curation and Applications of Parallel and Comparable Corpora, 27 Nov-1 Dec 2017, Taipei, Taiwan. ISBN 978-1-948087-05-6 (2017)
|
|
6 |
Maintaining sentiment polarity in translation of user-generated content
|
|
|
|
In: Lohar, Pintu, Afli, Haithem orcid:0000-0002-7449-4707 and Way, Andy orcid:0000-0001-5736-5930 (2017) Maintaining sentiment polarity in translation of user-generated content. Prague Bulletin of Mathematical Linguistics (108). pp. 73-84. ISSN 1804-0462 (2017)
|
|
7 |
Identifying effective translations for cross-lingual Arabic-to-English user-generated speech search
|
|
|
|
In: Khwileh, Ahmad, Afli, Haithem orcid:0000-0002-7449-4707, Jones, Gareth J.F. orcid:0000-0003-2923-8365 and Way, Andy orcid:0000-0001-5736-5930 (2017) Identifying effective translations for cross-lingual Arabic-to-English user-generated speech search. In: Third Arabic Natural Language Processing Workshop (WANLP), 3 Apr 2017, Valencia, Spain. (2017)
|
|
8 |
Dublin City University participation in the VTT track at TRECVid 2017
|
|
|
|
In: Afli, Haithem orcid:0000-0002-7449-4707, Hu, Feiyan orcid:0000-0001-7451-6438, Du, Jinhua orcid:0000-0002-3267-4881, Cosgrove, Daniel, McGuinness, Kevin orcid:0000-0003-1336-6477, O'Connor, Noel E. orcid:0000-0002-4033-9135, Arazo Sánchez, Eric, Zhou, Jiang orcid:0000-0002-3067-8512 and Smeaton, Alan F. orcid:0000-0003-1028-8389 (2017) Dublin City University participation in the VTT track at TRECVid 2017. In: TRECVid workshop, 13-15 Nov 2017, Gaithersburg, Md., USA. (2017)
|
|
9 |
Identifying effective translations for cross-lingual Arabic-to-English user-generated speech search
|
|
|
|
In: Khwileh, Ahmad, Afli, Haithem orcid:0000-0002-7449-4707, Jones, Gareth J.F. orcid:0000-0003-2923-8365 and Way, Andy orcid:0000-0001-5736-5930 (2017) Identifying effective translations for cross-lingual Arabic-to-English user-generated speech search. In: Proceedings of The Third Arabic Natural Language Processing Workshop (WANLP), 3-4 Apr 2017, Valencia, Spain. (2017)
|
|
10 |
Maintaining Sentiment Polarity in Translation of User-Generated Content
|
|
|
|
In: Prague Bulletin of Mathematical Linguistics, Vol 108, Iss 1, Pp 73-84 (2017)
|
|
11 |
The ADAPT bilingual document alignment system at WMT16
|
|
|
|
In: Lohar, Pintu, Afli, Haithem orcid:0000-0002-7449-4707, Liu, Chao-Hong orcid:0000-0002-1235-6026 and Way, Andy orcid:0000-0001-5736-5930 (2016) The ADAPT bilingual document alignment system at WMT16. In: First Conference on Machine Translation (WMT16), 11-12 Aug 2016, Berlin, Germany. (2016)
|
|
12 |
FaDA: fast document aligner using word embedding
|
|
|
|
In: Lohar, Pintu, Ganguly, Debasis orcid:0000-0003-0050-7138, Afli, Haithem orcid:0000-0002-7449-4707, Way, Andy orcid:0000-0001-5736-5930 and Jones, Gareth J.F. orcid:0000-0003-2923-8365 (2016) FaDA: fast document aligner using word embedding. Prague Bulletin of Mathematical Linguistics (106). pp. 169-179. ISSN 1804-0462 (2016)
|
|
13 |
Using SMT for OCR error correction of historical texts
|
|
|
|
In: Afli, Haithem orcid:0000-0002-7449-4707, Qui, Zhengwei, Way, Andy orcid:0000-0001-5736-5930 and Sheridan, Páraic (2016) Using SMT for OCR error correction of historical texts. In: Tenth International Conference on Language Resources and Evaluation (LREC 2016), 23-28 May 2016, Portorož, Slovenia. ISBN 978-2-9517408-9-1 (2016)
|
|
14 |
Integrating optical character recognition and machine translation of historical documents
|
|
|
|
In: Afli, Haithem orcid:0000-0002-7449-4707 and Way, Andy orcid:0000-0001-5736-5930 (2016) Integrating optical character recognition and machine translation of historical documents. In: COLING, the 26th International Conference on Computational Linguistics, 13-16 Dec 2016, Osaka, Japan. (2016)
|
|
15 |
From Arabic user-generated content to machine translation: integrating automatic error correction
|
|
|
|
In: Afli, Haithem orcid:0000-0002-7449-4707, Aransa, Walid, Lohar, Pintu and Way, Andy orcid:0000-0001-5736-5930 (2016) From Arabic user-generated content to machine translation: integrating automatic error correction. In: 17th International Conference on Intelligent Text Processing and Computational Linguistics, 3-9 Apr 2016, Konya, Turkey. (2016)
|
|
16 |
Dublin City University and partners’ participation in the INS and VTT tracks at TRECVid 2016
|
|
|
|
In: Marsden, Mark, Mohedano, Eva, McGuinness, Kevin orcid:0000-0003-1336-6477, Calafell, Andrea, Giró-i-Nieto, Xavier orcid:0000-0002-9935-5332, O'Connor, Noel E. orcid:0000-0002-4033-9135, Zhou, Jiang orcid:0000-0002-3067-8512, Azevedo, Lucas, Daudert, Tobias, Davis, Brian, Hurlimann, Manuela, Afli, Haithem orcid:0000-0002-7449-4707, Du, Jinhua, Ganguly, Debasis orcid:0000-0003-0050-7138, Li, Wei B. orcid:0000-0001-7347-3501, Way, Andy orcid:0000-0001-5736-5930 and Smeaton, Alan F. orcid:0000-0003-1028-8389 (2016) Dublin City University and partners’ participation in the INS and VTT tracks at TRECVid 2016. In: TRECVid Conference, 14-16 Nov 2016, Gaithersburg, Md., USA. (2016)
|
|
17 |
Dublin City University and Partners' participation in the INS and VTT Tracks at TRECVid 2016
|
|
|
|
18 |
FaDA: Fast Document Aligner using Word Embedding
|
|
|
|
In: Prague Bulletin of Mathematical Linguistics, Vol 106, Iss 1, Pp 169-179 (2016)
|
|
19 |
OCR Error Correction Using Statistical Machine Translation
|
|
|
|
In: 16th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2015), 2015, Cairo, Egypt. https://hal.archives-ouvertes.fr/hal-01433200 (2015)
|
|