4 |
Chinese Web 5-gram Version 1 ...
Abstract:
Introduction

Chinese Web 5-gram Version 1, Linguistic Data Consortium (LDC) catalog number LDC2010T06 and ISBN 1-58563-539-1, was created by researchers at Google Inc. It consists of Chinese word n-grams and their observed frequency counts, generated from over 800 billion tokens of text. The n-grams range in length from unigrams (single words) to 5-grams. The data should be useful for statistical language modeling (e.g., segmentation, machine translation) as well as for other purposes. Included with this publication is a simple segmenter, written in Perl, that uses the same algorithm used to generate the data.

Data Collection

N-gram counts were generated from approximately 883 billion word tokens of text from publicly accessible web pages. The data set contains only n-grams that appeared at least 40 times in the processed sentences; less frequent n-grams were discarded. While the aim was to identify and collect only Chinese-language pages, some text from other languages ...
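The counting scheme described above (all n-grams up to length 5, with a minimum-frequency cutoff of 40) can be sketched in a few lines. This is only an illustrative reconstruction, not Google's actual pipeline; the function name `ngram_counts` and its parameters are hypothetical, and a real run over web-scale text would stream counts rather than hold them in memory.

```python
from collections import Counter

def ngram_counts(tokens, max_n=5, min_count=1):
    """Count all n-grams of length 1..max_n over a token list,
    then drop n-grams observed fewer than min_count times.
    (Illustrative sketch; the LDC data used min_count=40.)"""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}
```

For example, over a toy segmented sentence such as `["我", "爱", "你", "我", "爱"]` with `max_n=2`, the unigram `("我",)` and the bigram `("我", "爱")` each get count 2, and setting `min_count=2` would discard the singleton bigram `("爱", "你")`, mirroring the 40-occurrence threshold applied to the real data.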
DOI: https://dx.doi.org/10.35111/647p-yt29
Catalog: https://catalog.ldc.upenn.edu/LDC2010T06
5 |
Large-scale semi-supervised learning for natural language processing
Bergsma, Shane A. - University of Alberta, Department of Computing Science, 2010