google-ngram-downloader

Google published new ngrams, 20200217

Open lahosken opened this issue 3 years ago • 2 comments

https://storage.googleapis.com/books/ngrams/books/datasetsv3.html . As an example URL, one file of ngrams is at http://storage.googleapis.com/books/ngrams/books/20200217/eng/1-00016-of-00024.gz

lahosken, Jun 05 '21 21:06

Indeed, I tried to debug. Necessary code changes seem to be restricted to util.py. However, new problems arise. Let me use German v3 2-grams as a reference (version 20200217). The challenges are:

  1. No variable to pick version in the code right now.
  2. Different naming scheme for URLs from which the data is downloaded, see your ...00016-of-00024.gz URL above
  3. Google n-gram v3 line structure seems to have changed as compared to v2.

Changing version = '20120701' in def iter_google_store(...) in util.py to another value exposes a new bug: the file template no longer matches. For v3 it should be FILE_TEMPLATE_GER_NEW = '{ngram_len}-{index}-of-{full_number}.gz' instead of
FILE_TEMPLATE = 'googlebooks-{lang}-all-{ngram_len}gram-{version}-{index}.gz'. The assert len(data) == 4 in def readline_google_store also has to be commented out.
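The naming change can be sketched as follows. This is only an illustration modeled on the example URL from the first post; `V3_URL_TEMPLATE` and `v3_url` are hypothetical names that do not exist in util.py, and the zero-based, five-digit index is an assumption read off that one URL.

```python
# Hypothetical template, modeled on the example URL quoted in this thread.
V3_URL_TEMPLATE = (
    'http://storage.googleapis.com/books/ngrams/books/'
    '{version}/{lang}/{ngram_len}-{index:05d}-of-{total:05d}.gz'
)


def v3_url(lang, ngram_len, index, total, version='20200217'):
    """Build the URL of one v3 file.

    Assumption: index is zero-based and total is the number of files
    for this language and n-gram length.
    """
    if not 0 <= index < total:
        raise ValueError('index must satisfy 0 <= index < total')
    return V3_URL_TEMPLATE.format(
        version=version, lang=lang, ngram_len=ngram_len,
        index=index, total=total,
    )


# Reproduces the example URL from the first post.
print(v3_url('eng', 1, 16, 24))
```

Passing the index and total as integers and padding them in the template avoids hand-written strings such as '00181'.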

In def iter_google_store(...) we need to get the full_number right for the proper URL of the files. This number depends on the language lang (german in my case) and ngram_len; for that case it is

#version = '20120701'
version = '20200217' # New: v3
session = requests.Session()

# Case-by-case lookup of the total number of files for each ngram_len
if version == '20200217' and ngram_len == 1:
    full_number = '00008'
elif version == '20200217' and ngram_len == 2:  # was a second `if`, which reset the 1-gram value to 0
    full_number = '00181'
elif version == '20200217' and ngram_len == 3:
    full_number = '01369'
elif version == '20200217' and ngram_len == 4:
    full_number = '01003'
elif version == '20200217' and ngram_len == 5:
    full_number = '02262'
else:
    full_number = 0

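The same lookup could also be written as a table, which makes it harder to get the if/elif chain wrong. A minimal sketch: `V3_FILE_COUNTS` and `full_number_for` are hypothetical names, and the numbers are simply the ones from this comment (German, version 20200217), not independently verified.

```python
# Number of files per n-gram length, copied from the values in this comment
# (German corpus, version 20200217). Hypothetical name, unverified values.
V3_FILE_COUNTS = {
    1: '00008',
    2: '00181',
    3: '01369',
    4: '01003',
    5: '02262',
}


def full_number_for(version, ngram_len):
    """Return the zero-padded file count, or raise if it is not known."""
    if version != '20200217':
        raise ValueError('no file counts recorded for version %r' % version)
    try:
        return V3_FILE_COUNTS[ngram_len]
    except KeyError:
        raise ValueError('no file count recorded for ngram_len=%r' % ngram_len)


print(full_number_for('20200217', 2))
```

Raising on an unknown combination, instead of falling back to 0, surfaces a bad URL before any download is attempted.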
Printing a line of the old version (20120701) yields 0 0005_NUM 1901 1 1, which is 4 fields (n-gram, year, count, publications), as asserted in the code and mentioned in the documentation. The first line of the new version has 29 entries, though. It took me some time to figure out that, apart from the n-gram itself, these are all year/count/publication triplets, e.g. '1929,1,1', '1930,5,3', etc.
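To make the v3 line layout concrete, here is a small parsing sketch. It assumes the fields are tab-separated, as in v2; `parse_v3_line` is a hypothetical helper, and the sample line is made up for illustration (only the two triplets are taken from the text above).

```python
def parse_v3_line(line):
    """Split one v3 line into the n-gram and its (year, count, pubs) triplets.

    Assumption: fields are separated by tabs and every field after the
    first has the form 'year,count,publications'.
    """
    fields = line.rstrip('\n').split('\t')
    ngram = fields[0]
    triplets = [tuple(int(x) for x in field.split(',')) for field in fields[1:]]
    return ngram, triplets


# Made-up sample line; the n-gram text is invented for this example.
sample = 'ein Beispiel\t1929,1,1\t1930,5,3\n'
ngram, triplets = parse_v3_line(sample)
print(ngram, triplets)
```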

I summed the counts/publications up across years and used the first year of appearance as the year, i.e.

ngram = data[0]
if version == '20200217' and lang == 'ger':
    # v3: every field after the n-gram is a 'year,count,publications' triplet
    triplets = [[int(x) for x in data_loc.split(',')] for data_loc in data[1:]]
    min_year = min(t[0] for t in triplets)
    count = sum(t[1] for t in triplets)
    pubs = sum(t[2] for t in triplets)
    other = [min_year, count, pubs]
else:
    # older versions (v2/v1): n-gram, year, count, publications
    assert len(data) == 4
    other = map(int, data[1:5])

yield Record(ngram, *other)
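The aggregation described above can be tried in isolation with the sketch below. The `Record` namedtuple here is a stand-in with assumed field names, not necessarily the one defined in util.py, and `aggregate_v3` is a hypothetical helper name.

```python
from collections import namedtuple

# Stand-in for the library's Record; the field names are an assumption.
Record = namedtuple('Record', 'ngram year match_count volume_count')


def aggregate_v3(data):
    """Collapse a split v3 line into one Record.

    data[0] is the n-gram, data[1:] are 'year,count,publications' strings.
    The year is the first year of appearance; counts and publications
    are summed over all years.
    """
    if len(data) < 2:
        raise ValueError('expected at least one year,count,publications field')
    years, counts, pubs = [], [], []
    for field in data[1:]:
        year, count, pub = (int(x) for x in field.split(','))
        years.append(year)
        counts.append(count)
        pubs.append(pub)
    return Record(data[0], min(years), sum(counts), sum(pubs))


record = aggregate_v3(['ein Beispiel', '1929,1,1', '1930,5,3'])
print(record)
```

Note that summing discards the per-year breakdown that v2 records carried; yielding one Record per triplet would keep it, at the cost of more records per line.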

However, this only happens for the German v3 n-grams (i.e. version = 20200217).

7shoe, Jun 07 '21 03:06

Thanks for the analysis. I'll have a look at what v3 has to offer.

Pull requests are welcome.

dimazest, Jun 08 '21 00:06