google-ngram-downloader
Google published new ngrams, 20200217
The new datasets are listed at https://storage.googleapis.com/books/ngrams/books/datasetsv3.html. For a URL example, one file of ngrams is at http://storage.googleapis.com/books/ngrams/books/20200217/eng/1-00016-of-00024.gz
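As a quick sanity check of the new URL scheme, the sketch below streams the example shard above and prints its first few raw lines (it uses `requests` and the stdlib `gzip` module only; nothing from this library is assumed):

```python
import gzip
import requests

# Example v3 shard (English 1-grams, file 00016 of 00024), taken from the URL above.
url = 'http://storage.googleapis.com/books/ngrams/books/20200217/eng/1-00016-of-00024.gz'

with requests.get(url, stream=True) as response:
    response.raise_for_status()
    # Decompress the gzip stream on the fly and print the first few lines.
    with gzip.open(response.raw, mode='rt', encoding='utf-8') as lines:
        for i, line in enumerate(lines):
            print(line.rstrip('\n'))
            if i >= 4:
                break
```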
Indeed, I tried to debug. The necessary code changes seem to be restricted to `util.py`. However, new problems arise. Let me use the German v3 2-grams as a reference (version `20200217`).
The challenges are:

- No variable to pick the `version` in the code right now.
- Different naming scheme for the URLs from which the data is downloaded, see the `...00016-of-00024.gz` URL above.
- The Google n-gram `v3` line structure seems to have changed compared to `v2`.
Changing `version = '20120701'` to another version in `def iter_google_store(...)` in `util.py` causes a new bug: the file template doesn't match anymore. It should then be `FILE_TEMPLATE_GER_NEW = '{ngram_len}-{index}-of-{full_number}.gz'` instead of `FILE_TEMPLATE = 'googlebooks-{lang}-all-{ngram_len}gram-{version}-{index}.gz'`, and `assert len(data) == 4` has to be commented out in the function definition of `readline_google_store`.
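For illustration, a quick check of the new template against the v3 URL scheme (the shard index `00016` is just an arbitrary example, `full_number = '00181'` is the German 2-gram shard count used below, and the `ger` language code in the URL is an assumption carried over from v2):

```python
FILE_TEMPLATE_GER_NEW = '{ngram_len}-{index}-of-{full_number}.gz'

# Hypothetical example: German v3 2-grams, shard 00016 of 00181.
fname = FILE_TEMPLATE_GER_NEW.format(ngram_len=2, index='00016', full_number='00181')
print(fname)  # -> 2-00016-of-00181.gz

# The download URL then follows the v3 scheme shown at the top.
url = 'http://storage.googleapis.com/books/ngrams/books/20200217/ger/' + fname
```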
In `def iter_google_store(...)` we also need to get `full_number` right for the proper URL of the files. This number depends on the language `lang` (German in my case) and on `ngram_len`; for that case it is (a more compact dict-based lookup is sketched after the block):
```python
#version = '20120701'
version = '20200217'  # New: v3
session = requests.Session()

# Case-by-case lookup of the total number of shard files for {ngram_len}-grams (German v3)
if version == '20200217' and ngram_len == 1:
    full_number = '00008'
elif version == '20200217' and ngram_len == 2:
    full_number = '00181'
elif version == '20200217' and ngram_len == 3:
    full_number = '01369'
elif version == '20200217' and ngram_len == 4:
    full_number = '01003'
elif version == '20200217' and ngram_len == 5:
    full_number = '02262'
else:
    full_number = 0
```
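The same lookup can also be written as a dictionary keyed by `(version, ngram_len)`, which keeps the shard counts in one place and fails loudly for unsupported combinations. This is only a sketch with the German v3 counts from above; counts for other languages would have to be read off `datasetsv3.html`:

```python
# Shard totals per (version, ngram_len) for German v3, taken from the values above.
FULL_NUMBERS = {
    ('20200217', 1): '00008',
    ('20200217', 2): '00181',
    ('20200217', 3): '01369',
    ('20200217', 4): '01003',
    ('20200217', 5): '02262',
}

def lookup_full_number(version, ngram_len):
    """Return the zero-padded shard total, or raise for unknown combinations."""
    try:
        return FULL_NUMBERS[(version, ngram_len)]
    except KeyError:
        raise ValueError('No shard count known for version={!r}, ngram_len={!r}'
                         .format(version, ngram_len))
```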
Printing a line of the old version (`20120701`) yields `0 0005_NUM 1901 1 1`, which is 4 fields (n-gram, year, count, publication) as asserted in the code and mentioned in the documentation. The first line of the new version has 29 entries, though. It took me some time to figure out that the entries after the n-gram are all year/count/publication triplets, e.g. `1929,1,1`, `1930,5,3`, etc.
I summed the counts/publications up across years and used the first year of appearance as the year, i.e.

```python
ngram = data[0]
if version == '20200217' and lang == 'ger':
    (min_year, count, pubs) = (min([int(data_loc.split(',')[0]) for data_loc in data[1:]]),
                               sum([int(data_loc.split(',')[1]) for data_loc in data[1:]]),
                               sum([int(data_loc.split(',')[2]) for data_loc in data[1:]]))
    other = [min_year, count, pubs]
# older version (v2/v1)
else:
    assert len(data) == 4
    other = map(int, data[1:5])
yield Record(ngram, *other)
```
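To illustrate the aggregation on a made-up v3 line (the n-gram and the triplet values are invented, following the `1929,1,1` format above): splitting on tabs yields the n-gram plus one `year,count,publication` triplet per year, and the block above collapses them into a single record.

```python
# Hypothetical tab-separated v3 line with three year/count/publication triplets.
line = 'der Hund\t1929,1,1\t1930,5,3\t1931,2,2'
data = line.split('\t')

min_year = min(int(field.split(',')[0]) for field in data[1:])  # 1929
count = sum(int(field.split(',')[1]) for field in data[1:])     # 1 + 5 + 2 = 8
pubs = sum(int(field.split(',')[2]) for field in data[1:])      # 1 + 3 + 2 = 6
print(data[0], min_year, count, pubs)  # der Hund 1929 8 6
```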
However, this only happens for the German v3 n-grams (i.e. `version = '20200217'`).
Thanks for the analysis. I'll have a look at what v3 has to offer.
Pull requests are welcome.