google-ngram-downloader
Error and termination when hitting an unavailable URL
Background: For simplified Chinese, there is no "bq" combination, hence the downloader will quit with an error message when iterating through the data.
Suggestion: Wouldn't it be nicer if there were a try/except block around the data retrieval part, or if the assert were replaced by an if statement that prints an error message and then moves on to the next file?
Hi,
Thanks for the bug report. The issue is not that trivial to fix, because different languages are missing different indices. I would avoid a try ... except block because it might hide real issues, for example when a file that should be retrieved is not retrieved due to a poor connection.
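One possible middle ground, sketched here only as an illustration and not as the library's actual code: skip a file only when the server explicitly answers with HTTP 404, and let every other failure (timeouts, dropped connections, other status codes) propagate. The function name `iter_available` and the injectable `open_url` parameter are hypothetical, and the sketch uses the standard library's `urllib` rather than whatever the downloader uses internally.

```python
import urllib.error
import urllib.request


def iter_available(urls, open_url=urllib.request.urlopen):
    """Yield (url, response) pairs, skipping only URLs that return HTTP 404.

    Hypothetical helper: `open_url` is injectable so the logic can be
    exercised without network access.
    """
    for url in urls:
        try:
            response = open_url(url)
        except urllib.error.HTTPError as err:
            if err.code == 404:
                # The file does not exist for this language: report and move on.
                print("skipping missing file: {}".format(url))
                continue
            # Any other HTTP status is treated as a real problem.
            raise
        # urllib.error.URLError (e.g. a dropped connection) is deliberately
        # not caught here, so connection problems still stop the run.
        yield url, response
```

With this shape, a missing "bq" file would be reported and skipped, while a poor connection would still raise instead of being silently ignored.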
For the time being, you can pass indices to readline_google_store:
>>> from google_ngram_downloader import readline_google_store
>>> fname, url, records = next(readline_google_store(ngram_len=5, indices=['cd', 'ed'], lang='chi-sim'))
>>> fname
'googlebooks-chi-sim-all-5gram-20120701-cd.gz'
>>> url
'http://storage.googleapis.com/books/ngrams/books/googlebooks-chi-sim-all-5gram-20120701-cd.gz'
>>> next(records)
Record(ngram='CDP _NOUN_ _NOUN_ _NOUN_ _NOUN_', year=1983, match_count=1, volume_count=1)
It works this way. However, I have to make sure that I know all the indices, so in the long run it would still be handier if the script could check that itself (e.g. download the Google ngram page and check whether it contains the links corresponding to the indices? Sounds a bit like overkill, though...). Anyway, thanks for the quick reply, very much appreciated! :)
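The "check the page for links" idea could look roughly like the sketch below. It assumes the listing page's HTML contains file names of the same form as the one shown above (`googlebooks-chi-sim-all-5gram-20120701-cd.gz`); the function name `available_indices` and the `version` default are hypothetical, and fetching the page itself is left out so the example stays offline.

```python
import re


def available_indices(page_html, lang, ngram_len, version="20120701"):
    """Return the sorted indices whose .gz file names appear in page_html.

    Hypothetical helper: it assumes file names of the form
    googlebooks-<lang>-all-<n>gram-<version>-<index>.gz.
    """
    pattern = re.compile(
        r"googlebooks-{}-all-{}gram-{}-(\w+)\.gz".format(
            re.escape(lang), int(ngram_len), re.escape(version)
        )
    )
    # A set removes duplicates in case the page links the same file twice.
    return sorted(set(pattern.findall(page_html)))


# Made-up page fragment, used only to illustrate the call.
sample = (
    '<a href="http://storage.googleapis.com/books/ngrams/books/'
    'googlebooks-chi-sim-all-5gram-20120701-cd.gz">cd</a> '
    '<a href="http://storage.googleapis.com/books/ngrams/books/'
    'googlebooks-chi-sim-all-5gram-20120701-ed.gz">ed</a>'
)
indices = available_indices(sample, lang="chi-sim", ngram_len=5)
```

The resulting list could then be passed as the `indices` argument of `readline_google_store`, as in the workaround above.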
I'm very busy right now, but once I get time, I'll just copy the indices from the page.
Hi, there is a pending PR from my fork that solves this. In the meantime, you can pip install the fork directly:
> pip install git+git://github.com/tianhuil/google-ngram-downloader.git@master