google-ngram-downloader Error and termination when hitting an unavailable URL

Open kloppjp opened this issue 9 years ago • 4 comments

Background: For simplified Chinese, there is no file for the "bq" index, so the downloader quits with an error message when iterating through the data.

Suggestion: Wouldn't it be nicer if there were a try/except block around the data retrieval part, or if the assert were replaced by an if statement that prints an error message and then skips to the next file?

Apr 04 '15 10:04 kloppjp

Hi,

Thanks for the bug report. The issue is not that trivial to fix, because different languages are missing different indices. I would avoid a try ... except block because it might hide real issues, for example when a file that should be retrieved is not retrieved due to a poor connection.
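One way to reconcile the two concerns is to skip only when the server explicitly reports that a file does not exist, and let every other failure surface. The following is a minimal sketch, not the library's actual code: `open_or_skip` and its `opener` parameter are hypothetical names, it uses the standard library's `urllib` for illustration, and it assumes a missing index is answered with HTTP 404.

```python
import urllib.request
from urllib.error import HTTPError


def open_or_skip(url, opener=urllib.request.urlopen):
    """Open url; return None only if the server answers 404 (assumed to mean 'no such file')."""
    try:
        return opener(url)
    except HTTPError as err:
        if err.code == 404:
            # The file does not exist for this language: report it and move on.
            print("skipping missing file: {}".format(url))
            return None
        # Any other HTTP status is a real problem and must not be hidden.
        raise
```

Connection failures are not caught here, so a file that should be retrieved but is not, for example because of a poor connection, still stops the run.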

For the time being, you can pass indices to readline_google_store:

>>> from google_ngram_downloader import readline_google_store

>>> fname, url, records = next(readline_google_store(ngram_len=5, indices=['cd', 'ed'], lang='chi-sim'))
>>> fname
'googlebooks-chi-sim-all-5gram-20120701-cd.gz'
>>> url
'http://storage.googleapis.com/books/ngrams/books/googlebooks-chi-sim-all-5gram-20120701-cd.gz'
>>> next(records)
Record(ngram='CDP _NOUN_ _NOUN_ _NOUN_ _NOUN_', year=1983, match_count=1, volume_count=1)
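To avoid typing the index list by hand, it can be generated and filtered. This is a rough sketch under stated assumptions: `MISSING` and `two_letter_indices` are hypothetical names, only "bq" is known from this thread to be missing for chi-sim, and the real index set may contain entries other than two-letter combinations.

```python
from itertools import product
from string import ascii_lowercase

# Assumption: indices known to be missing per language.
# Only 'bq' for 'chi-sim' is confirmed in this thread.
MISSING = {"chi-sim": {"bq"}}


def two_letter_indices(lang):
    """Return all two-letter lowercase indices except those known to be missing for lang."""
    skip = MISSING.get(lang, set())
    return [a + b for a, b in product(ascii_lowercase, repeat=2) if a + b not in skip]


indices = two_letter_indices("chi-sim")
# The result can then be passed on, e.g.:
# readline_google_store(ngram_len=5, indices=indices, lang='chi-sim')
```

Any other index that turns out to be missing would have to be added to `MISSING` by hand.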

Apr 05 '15 14:04 dimazest

It works this way. However, I have to make sure that I know all the indices, so in the long term it would still be handier if the script could check that itself (e.g. download the Google ngram page and check whether it contains the links corresponding to the indices? Sounds a bit like overkill, though...). Anyway, thanks for the quick reply, very much appreciated! :)
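The page-checking idea could look roughly like this. It is only a sketch: `available_indices` is a hypothetical helper, the caller has to supply the HTML of the listing page, and the file name pattern is inferred from the example file name earlier in the thread.

```python
import re


def available_indices(page_html, lang, ngram_len, version="20120701"):
    """Collect the indices for which the page links a matching .gz file."""
    # Pattern assumed from names like
    # 'googlebooks-chi-sim-all-5gram-20120701-cd.gz'.
    pattern = re.compile(
        r"googlebooks-{}-all-{}gram-{}-([A-Za-z0-9_]+)\.gz".format(
            re.escape(lang), int(ngram_len), re.escape(version)
        )
    )
    return sorted(set(pattern.findall(page_html)))
```

The returned list could then be passed as `indices` to `readline_google_store`, so that only files actually linked on the page are requested.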

Apr 06 '15 00:04 kloppjp

I'm very busy right now, but once I get time, I'll just copy the indices from the page.

Apr 08 '15 13:04 dimazest

Hi, there is a pending PR from my fork that solves this. In the meantime, you can pip install the fork directly:

> pip install git+git://github.com/tianhuil/google-ngram-downloader.git@master

Sep 09 '19 03:09 tianhuil
