google-ngram-downloader
After an interruption, the download restarts from the beginning
I had downloaded some of the files when an error occurred during the download. When I restart the download process, it starts again from the first file. Here is my temporary solution, inside the download function:

    for fname, url, request in iter_google_store(ngram_len, verbose=verbose, lang=lang):
        # New check: skip files that already exist in the output directory.
        if os.path.exists(str(output.join(fname))):
            print('already exist')
            continue
        else:
            with output.join(fname).open('wb') as f:
                print(output.join(fname), 'downloading...')
                for num, chunk in enumerate(request.iter_content(1024)):
                    # Print a progress dot every 1024 chunks.
                    if verbose and not divmod(num, 1024)[1]:
                        sys.stderr.write('.')
                        sys.stderr.flush()
                    f.write(chunk)

Maybe this question has already been handled, but are there any better solutions? Thanks.
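One caveat with an existence check like the one above: a file that was only partly written when the error occurred also exists, so it would be skipped as if it were complete. A minimal sketch of one way around this, assuming Python 3, is to write to a temporary name and rename only after the last chunk. The helper name `save_atomically` and the `.part` suffix are hypothetical and not part of google-ngram-downloader.

```python
import os


def save_atomically(chunks, target):
    """Write an iterable of bytes chunks to `target` via a '.part' file.

    `chunks` could be, for example, request.iter_content(1024).
    The final name only appears after every chunk has been written, so an
    os.path.exists() check on `target` never sees a half-finished download.
    """
    part = target + '.part'  # hypothetical suffix, chosen for illustration
    with open(part, 'wb') as f:
        for chunk in chunks:
            f.write(chunk)
    # Rename only once the file is complete (atomic on the same filesystem).
    os.replace(part, target)
```

If the connection drops mid-file, only the `.part` file is left behind, and the next run downloads that file again instead of skipping it.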
Hi, there is a pending PR from my fork that solves this. In the meantime, you can pip install the fork directly:
> pip install git+git://github.com/tianhuil/google-ngram-downloader.git@master
@tianhuil If I install your fork above will I be able to run
google-ngram-downloader download -n 3 -o .
in a directory where I already have some of the length 3 ngrams downloaded? Or do I need to specify that I am using your specific version to get the functionality where the downloads will not restart from the beginning?
If anyone else happens upon this post: I wanted a way to stop the downloads and then come back and continue where I'd left off. This is useful when downloading ngrams of any length greater than 1, since they take many hours to download. The current implementation just restarts from the very first ngram file. To change that, edit util.py and add:
- This import, to use the os functions:
    import os
- This simple if check inside the for loop of iter_google_store:
    if os.path.isfile(fname):
        # Report the skipped file, one name per line.
        sys.stderr.write(fname + '\n')
        continue
Then re-run the command with an output directory that already contains some downloaded ngrams, and it will continue at the next ngram file that has not been downloaded yet. Note that fname is a bare file name, so the check looks in the current working directory; run the command from the output directory (-o .) as in the example below.
Here is some sample output from a directory where I'd already downloaded the A, B and some of the C ngrams:
/Volumes/Seagate » google-ngram-downloader download -n 3 -o . -v @MacBook-Pro-5
googlebooks-eng-all-3gram-20120701-0.gz
googlebooks-eng-all-3gram-20120701-1.gz
googlebooks-eng-all-3gram-20120701-2.gz
googlebooks-eng-all-3gram-20120701-3.gz
googlebooks-eng-all-3gram-20120701-4.gz
googlebooks-eng-all-3gram-20120701-5.gz
googlebooks-eng-all-3gram-20120701-6.gz
googlebooks-eng-all-3gram-20120701-7.gz
googlebooks-eng-all-3gram-20120701-8.gz
googlebooks-eng-all-3gram-20120701-9.gz
googlebooks-eng-all-3gram-20120701-aa.gz
googlebooks-eng-all-3gram-20120701-ab.gz
googlebooks-eng-all-3gram-20120701-ac.gz
googlebooks-eng-all-3gram-20120701-ad.gz
googlebooks-eng-all-3gram-20120701-ae.gz
googlebooks-eng-all-3gram-20120701-af.gz
googlebooks-eng-all-3gram-20120701-ag.gz
googlebooks-eng-all-3gram-20120701-ah.gz
googlebooks-eng-all-3gram-20120701-ai.gz
googlebooks-eng-all-3gram-20120701-aj.gz
googlebooks-eng-all-3gram-20120701-ak.gz
googlebooks-eng-all-3gram-20120701-al.gz
googlebooks-eng-all-3gram-20120701-am.gz
googlebooks-eng-all-3gram-20120701-an.gz
googlebooks-eng-all-3gram-20120701-ao.gz
googlebooks-eng-all-3gram-20120701-ap.gz
googlebooks-eng-all-3gram-20120701-aq.gz
googlebooks-eng-all-3gram-20120701-ar.gz
googlebooks-eng-all-3gram-20120701-as.gz
googlebooks-eng-all-3gram-20120701-at.gz
googlebooks-eng-all-3gram-20120701-au.gz
googlebooks-eng-all-3gram-20120701-av.gz
googlebooks-eng-all-3gram-20120701-aw.gz
googlebooks-eng-all-3gram-20120701-ax.gz
googlebooks-eng-all-3gram-20120701-ay.gz
googlebooks-eng-all-3gram-20120701-az.gz
googlebooks-eng-all-3gram-20120701-a_.gz
googlebooks-eng-all-3gram-20120701-ba.gz
googlebooks-eng-all-3gram-20120701-bb.gz
googlebooks-eng-all-3gram-20120701-bc.gz
googlebooks-eng-all-3gram-20120701-bd.gz
googlebooks-eng-all-3gram-20120701-be.gz
googlebooks-eng-all-3gram-20120701-bf.gz
googlebooks-eng-all-3gram-20120701-bg.gz
googlebooks-eng-all-3gram-20120701-bh.gz
googlebooks-eng-all-3gram-20120701-bi.gz
googlebooks-eng-all-3gram-20120701-bj.gz
googlebooks-eng-all-3gram-20120701-bk.gz
googlebooks-eng-all-3gram-20120701-bl.gz
googlebooks-eng-all-3gram-20120701-bm.gz
googlebooks-eng-all-3gram-20120701-bn.gz
googlebooks-eng-all-3gram-20120701-bo.gz
googlebooks-eng-all-3gram-20120701-bp.gz
googlebooks-eng-all-3gram-20120701-bq.gz
googlebooks-eng-all-3gram-20120701-br.gz
googlebooks-eng-all-3gram-20120701-bs.gz
googlebooks-eng-all-3gram-20120701-bt.gz
googlebooks-eng-all-3gram-20120701-bu.gz
googlebooks-eng-all-3gram-20120701-bv.gz
googlebooks-eng-all-3gram-20120701-bw.gz
googlebooks-eng-all-3gram-20120701-bx.gz
googlebooks-eng-all-3gram-20120701-by.gz
googlebooks-eng-all-3gram-20120701-bz.gz
googlebooks-eng-all-3gram-20120701-b_.gz
googlebooks-eng-all-3gram-20120701-ca.gz
googlebooks-eng-all-3gram-20120701-cb.gz
googlebooks-eng-all-3gram-20120701-cc.gz
googlebooks-eng-all-3gram-20120701-cd.gz
googlebooks-eng-all-3gram-20120701-ce.gz
googlebooks-eng-all-3gram-20120701-cf.gz
googlebooks-eng-all-3gram-20120701-cg.gz
googlebooks-eng-all-3gram-20120701-ch.gz
googlebooks-eng-all-3gram-20120701-ci.gz
googlebooks-eng-all-3gram-20120701-cj.gz
googlebooks-eng-all-3gram-20120701-ck.gz
googlebooks-eng-all-3gram-20120701-cl.gz
googlebooks-eng-all-3gram-20120701-cm.gz
googlebooks-eng-all-3gram-20120701-cn.gz
Downloading http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-3gram-20120701-co.gz .
...continues downloading the rest of the ngrams beginning at co
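The skip check above relies on the command being run from the output directory. A minimal sketch of the same idea that resolves each name against the output directory instead, so it does not depend on the current working directory, could look like this. The function name `pending_files` is hypothetical and not part of google-ngram-downloader.

```python
import os


def pending_files(fnames, output_dir):
    """Return the names from `fnames` that are not yet files in `output_dir`."""
    return [
        fname for fname in fnames
        # Join with the output directory so the check works from any cwd.
        if not os.path.isfile(os.path.join(output_dir, fname))
    ]
```

Like the original check, this only tests whether a file exists, so a file truncated by an interrupted download still counts as done.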