google-ngram-downloader

Download restarts from the beginning after an interruption

Open opconty opened this issue 7 years ago • 3 comments

I have downloaded some of the zip files, but an error occurred partway through the download. When I restart the download process, it starts again from the first file. Here is my temporary solution, inside the download function:

    for fname, url, request in iter_google_store(ngram_len, verbose=verbose, lang=lang):
        # Added check: skip files that already exist in the output directory
        if os.path.exists(str(output.join(fname))):
            print('already exist')
            continue
        else:
            with output.join(fname).open('wb') as f:
                print(output.join(fname), 'downloading...')
                for num, chunk in enumerate(request.iter_content(1024)):
                    if verbose and not divmod(num, 1024)[1]:
                        sys.stderr.write('.')
                        sys.stderr.flush()
                    f.write(chunk)

Maybe this has already been handled, but are there any better solutions? Thanks.
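One weakness of an existence check on its own is that a file left half-written by the interruption would also be skipped on the next run. A possible refinement, sketched here under assumptions, is to write to a temporary name and rename only once the whole body has been written. The function name `save_if_missing`, its parameters, and the `.part` suffix are hypothetical and not part of google-ngram-downloader; `chunks` stands in for `request.iter_content(1024)`.

```python
import os


def save_if_missing(out_dir, fname, chunks):
    """Write chunks to out_dir/fname unless that file already exists.

    Returns True if the file was written, False if it was skipped.
    """
    target = os.path.join(out_dir, fname)
    if os.path.exists(target):
        return False  # completed in an earlier run, nothing to do

    # Hypothetical suffix marking a download that is still in progress.
    partial = target + '.part'
    with open(partial, 'wb') as f:
        for chunk in chunks:
            f.write(chunk)

    # Rename only after every chunk has been written, so an interrupted
    # download never leaves a file under the final name.
    os.replace(partial, target)
    return True
```

If the loop is interrupted, only the `.part` file remains, and the next run overwrites it and downloads that file again from the start.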

opconty avatar Aug 21 '17 09:08 opconty

Hi, there is a pending PR from my fork that solves this, but you can pip install the fork in the meantime:

> pip install git+git://github.com/tianhuil/google-ngram-downloader.git@master

tianhuil avatar Sep 09 '19 03:09 tianhuil

@tianhuil If I install your fork above will I be able to run

google-ngram-downloader download -n 3 -o .

in a directory where I already have some of the length 3 ngrams downloaded? Or do I need to specify that I am using your specific version to get the functionality where the downloads will not restart from the beginning?

lehigh123 avatar Feb 28 '21 20:02 lehigh123

In case anyone else happens upon this post: I wanted a way to stop the downloads and then come back and continue downloading where I'd left off. This is useful when downloading ngrams of any size greater than 1, since they take many hours to download. The current implementation just restarts from the very first ngram. If you update util.py and add

  1. This import, to use os functions:
import os
  2. This simple if check inside the iter_google_store for loop:
        if os.path.isfile(fname):
            sys.stderr.write(fname)
            continue

and re-run the command in an output directory where some ngrams have already been downloaded, it will continue downloading at the next undownloaded ngram.
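One caveat with this check: assuming `fname` is a bare file name such as `googlebooks-eng-all-3gram-20120701-co.gz`, `os.path.isfile(fname)` resolves it against the current working directory, so the check appears to rely on running the command with `-o .` from inside the output directory. A minimal sketch of a variant that takes the directory explicitly follows; `pending_files` and `output_dir` are hypothetical names, not part of google-ngram-downloader.

```python
import os


def pending_files(fnames, output_dir='.'):
    """Return the names in fnames that are not yet regular files in output_dir."""
    return [
        fname for fname in fnames
        if not os.path.isfile(os.path.join(output_dir, fname))
    ]
```

Like the check above, this only tests that a file exists, so a file left half-written by an interrupted run would still be treated as downloaded.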

Here is some sample output from a directory where I'd already downloaded the A, B and some of the C ngrams:

/Volumes/Seagate » google-ngram-downloader download -n 3 -o . -v
googlebooks-eng-all-3gram-20120701-0.gz
googlebooks-eng-all-3gram-20120701-1.gz
googlebooks-eng-all-3gram-20120701-2.gz
googlebooks-eng-all-3gram-20120701-3.gz
googlebooks-eng-all-3gram-20120701-4.gz
googlebooks-eng-all-3gram-20120701-5.gz
googlebooks-eng-all-3gram-20120701-6.gz
googlebooks-eng-all-3gram-20120701-7.gz
googlebooks-eng-all-3gram-20120701-8.gz
googlebooks-eng-all-3gram-20120701-9.gz
googlebooks-eng-all-3gram-20120701-aa.gz
googlebooks-eng-all-3gram-20120701-ab.gz
googlebooks-eng-all-3gram-20120701-ac.gz
googlebooks-eng-all-3gram-20120701-ad.gz
googlebooks-eng-all-3gram-20120701-ae.gz
googlebooks-eng-all-3gram-20120701-af.gz
googlebooks-eng-all-3gram-20120701-ag.gz
googlebooks-eng-all-3gram-20120701-ah.gz
googlebooks-eng-all-3gram-20120701-ai.gz
googlebooks-eng-all-3gram-20120701-aj.gz
googlebooks-eng-all-3gram-20120701-ak.gz
googlebooks-eng-all-3gram-20120701-al.gz
googlebooks-eng-all-3gram-20120701-am.gz
googlebooks-eng-all-3gram-20120701-an.gz
googlebooks-eng-all-3gram-20120701-ao.gz
googlebooks-eng-all-3gram-20120701-ap.gz
googlebooks-eng-all-3gram-20120701-aq.gz
googlebooks-eng-all-3gram-20120701-ar.gz
googlebooks-eng-all-3gram-20120701-as.gz
googlebooks-eng-all-3gram-20120701-at.gz
googlebooks-eng-all-3gram-20120701-au.gz
googlebooks-eng-all-3gram-20120701-av.gz
googlebooks-eng-all-3gram-20120701-aw.gz
googlebooks-eng-all-3gram-20120701-ax.gz
googlebooks-eng-all-3gram-20120701-ay.gz
googlebooks-eng-all-3gram-20120701-az.gz
googlebooks-eng-all-3gram-20120701-a_.gz
googlebooks-eng-all-3gram-20120701-ba.gz
googlebooks-eng-all-3gram-20120701-bb.gz
googlebooks-eng-all-3gram-20120701-bc.gz
googlebooks-eng-all-3gram-20120701-bd.gz
googlebooks-eng-all-3gram-20120701-be.gz
googlebooks-eng-all-3gram-20120701-bf.gz
googlebooks-eng-all-3gram-20120701-bg.gz
googlebooks-eng-all-3gram-20120701-bh.gz
googlebooks-eng-all-3gram-20120701-bi.gz
googlebooks-eng-all-3gram-20120701-bj.gz
googlebooks-eng-all-3gram-20120701-bk.gz
googlebooks-eng-all-3gram-20120701-bl.gz
googlebooks-eng-all-3gram-20120701-bm.gz
googlebooks-eng-all-3gram-20120701-bn.gz
googlebooks-eng-all-3gram-20120701-bo.gz
googlebooks-eng-all-3gram-20120701-bp.gz
googlebooks-eng-all-3gram-20120701-bq.gz
googlebooks-eng-all-3gram-20120701-br.gz
googlebooks-eng-all-3gram-20120701-bs.gz
googlebooks-eng-all-3gram-20120701-bt.gz
googlebooks-eng-all-3gram-20120701-bu.gz
googlebooks-eng-all-3gram-20120701-bv.gz
googlebooks-eng-all-3gram-20120701-bw.gz
googlebooks-eng-all-3gram-20120701-bx.gz
googlebooks-eng-all-3gram-20120701-by.gz
googlebooks-eng-all-3gram-20120701-bz.gz
googlebooks-eng-all-3gram-20120701-b_.gz
googlebooks-eng-all-3gram-20120701-ca.gz
googlebooks-eng-all-3gram-20120701-cb.gz
googlebooks-eng-all-3gram-20120701-cc.gz
googlebooks-eng-all-3gram-20120701-cd.gz
googlebooks-eng-all-3gram-20120701-ce.gz
googlebooks-eng-all-3gram-20120701-cf.gz
googlebooks-eng-all-3gram-20120701-cg.gz
googlebooks-eng-all-3gram-20120701-ch.gz
googlebooks-eng-all-3gram-20120701-ci.gz
googlebooks-eng-all-3gram-20120701-cj.gz
googlebooks-eng-all-3gram-20120701-ck.gz
googlebooks-eng-all-3gram-20120701-cl.gz
googlebooks-eng-all-3gram-20120701-cm.gz
googlebooks-eng-all-3gram-20120701-cn.gz
Downloading http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-3gram-20120701-co.gz .

...continues downloading the rest of the ngrams beginning at co

lehigh123 avatar Mar 02 '21 03:03 lehigh123