PyTorch-NLP
wmt_dataset download failed
Expected Behavior
- I tried to follow the WMT14 dataset example in the PyTorch-NLP documentation. (https://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.datasets.html)
- The WMT dataset should download successfully.
Actual Behavior
- wmt_dataset raises ValueError: [DOWNLOAD FAILED].
Steps to Reproduce the Problem
- Install pytorch-nlp 0.5.0
- from torchnlp.datasets import wmt_dataset
- train = wmt_dataset(train=True)
>>> train = wmt_dataset(train=True)
tar: Error opening archive: Unrecognized archive format
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.9/site-packages/torchnlp/datasets/wmt.py", line 63, in wmt_dataset
    download_file_maybe_extract(
  File "/usr/local/lib/python3.9/site-packages/torchnlp/download.py", line 170, in download_file_maybe_extract
    raise ValueError('[DOWNLOAD FAILED] `*check_files` not found')
ValueError: [DOWNLOAD FAILED] `*check_files` not found
In torchnlp/download.py:
def _download_file_from_drive(filename, url):  # pragma: no cover
    """ Download filename from google drive unless it's already in directory.
    Args:
        filename (str): Name of the file to download to (do nothing if it already exists).
        url (str): URL to download from.
    """
    confirm_token = None
    # Since the file is big, drive will scan it for virus and take it to a
    # warning page. We find the confirm token on this page and append it to the
    # URL to start the download process.
    confirm_token = None
    session = requests.Session()
    response = session.get(url, stream=True)
    for k, v in response.cookies.items():
        if k.startswith("download_warning"):
            confirm_token = v
    if confirm_token:
        url = url + "&confirm=" + confirm_token
    logger.info("Downloading %s to %s" % (url, filename))
    response = session.get(url, stream=True)
    # Now begin the download.
    chunk_size = 16 * 1024
    with open(filename, "wb") as f:
        for chunk in response.iter_content(chunk_size):
            if chunk:
                f.write(chunk)
    # Print newline to clear the carriage return from the download progress
    statinfo = os.stat(filename)
    logger.info("Successfully downloaded %s, %s bytes." % (filename, statinfo.st_size))
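The function above writes whatever the server returns to disk without looking at the response, so an error page ends up saved under the archive's name. A minimal sketch of a guard that could catch this before writing is shown here. The name `check_download_response` is hypothetical and not part of torchnlp, and treating every `text/html` response as a failure is an assumption that fits archive downloads only.

```python
def check_download_response(status_code, content_type):
    """Raise ValueError if an HTTP response is unlikely to be the requested archive.

    Hypothetical helper for illustration; not part of torchnlp.
    """
    if status_code != 200:
        raise ValueError("download failed with HTTP status %d" % status_code)
    # Drop parameters such as "; charset=utf-8" before comparing the media type.
    media_type = content_type.split(";")[0].strip().lower()
    # Assumption: an HTML body means an error or warning page, not an archive.
    if media_type == "text/html":
        raise ValueError("got an HTML page instead of a file")
```

With requests, it could be called as `check_download_response(response.status_code, response.headers.get("Content-Type", ""))` right after the second `session.get`.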
I checked which of the `*check_files` was not found.
Result:
data/wmt16_en_de/train.tok.clean.bpe.32000.en
Extracting data/wmt16_en_de/wmt16_en_de.tar.gz
tar: Error opening archive: Unrecognized archive format
data/wmt16_en_de/train.tok.clean.bpe.32000.en
The file 'data/wmt16_en_de/wmt16_en_de.tar.gz' is identified as HTML document text, ASCII text, so it is not a tar archive.
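For anyone who wants to confirm the same thing from Python, here is a rough sketch that checks whether a downloaded file starts with the gzip magic bytes (0x1f 0x8b). The helper name `is_gzip_file` is made up for this example and is not a torchnlp function.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzip_file(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC
```

Calling `is_gzip_file("data/wmt16_en_de/wmt16_en_de.tar.gz")` on a saved HTML page returns False, because such a page starts with text rather than the magic bytes.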
Opening the URL 'https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8', which the documentation uses for the WMT dataset, returns a 404 Not Found page.
This bug is caused by the dead WMT data URL in the documentation.
Any update on this?