
WikiText-2 is not a zip file

Open · CharryLee0426 opened this issue 11 months ago · 3 comments

When I executed the following code:

# PyTorch version
from d2l import torch as d2l

batch_size, max_len = 512, 64
train_iter, vocab = d2l.load_data_wiki(batch_size, max_len)

# MXNet version
from d2l import mxnet as d2l

batch_size, max_len = 512, 64
train_iter, vocab = d2l.load_data_wiki(batch_size, max_len)

I ran into this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/charry/miniconda3/envs/d2l/lib/python3.9/site-packages/d2l/torch.py", line 2443, in load_data_wiki
    data_dir = d2l.download_extract('wikitext-2', 'wikitext-2')
  File "/home/charry/miniconda3/envs/d2l/lib/python3.9/site-packages/d2l/torch.py", line 3247, in download_extract
    fp = zipfile.ZipFile(fname, 'r')
  File "/home/charry/miniconda3/envs/d2l/lib/python3.9/zipfile.py", line 1266, in __init__
    self._RealGetContents()
  File "/home/charry/miniconda3/envs/d2l/lib/python3.9/zipfile.py", line 1333, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

I think it is because the dataset file on the server has been corrupted. I reproduced this error with d2l 1.0.0 through 1.0.3, and it occurs whenever the WikiText-2 dataset is needed.
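
To confirm, you can inspect the cached download (a minimal sketch, assuming the default ../data cache directory that d2l uses):

import zipfile

# Where d2l caches the archive by default; adjust if you use a
# different cache directory (this path is an assumption).
fname = '../data/wikitext-2-v1.zip'

# is_zipfile() checks the zip magic bytes without raising an exception.
print(zipfile.is_zipfile(fname))  # False for the broken download

# A real zip starts with b'PK'; a failed S3 request leaves an XML error
# document behind instead.
with open(fname, 'rb') as f:
    print(f.read(64))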

I have a pull request that failed its checks due to this error. I also noticed that several pull requests fixing typos failed their checks for the same reason.

I hope this error can be fixed as soon as possible.

CharryLee0426 · Mar 07 '24 18:03

The wikitext-2 dataset URL returns this error:

<Error>
<Code>AccessDenied</Code>
<Message>Access Denied</Message>
<RequestId>MM9XHEKPABYT4NPW</RequestId>
<HostId>KOjOK6r2VNkvN6gS28B7s2akq8hULUJohhsiCnyrL9RMzjk3RAIvYnVZiHGd6PPVEIDnQHTijnI=</HostId>
</Error>
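
Here is a minimal sketch of how to check this yourself (assuming d2l's DATA_HUB registry, which maps a dataset name to its (URL, sha1) pair):

import requests
from d2l import torch as d2l

# Look up the download URL registered for the dataset.
url, _ = d2l.DATA_HUB['wikitext-2']
print(url)

# S3 currently answers with 403 Forbidden and the AccessDenied XML above.
r = requests.get(url)
print(r.status_code)
print(r.text[:300])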

CharryLee0426 · Mar 07 '24 22:03

Having the same issue. Is there an updated URL we can use?

donny-nyc · May 05 '24 01:05

Same issue here. According to the book, the dataset is from

Merity, S., Xiong, C., Bradbury, J., & Socher, R. (2016). Pointer sentinel mixture models. ArXiv:1609.07843.

In that paper, http://metamind.io/research/the-wikitext-long-term-dependency-language-modeling-dataset/ is linked, and that site can't be reached anymore. Likewise, https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip is no longer accessible. Does anyone have a good mirror for this?
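
Until the URL is fixed, here is a manual workaround sketch (assuming you can obtain wikitext-2-v1.zip by hand from some mirror; the local path below is a placeholder):

import hashlib
import os
import shutil
from d2l import torch as d2l

# Placeholder path to a copy of the zip obtained by hand.
local_copy = 'wikitext-2-v1.zip'

# Copy it into d2l's cache directory (../data by default) and register
# its sha1, so that d2l.download() accepts the cached file and never
# contacts the dead URL.
os.makedirs('../data', exist_ok=True)
shutil.copy(local_copy, os.path.join('../data', 'wikitext-2-v1.zip'))
with open(local_copy, 'rb') as f:
    sha1 = hashlib.sha1(f.read()).hexdigest()
url, _ = d2l.DATA_HUB['wikitext-2']
d2l.DATA_HUB['wikitext-2'] = (url, sha1)

batch_size, max_len = 512, 64
train_iter, vocab = d2l.load_data_wiki(batch_size, max_len)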

MassEast · Jun 19 '24 09:06