tokenizers

wikitext-103-raw-v1.zip is no longer available on amazonaws

Open gec1-dev opened this issue 1 year ago • 4 comments

The raw dataset wikitext-103-raw-v1.zip is no longer available for download from amazonaws, and as far as I can see it is not available anywhere else on the internet. Other people have complained on different repos that this raw dataset has disappeared. I don't know whether this is permanent or whether a new dataset should be used in this tutorial example.

https://github.com/huggingface/tokenizers/blob/f0c48bd89a442819b39605ca117ecabd293bfdd7/docs/source-doc-builder/quicktour.mdx?plain=1#L15

I receive the following error when trying to wget the file:

wget --trust-server-names https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
--2024-11-18 15:28:28-- https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.172.216, 16.182.32.200, 52.217.112.120, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.172.216|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2024-11-18 15:28:28 ERROR 403: Forbidden.

Nov 18 '24 14:11 gec1-dev

The file is available at 'https://dax-cdn.cdn.appdomain.cloud/dax-wikitext-103/1.0.1/wikitext-103.tar.gz'

May 27 '25 12:05 sudarsun

Do you want to open a PR to update?

May 27 '25 13:05 ArthurZucker

> The file is available at 'https://dax-cdn.cdn.appdomain.cloud/dax-wikitext-103/1.0.1/wikitext-103.tar.gz'

"We are having trouble finding that site"

Jul 25 '25 15:07 Axelfoley85

I found it here.

Aug 10 '25 09:08 darucaile
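
Since the download locations mentioned in this thread have stopped working at different times, it may help to probe a list of candidate URLs before running the tutorial. The following is only a sketch, using the Python standard library. The helper name `first_reachable` and its `opener` parameter are hypothetical (they are not part of tokenizers), and the two URLs are simply the ones quoted above, neither of which is guaranteed to respond today.

```python
import urllib.error
import urllib.request

# URLs quoted in this thread. Both have been reported as failing at some point,
# so treat this list as a placeholder and add any mirror you trust.
CANDIDATE_URLS = [
    "https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip",
    "https://dax-cdn.cdn.appdomain.cloud/dax-wikitext-103/1.0.1/wikitext-103.tar.gz",
]


def first_reachable(urls, opener=urllib.request.urlopen, timeout=10):
    """Return (url, errors) for the first URL answering a HEAD request with 2xx.

    If no URL responds successfully, the first element is None. `errors` maps
    every URL that failed before the result was found to a short reason.
    `opener` is injectable (hypothetical parameter) so the logic can be
    exercised without network access.
    """
    errors = {}
    for url in urls:
        request = urllib.request.Request(url, method="HEAD")
        try:
            with opener(request, timeout=timeout) as response:
                if 200 <= response.status < 300:
                    return url, errors
                errors[url] = f"HTTP {response.status}"
        except urllib.error.HTTPError as exc:
            # Covers responses such as the 403 Forbidden reported above.
            errors[url] = f"HTTP {exc.code}"
        except (urllib.error.URLError, OSError) as exc:
            # Covers DNS failures, refused connections and timeouts.
            errors[url] = str(getattr(exc, "reason", exc))
    return None, errors
```

Calling `first_reachable(CANDIDATE_URLS)` returns the first working URL, or `None` together with the per-URL failure reasons. Note that the two archives are in different formats (`.zip` and `.tar.gz`), so the extraction step would need to match whichever one is found.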