Could not get the file at http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz. [RequestException] None.

Open tiassap opened this issue 2 years ago • 8 comments

I ran the code on Google colab.

When building German vocabulary here:

if is_interactive_notebook():
    # global variables used later in the script
    spacy_de, spacy_en = show_example(load_tokenizers)
    vocab_src, vocab_tgt = show_example(load_vocab, args=[spacy_de, spacy_en])

This error showed up:

Could not get the file at http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz. [RequestException] None.

Is this a problem with torchtext? I found that the error occurs when this line runs:

vocab_src = build_vocab_from_iterator(
        yield_tokens(train + val + test, tokenize_de, index=0),
        min_freq=2,
        specials=["<s>", "</s>", "<blank>", "<unk>"],
    )

Thank you in advance.

tiassap avatar Jun 14 '22 05:06 tiassap

I am having the same problem. It seems that this URL:

http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz

is no longer available.

The maintainer of this repository:

https://github.com/PetrochukM/PyTorch-NLP/blob/master/torchnlp/datasets/multi30k.py

writes:

"Host www.quest.dcs.shef.ac.uk forgot to update their SSL certificate; therefore, this dataset does not download securely."

Hope this offers some insight into the problem.

aambrioso1 avatar Jun 15 '22 15:06 aambrioso1

Thank you for the info @aambrioso1

tiassap avatar Jun 21 '22 04:06 tiassap

@tiassap I ran into the same problem you described. Did you find another way to access those files?

youbinaa avatar Jun 21 '22 10:06 youbinaa

I was able to get the code to work by using another data file. The basic idea is that the training, validation, and test sets are all lists of tuples, and each tuple is a sentence pair with one sentence per language. That makes it easy to set up any language pairing you like. Here is my implementation in Colab along with lots of notes:

https://colab.research.google.com/drive/131hohvAKRqzHg4K3_68UGL4oi4SGOB45?usp=sharing
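
For readers who cannot open the notebook, here is a minimal sketch of the list-of-tuples idea. The toy sentence pairs, the whitespace tokenizer, and the `build_vocab` helper are illustrative stand-ins and are not taken from the notebook; the original code uses spaCy tokenizers and torchtext's `build_vocab_from_iterator`. Only `yield_tokens` follows the shape of the call quoted at the top of this thread.

```python
from collections import Counter

# Hypothetical toy data: each split is a list of (German, English) tuples.
train = [("ein Hund läuft", "a dog runs"), ("ein Hund schläft", "a dog sleeps")]
val = [("eine Katze schläft", "a cat sleeps")]
test = [("ein Vogel singt", "a bird sings")]


def tokenize(text):
    """Stand-in tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def yield_tokens(data_iter, tokenizer, index):
    """Yield the token list for one side (index 0 or 1) of every pair."""
    for pair in data_iter:
        yield tokenizer(pair[index])


def build_vocab(token_lists, min_freq=2,
                specials=("<s>", "</s>", "<blank>", "<unk>")):
    """Map tokens to ids: specials first, then tokens seen >= min_freq times."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    kept = sorted(tok for tok, n in counts.items() if n >= min_freq)
    return {tok: i for i, tok in enumerate(list(specials) + kept)}


# Same call pattern as in the notebook, but with no download involved.
vocab_src = build_vocab(yield_tokens(train + val + test, tokenize, index=0))
vocab_tgt = build_vocab(yield_tokens(train + val + test, tokenize, index=1))
```

Because the splits are ordinary Python lists, swapping in another corpus only means changing how `train`, `val`, and `test` are filled.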

aambrioso1 avatar Jun 23 '22 15:06 aambrioso1

Thank you @aambrioso1. It is very helpful.

So we can use other datasets as well, as long as the data has the format [(de_1, en_1), ..., (de_n, en_n)]. You are using the German/English dataset from the European Parliament Proceedings Parallel Corpus 1996-2011 (https://www.statmt.org/europarl/), and the training, val, and test datasets are declared as global variables.

Just for information, @youbinaa: it seems that Multi30k can also be downloaded from this repo: https://github.com/multi30k/dataset.

The problem is that the source URL used by torchtext.datasets.Multi30k() is not accessible. Let's hope it will be fixed soon.
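
To get plain-text parallel data into that format, something like the following could work. It assumes two line-aligned UTF-8 files, where line i of one file is the translation of line i of the other (as far as I know, this is how the Europarl language pairs are laid out). The helper names `read_pairs` and `split_pairs`, the split sizes, and the file names in the usage comment are all hypothetical.

```python
def read_lines(path):
    """Read a UTF-8 text file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def read_pairs(src_path, tgt_path):
    """Read two line-aligned files into [(src_1, tgt_1), ..., (src_n, tgt_n)]."""
    src_lines = read_lines(src_path)
    tgt_lines = read_lines(tgt_path)
    if len(src_lines) != len(tgt_lines):
        raise ValueError("files are not line-aligned")
    pairs = []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        # Skip pairs where either side is empty.
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


def split_pairs(pairs, val_size, test_size):
    """Hold out the last val_size + test_size pairs; the rest is training data."""
    n_train = len(pairs) - val_size - test_size
    if n_train <= 0:
        raise ValueError("not enough pairs for the requested split sizes")
    train = pairs[:n_train]
    val = pairs[n_train:n_train + val_size]
    test = pairs[n_train + val_size:]
    return train, val, test


# Hypothetical usage (file names are placeholders):
# train, val, test = split_pairs(read_pairs("corpus.de", "corpus.en"), 1000, 1000)
```

The resulting `train`, `val`, and `test` lists can then stand in for the values that `datasets.Multi30k(...)` would have returned.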

tiassap avatar Jun 24 '22 07:06 tiassap

How can I download it in Colab? I mean, what change do I need to make in the code to download it?

train, val, test = datasets.Multi30k('data', language_pair=("de", "en"))

EsmaeilChitgar avatar Oct 27 '23 07:10 EsmaeilChitgar

from torchtext.datasets import multi30k

multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"
multi30k.URL["test"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt16_task1_test.tar.gz"

multi30k.MD5["train"] = "20140d013d05dd9a72dfde46478663ba05737ce983f478f960c1123c6671be5e"
multi30k.MD5["valid"] = "a7aa20e9ebd5ba5adce7909498b94410996040857154dab029851af3a866da8c"
multi30k.MD5["test"] = "6d1ca1dba99e2c5dd54cae1226ff11c2551e6ce63527ebb072a1f70f72a5cd36"

https://discuss.pytorch.org/t/build-vocab-from-iterator-does-not-work-in-notebook/153575/16
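
One detail worth noting: although the dictionary is named `MD5`, the values above are 64 hex characters long, which is the length of a SHA-256 digest, not an MD5 one. If you download the archives by hand and want to compare them against those strings, a sketch along these lines should do. The `sha256_of` helper and the local file name in the usage comment are hypothetical and not part of torchtext.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Expected value copied from the "train" entry in this comment.
EXPECTED_TRAIN = "20140d013d05dd9a72dfde46478663ba05737ce983f478f960c1123c6671be5e"

# Hypothetical usage once training.tar.gz is in the working directory:
# if sha256_of("training.tar.gz") != EXPECTED_TRAIN:
#     raise RuntimeError("training.tar.gz does not match the expected hash")
```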

g-i-o-r-g-i-o avatar Nov 27 '23 18:11 g-i-o-r-g-i-o

Thanks! It works!

minsuk-sung avatar Mar 31 '24 08:03 minsuk-sung