annotated-transformer Could not get the file at http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz. [RequestException] None.

I ran the code on Google Colab.

When building the German vocabulary here:

if is_interactive_notebook():
    # global variables used later in the script
    spacy_de, spacy_en = show_example(load_tokenizers)
    vocab_src, vocab_tgt = show_example(load_vocab, args=[spacy_de, spacy_en])

This error showed up:

Could not get the file at http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz. [RequestException] None.

Is this a problem with torchtext? I found that the error occurs when this line is called:

vocab_src = build_vocab_from_iterator(
        yield_tokens(train + val + test, tokenize_de, index=0),
        min_freq=2,
        specials=["<s>", "</s>", "<blank>", "<unk>"],
    )

Thank you in advance.

Jun 14 '22 05:06 tiassap

I am having the same problem. It seems that the site:

http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz

is no longer available.

The maintainer of this repository:

https://github.com/PetrochukM/PyTorch-NLP/blob/master/torchnlp/datasets/multi30k.py

writes:

"Host www.quest.dcs.shef.ac.uk forgot to update their SSL certificate; therefore, this dataset does not download securely."

Hope this offers some insight into the problem.

Jun 15 '22 15:06 aambrioso1

Thank you for the info @aambrioso1

Jun 21 '22 04:06 tiassap

@tiassap I ran into the same problem you described. Did you find another way to access those files?

Jun 21 '22 10:06 youbinaa

I was able to get the code to work by using another data file. The basic idea is that the training, validation, and test sets are all lists of tuples. Each tuple is a sentence pair, with one sentence per language. This insight is nice since it makes it easy to create any language pairing you would like. Here is my implementation in Colab, along with lots of notes:

https://colab.research.google.com/drive/131hohvAKRqzHg4K3_68UGL4oi4SGOB45?usp=sharing
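For readers who cannot open the notebook, here is a minimal sketch of the "list of tuples" idea. It is not the code from the notebook above. It assumes two line-aligned plain-text files, one sentence per line; the function names `load_pairs` and `split_pairs`, the file paths, and the split fractions are all hypothetical.

```python
from itertools import zip_longest
from typing import List, Tuple

Pair = Tuple[str, str]


def load_pairs(src_path: str, tgt_path: str) -> List[Pair]:
    """Read two line-aligned files into a list of (source, target) tuples."""
    pairs: List[Pair] = []
    with open(src_path, encoding="utf-8") as f_src, \
         open(tgt_path, encoding="utf-8") as f_tgt:
        for src, tgt in zip_longest(f_src, f_tgt):
            # zip_longest pads the shorter file with None, which means
            # the two files are not aligned line by line.
            if src is None or tgt is None:
                raise ValueError("source and target files differ in length")
            src, tgt = src.strip(), tgt.strip()
            # Skip pairs where either side is empty.
            if src and tgt:
                pairs.append((src, tgt))
    return pairs


def split_pairs(
    pairs: List[Pair], val_fraction: float = 0.1, test_fraction: float = 0.1
) -> Tuple[List[Pair], List[Pair], List[Pair]]:
    """Split the pairs into train, val, and test lists, in file order."""
    n_val = int(len(pairs) * val_fraction)
    n_test = int(len(pairs) * test_fraction)
    n_train = len(pairs) - n_val - n_test
    train = pairs[:n_train]
    val = pairs[n_train:n_train + n_val]
    test = pairs[n_train + n_val:]
    return train, val, test
```

The three lists returned by `split_pairs` have the same shape as the train, val, and test sets described above, so in principle they can be used wherever the notebook expects those sets. The split here is by position only; shuffling before splitting may be preferable for a real corpus.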

Jun 23 '22 15:06 aambrioso1

Thank you @aambrioso1. It is very helpful.

So we can use other datasets as well, as long as the data has the format [(de_1, en_1), ..., (de_n, en_n)]. You are using the German/English dataset from the European Parliament Proceedings Parallel Corpus 1996-2011 (https://www.statmt.org/europarl/), and the training, val, and test datasets are declared as global variables.

For your information, @youbinaa, it seems that Multi30k can also be downloaded from this repo: https://github.com/multi30k/dataset.

The problem is that the source URL used by torchtext.datasets.Multi30k() is not accessible. Let's hope it will be fixed soon.

Jun 24 '22 07:06 tiassap

How can I download the dataset in Colab? I mean, what do I need to change in the code so that this download works?

train, val, test = datasets.Multi30k('data', language_pair=("de", "en"))

Oct 27 '23 07:10 EsmaeilChitgar

from torchtext.datasets import multi30k

multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"
multi30k.URL["test"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt16_task1_test.tar.gz"

multi30k.MD5["train"] = "20140d013d05dd9a72dfde46478663ba05737ce983f478f960c1123c6671be5e"
multi30k.MD5["valid"] = "a7aa20e9ebd5ba5adce7909498b94410996040857154dab029851af3a866da8c"
multi30k.MD5["test"] = "6d1ca1dba99e2c5dd54cae1226ff11c2551e6ce63527ebb072a1f70f72a5cd36"

https://discuss.pytorch.org/t/build-vocab-from-iterator-does-not-work-in-notebook/153575/16
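One note on the hashes: although the dictionary is named MD5, the values above are 64 hex characters long, which is the length of a SHA-256 digest (an MD5 digest has 32). If you download the archives by hand and want to check them against those values, a sketch like the following should work. The helper name `file_sha256` and the local file name in the usage comment are hypothetical, and the assumption that the values are SHA-256 digests is based on their length alone.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in fixed-size chunks so large archives are not loaded
        # into memory all at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical usage, assuming training.tar.gz was saved locally:
# expected = "20140d013d05dd9a72dfde46478663ba05737ce983f478f960c1123c6671be5e"
# print(file_sha256("training.tar.gz") == expected)
```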

Nov 27 '23 18:11 g-i-o-r-g-i-o

Thanks! It works!

Mar 31 '24 08:03 minsuk-sung
