annotated-transformer
Could not get the file at http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz. [RequestException] None.
I ran the code on Google colab.
When building German vocabulary here:
if is_interactive_notebook():
    # global variables used later in the script
    spacy_de, spacy_en = show_example(load_tokenizers)
    vocab_src, vocab_tgt = show_example(load_vocab, args=[spacy_de, spacy_en])
This error showed up:
Could not get the file at http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz. [RequestException] None.
Is this a problem with torchtext? I found that the error occurs when this line is called:
vocab_src = build_vocab_from_iterator(
    yield_tokens(train + val + test, tokenize_de, index=0),
    min_freq=2,
    specials=["<s>", "</s>", "<blank>", "<unk>"],
)
Thank you in advance.
I am having the same problem. It seems that the file at
http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz
is no longer available.
The maintainer of this repository:
https://github.com/PetrochukM/PyTorch-NLP/blob/master/torchnlp/datasets/multi30k.py
writes:
"Host www.quest.dcs.shef.ac.uk forgot to update their SSL certificate; therefore, this dataset does not download securely."
Hope this offers some insight into the problem.
Thank you for the info @aambrioso1
@tiassap I ran into the same problem you described. Did you find a workaround to access those files?
I was able to get the code to work by using another data file. The basic idea is that the training, validation, and test sets are all lists of tuples. Each tuple is a sentence pair, with one sentence in each language. This insight is nice since it makes it easy to create any language pairing you would like. Here is my implementation in Colab along with lots of notes:
https://colab.research.google.com/drive/131hohvAKRqzHg4K3_68UGL4oi4SGOB45?usp=sharing
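To make the data format described above concrete, here is a minimal sketch that builds lists of sentence-pair tuples from two line-aligned text files. The names `load_pairs` and `split_pairs`, the file paths, and the split sizes are all hypothetical and are not taken from the linked notebook; any parallel corpus stored one sentence per line should fit this shape.

```python
import random


def load_pairs(src_path, tgt_path):
    """Read two line-aligned files into a list of (source, target) tuples."""
    with open(src_path, encoding="utf-8") as f_src:
        src_lines = [line.rstrip("\n") for line in f_src]
    with open(tgt_path, encoding="utf-8") as f_tgt:
        tgt_lines = [line.rstrip("\n") for line in f_tgt]
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target files must have the same number of lines")
    # Drop pairs where either side is empty after stripping whitespace.
    return [
        (s.strip(), t.strip())
        for s, t in zip(src_lines, tgt_lines)
        if s.strip() and t.strip()
    ]


def split_pairs(pairs, n_val, n_test, seed=0):
    """Shuffle a copy of the pairs and split it into (train, val, test) lists."""
    if n_val + n_test > len(pairs):
        raise ValueError("n_val + n_test exceeds the number of pairs")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test
```

With hypothetical files `corpus.de` and `corpus.en`, calling `split_pairs(load_pairs("corpus.de", "corpus.en"), n_val=1000, n_test=1000)` would give three lists that can stand in for the Multi30k splits.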
Thank you @aambrioso1. It is very helpful.
So we can also use other datasets, as long as the data has the format [(de_1, en_1), ..., (de_n, en_n)],
and you are using the German/English dataset from the European Parliament Proceedings Parallel Corpus 1996-2011: https://www.statmt.org/europarl/.
The training, val, and test datasets are declared as global variables.
Just for information, @youbinaa: it seems that Multi30k can also be downloaded from this repo: https://github.com/multi30k/dataset.
The problem is that the source URL used by torchtext.datasets.Multi30k()
is not accessible. Let's hope it will be fixed soon.
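Because the splits are plain lists of tuples, the `index` argument in the `yield_tokens(train + val + test, tokenize_de, index=0)` call from the original post just selects which side of each pair gets tokenized. The sketch below illustrates that idea without torchtext or spaCy. It rests on several assumptions: `str.split` stands in for the real tokenizer, `yield_tokens` is written here to behave the way the call suggests, and `count_vocab` is a hypothetical helper that only mimics the `min_freq` and `specials` arguments, not the actual `build_vocab_from_iterator` implementation or its token ordering.

```python
from collections import Counter


def yield_tokens(pairs, tokenizer, index):
    """Yield the token list for one side of each (de, en) pair."""
    for pair in pairs:
        yield tokenizer(pair[index])


def count_vocab(token_lists, min_freq=1, specials=()):
    """Map tokens to ids: specials first, then tokens seen at least min_freq times."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Alphabetical order is an arbitrary choice made for this sketch.
    kept = sorted(
        tok for tok, n in counts.items() if n >= min_freq and tok not in specials
    )
    return {tok: i for i, tok in enumerate(list(specials) + kept)}


# Toy data in the [(de_1, en_1), ..., (de_n, en_n)] format.
pairs = [
    ("ein Hund läuft", "a dog runs"),
    ("ein Mann läuft", "a man runs"),
]
specials = ["<s>", "</s>", "<blank>", "<unk>"]

# index=0 picks the German side, index=1 the English side.
vocab_src = count_vocab(yield_tokens(pairs, str.split, index=0), min_freq=2, specials=specials)
vocab_tgt = count_vocab(yield_tokens(pairs, str.split, index=1), min_freq=2, specials=specials)
```

Tokens that appear only once ("Hund", "Mann", "dog", "man") fall below `min_freq=2` and are left out, so a real pipeline would map them to `<unk>`.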
How can I download it in Colab? I mean, what change do I need to make in the code to download it?
train, val, test = datasets.Multi30k('data', language_pair=("de", "en"))
from torchtext.datasets import multi30k
multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"
multi30k.URL["test"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt16_task1_test.tar.gz"
multi30k.MD5["train"] = "20140d013d05dd9a72dfde46478663ba05737ce983f478f960c1123c6671be5e"
multi30k.MD5["valid"] = "a7aa20e9ebd5ba5adce7909498b94410996040857154dab029851af3a866da8c"
multi30k.MD5["test"] = "6d1ca1dba99e2c5dd54cae1226ff11c2551e6ce63527ebb072a1f70f72a5cd36"
https://discuss.pytorch.org/t/build-vocab-from-iterator-does-not-work-in-notebook/153575/16
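One detail worth noting: the values assigned to `multi30k.MD5` are 64 hexadecimal characters long, which matches the length of a SHA-256 digest rather than an MD5 digest (32 characters). If you want to check a mirrored archive by hand before trusting it, a sketch along these lines may help. It assumes the expected values are SHA-256 digests and that the archive has already been downloaded to a local path; `sha256_of_file` is a hypothetical helper, not part of torchtext.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, the result of `sha256_of_file("training.tar.gz")` could be compared with the `"train"` value; the local file name here is only an example.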
Thanks! It works!