fairseq
FileExistsError: data-bin/iwslt14.tokenized.de-en/dict.de.txt
📚 Documentation
In fairseq/examples/translation/, there is a problem regarding training a new model:
1) Data Processing | IWSLT'14 German to English (Transformer)
Traceback (most recent call last):
File "/home/panq/miniconda3/envs/fairseq/bin/fairseq-preprocess", line 33, in <module>
sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-preprocess')())
File "/home/panq/python_project/fairseq/fairseq_cli/preprocess.py", line 391, in cli_main
main(args)
File "/home/panq/python_project/fairseq/fairseq_cli/preprocess.py", line 301, in main
raise FileExistsError(_dict_path(args.source_lang, args.destdir))
FileExistsError: data-bin/iwslt14.tokenized.de-en/dict.de.txt
The problem is that the file already exists at '/home/panq/python_project/fairseq/data-bin/iwslt14.tokenized.de-en/dict.de.txt', and I am running the script from '/home/panq/python_project/fairseq/'.
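Judging from the traceback, fairseq's preprocess step refuses to overwrite a dictionary that already exists in the destination directory. The sketch below is my paraphrase of that guard (an assumption reconstructed from the error, not fairseq's actual code): the error goes away either by deleting the stale `dict.de.txt` or by explicitly passing the existing dictionary back in via `--srcdict`.

```python
import os
import tempfile


def check_dict_overwrite(lang, destdir, srcdict=None):
    """Sketch of the guard in fairseq_cli/preprocess.py (reconstructed
    from the traceback): refuse to overwrite an existing dictionary
    unless the caller explicitly supplies one via --srcdict."""
    dict_file = os.path.join(destdir, f"dict.{lang}.txt")
    if srcdict is None and os.path.exists(dict_file):
        raise FileExistsError(dict_file)
    return dict_file


# Reproduce the failure with a stale dictionary already in the destdir.
destdir = tempfile.mkdtemp()
open(os.path.join(destdir, "dict.de.txt"), "w").close()
try:
    check_dict_overwrite("de", destdir)
except FileExistsError as e:
    print("would fail:", e)

# Passing the existing file as srcdict (or deleting it) avoids the error.
check_dict_overwrite("de", destdir, srcdict=os.path.join(destdir, "dict.de.txt"))
```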
2) I tried to revise fairseq/fairseq_cli/preprocess.py by adding two lines to locate the source and target dictionaries:
args.srcdict = _dict_path(args.source_lang, args.destdir)
args.tgtdict = _dict_path(args.target_lang, args.destdir)
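For context, here is what that patch effectively points the dictionaries at. The helper below is my guess at `_dict_path` based on the path printed in the error (not fairseq's actual source); note that `fairseq-preprocess` already exposes `--srcdict` and `--tgtdict` flags, so passing the paths on the command line should achieve the same thing without editing the source.

```python
import os


def dict_path(lang, destdir):
    # Assumed behavior of fairseq's _dict_path helper, inferred from the
    # error message: the dictionary lives at <destdir>/dict.<lang>.txt.
    return os.path.join(destdir, f"dict.{lang}.txt")


print(dict_path("de", "data-bin/iwslt14.tokenized.de-en"))
# -> data-bin/iwslt14.tokenized.de-en/dict.de.txt
```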
3) Then it reported:
Traceback (most recent call last):
File "/home/panq/miniconda3/envs/fairseq/lib/python3.9/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/home/panq/python_project/fairseq/fairseq/binarizer.py", line 217, in _binarize_chunk_and_finalize
ds, summ = cls._binarize_file_chunk(
File "/home/panq/python_project/fairseq/fairseq/binarizer.py", line 199, in _binarize_file_chunk
ds.add_item(binarizer.binarize_line(line, summary))
File "/home/panq/python_project/fairseq/fairseq/binarizer.py", line 275, in binarize_line
ids = self.dict.encode_line(
File "/home/panq/python_project/fairseq/fairseq/data/dictionary.py", line 317, in encode_line
ids = torch.IntTensor(nwords + 1 if append_eos else nwords)
RuntimeError: std::bad_alloc
So, what should I do to train a new model in the translation example?
I met the same std::bad_alloc problem when following the instructions in fairseq/examples/translation/.
When I run the binarization step
# Binarize the data
TEXT=examples/backtranslation/wmt18_en_de
fairseq-preprocess \
--joined-dictionary \
--source-lang en --target-lang de \
--trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
--destdir data-bin/wmt18_en_de --thresholdtgt 0 --thresholdsrc 0 \
--workers 20
The problem came up as
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/zxf/miniconda3/envs/fairseq/lib/python3.9/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/home/newdisk/zxf/fairseq-main/fairseq/binarizer.py", line 217, in _binarize_chunk_and_finalize
ds, summ = cls._binarize_file_chunk(
File "/home/newdisk/zxf/fairseq-main/fairseq/binarizer.py", line 199, in _binarize_file_chunk
ds.add_item(binarizer.binarize_line(line, summary))
File "/home/newdisk/zxf/fairseq-main/fairseq/binarizer.py", line 275, in binarize_line
ids = self.dict.encode_line(
File "/home/newdisk/zxf/fairseq-main/fairseq/data/dictionary.py", line 317, in encode_line
ids = torch.IntTensor(nwords + 1 if append_eos else nwords)
RuntimeError: std::bad_alloc
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/zxf/miniconda3/envs/fairseq/bin/fairseq-preprocess", line 33, in <module>
sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-preprocess')())
File "/home/newdisk/zxf/fairseq-main/fairseq_cli/preprocess.py", line 389, in cli_main
main(args)
File "/home/newdisk/zxf/fairseq-main/fairseq_cli/preprocess.py", line 372, in main
_make_all(args.source_lang, src_dict, args)
File "/home/newdisk/zxf/fairseq-main/fairseq_cli/preprocess.py", line 185, in _make_all
_make_dataset(
File "/home/newdisk/zxf/fairseq-main/fairseq_cli/preprocess.py", line 178, in _make_dataset
_make_binary_dataset(
File "/home/newdisk/zxf/fairseq-main/fairseq_cli/preprocess.py", line 119, in _make_binary_dataset
final_summary = FileBinarizer.multiprocess_dataset(
File "/home/newdisk/zxf/fairseq-main/fairseq/binarizer.py", line 136, in multiprocess_dataset
summ = r.get()
File "/home/zxf/miniconda3/envs/fairseq/lib/python3.9/multiprocessing/pool.py", line 771, in get
raise self._value
RuntimeError: std::bad_alloc
It seems to be a Python multiprocessing problem. I tried --workers 1 to avoid multiprocessing, but I still hit the same bad_alloc error.
What should I do to solve this?
@jack-pan-ai Have you solved this problem?
I met the same problem. I guess it throws this error precisely because a dict.de.txt file already exists under that path. If you need to create a new one, just delete the existing file. That worked for me.
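In script form, the cleanup suggested above amounts to removing any leftover `dict.<lang>.txt` files from the destination directory before re-running `fairseq-preprocess` (the directory name below is just an illustration; substitute your own `--destdir`):

```python
import glob
import os
import tempfile


def clear_stale_dicts(destdir):
    """Delete any dict.<lang>.txt left over from a previous run so that
    fairseq-preprocess can rebuild the dictionaries from scratch."""
    removed = []
    for path in glob.glob(os.path.join(destdir, "dict.*.txt")):
        os.remove(path)
        removed.append(path)
    return removed


# Example against a throwaway directory standing in for data-bin/...
destdir = tempfile.mkdtemp()
open(os.path.join(destdir, "dict.de.txt"), "w").close()
print(clear_stale_dicts(destdir))  # the stale dict.de.txt is removed
```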
That worked for me too! Thank you!