
FileExistsError: data-bin/iwslt14.tokenized.de-en/dict.de.txt


📚 Documentation

In fairseq/examples/translation/, I ran into a problem while trying to train a new model:

1) Data Processing | IWSLT'14 German to English (Transformer)

Traceback (most recent call last):
  File "/home/panq/miniconda3/envs/fairseq/bin/fairseq-preprocess", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-preprocess')())
  File "/home/panq/python_project/fairseq/fairseq_cli/preprocess.py", line 391, in cli_main
    main(args)
  File "/home/panq/python_project/fairseq/fairseq_cli/preprocess.py", line 301, in main
    raise FileExistsError(_dict_path(args.source_lang, args.destdir))
FileExistsError: data-bin/iwslt14.tokenized.de-en/dict.de.txt

The problem is that this file already exists at '/home/panq/python_project/fairseq/data-bin/iwslt14.tokenized.de-en/dict.de.txt', and I am running the script from '/home/panq/python_project/fairseq/'.
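For reference, the preprocessing step in question is roughly the following command from the example README (reproduced from memory; the exact flags in your checkout may differ):

# Binarize the IWSLT'14 de-en data; this writes dict.de.txt / dict.en.txt into --destdir
TEXT=examples/translation/iwslt14.tokenized.de-en
fairseq-preprocess --source-lang de --target-lang en \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/iwslt14.tokenized.de-en \
    --workers 20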

2) I tried to revise fairseq/fairseq_cli/preprocess.py by adding two lines so that it picks up the existing src and tgt dictionaries (a command-line sketch of the same idea is at the end of this comment):

args.srcdict = _dict_path(args.source_lang, args.destdir)
args.tgtdict = _dict_path(args.target_lang, args.destdir)

3) Then it reported:

Traceback (most recent call last):
  File "/home/panq/miniconda3/envs/fairseq/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/panq/python_project/fairseq/fairseq/binarizer.py", line 217, in _binarize_chunk_and_finalize
    ds, summ = cls._binarize_file_chunk(
  File "/home/panq/python_project/fairseq/fairseq/binarizer.py", line 199, in _binarize_file_chunk
    ds.add_item(binarizer.binarize_line(line, summary))
  File "/home/panq/python_project/fairseq/fairseq/binarizer.py", line 275, in binarize_line
    ids = self.dict.encode_line(
  File "/home/panq/python_project/fairseq/fairseq/data/dictionary.py", line 317, in encode_line
    ids = torch.IntTensor(nwords + 1 if append_eos else nwords)
RuntimeError: std::bad_alloc

So, what should I do to train a new model in the translation example?
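(For reference, an equivalent way to reuse the existing dictionaries without editing preprocess.py would be the --srcdict/--tgtdict options of fairseq-preprocess. A rough, untested sketch, where DICTDIR and the new --destdir name are only placeholders:)

# Reuse the dictionaries already in data-bin instead of rebuilding them
TEXT=examples/translation/iwslt14.tokenized.de-en
DICTDIR=data-bin/iwslt14.tokenized.de-en
fairseq-preprocess --source-lang de --target-lang en \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --srcdict $DICTDIR/dict.de.txt --tgtdict $DICTDIR/dict.en.txt \
    --destdir data-bin/iwslt14.tokenized.de-en-rebuilt \
    --workers 20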

jack-pan-ai · Jan 18 '22

I met the same std::bad_alloc problem when following the instructions in fairseq/examples/translation/. When I ran this step:

# Binarize the data
TEXT=examples/backtranslation/wmt18_en_de
fairseq-preprocess \
    --joined-dictionary \
    --source-lang en --target-lang de \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/wmt18_en_de --thresholdtgt 0 --thresholdsrc 0 \
    --workers 20

the following error came up:

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/zxf/miniconda3/envs/fairseq/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/newdisk/zxf/fairseq-main/fairseq/binarizer.py", line 217, in _binarize_chunk_and_finalize
    ds, summ = cls._binarize_file_chunk(
  File "/home/newdisk/zxf/fairseq-main/fairseq/binarizer.py", line 199, in _binarize_file_chunk
    ds.add_item(binarizer.binarize_line(line, summary))
  File "/home/newdisk/zxf/fairseq-main/fairseq/binarizer.py", line 275, in binarize_line
    ids = self.dict.encode_line(
  File "/home/newdisk/zxf/fairseq-main/fairseq/data/dictionary.py", line 317, in encode_line
    ids = torch.IntTensor(nwords + 1 if append_eos else nwords)
RuntimeError: std::bad_alloc
"""


The above exception was the direct cause of the following exception:


Traceback (most recent call last):
  File "/home/zxf/miniconda3/envs/fairseq/bin/fairseq-preprocess", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-preprocess')())
  File "/home/newdisk/zxf/fairseq-main/fairseq_cli/preprocess.py", line 389, in cli_main
    main(args)
  File "/home/newdisk/zxf/fairseq-main/fairseq_cli/preprocess.py", line 372, in main
    _make_all(args.source_lang, src_dict, args)
  File "/home/newdisk/zxf/fairseq-main/fairseq_cli/preprocess.py", line 185, in _make_all
    _make_dataset(
  File "/home/newdisk/zxf/fairseq-main/fairseq_cli/preprocess.py", line 178, in _make_dataset
    _make_binary_dataset(
  File "/home/newdisk/zxf/fairseq-main/fairseq_cli/preprocess.py", line 119, in _make_binary_dataset
    final_summary = FileBinarizer.multiprocess_dataset(
  File "/home/newdisk/zxf/fairseq-main/fairseq/binarizer.py", line 136, in multiprocess_dataset
    summ = r.get()
  File "/home/zxf/miniconda3/envs/fairseq/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
RuntimeError: std::bad_alloc

It seems to be a multiprocessing problem in Python. I tried --workers 1 to avoid multiprocessing (the single-worker command is shown below), but I still hit the same bad_alloc error. What should I do to solve this? @jack-pan-ai Have you solved this problem?
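The single-worker run, for reference, is the same command as above with only the worker count changed:

# Same binarization command, with multiprocessing effectively disabled
TEXT=examples/backtranslation/wmt18_en_de
fairseq-preprocess \
    --joined-dictionary \
    --source-lang en --target-lang de \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/wmt18_en_de --thresholdtgt 0 --thresholdsrc 0 \
    --workers 1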

ZHANG-GuiGui · Mar 18 '22

I met the same problem. I guess it throws this error exactly because there is already a dict.de.txt file under the corresponding path. If you need to create a different one, just delete the one that's already there. That worked for me.
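Concretely, that means removing the previously generated dictionary files (or the whole --destdir, if you don't need the earlier binarized data) before re-running fairseq-preprocess:

# Remove the stale dictionaries so fairseq-preprocess can rebuild them
rm data-bin/iwslt14.tokenized.de-en/dict.de.txt \
   data-bin/iwslt14.tokenized.de-en/dict.en.txt
# or start completely fresh:
# rm -r data-bin/iwslt14.tokenized.de-en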

martianmartina · Jul 28 '22

> I met the same problem. I guess it throws this error exactly because there is already a dict.de.txt file under the corresponding path. If you need to create a different one, just delete the one that's already there. That worked for me.

That worked for me too! Thank you!

Skywalker-Harrison · Aug 20 '22