fairseq
KeyError: 27707 in pytorch 1.7.1 (GeForce RTX 3090)
🐛 Bug
I don't understand the meaning of "KeyError: 27707"; it seems something goes wrong across the multiple processes? I used the same command line with fairseq 0.9.0, PyTorch 1.5, 2× 2080 Ti, CUDA 10.2, and it works. But it does not work on the RTX 3090.
To Reproduce
Steps to reproduce the behavior (always include the command you ran):
1. Run the command:
CUDA_VISIBLE_DEVICES=0,1 fairseq-train article_tw2en_1203_zhen_join/ -a transformer_wmt_en_de_big --optimizer adam -s zh -t en --label-smoothing 0.1 --dropout 0.3 --max-tokens 4000 --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 --criterion label_smoothed_cross_entropy --max-update 2100000 --warmup-updates 10000 --warmup-init-lr '1e-7' --lr '0.001' --share-all-embeddings --adam-betas '(0.9, 0.98)' --adam-eps '1e-09' --clip-norm 25.0 --keep-last-epochs 2 --update-freq 1 --save-dir checkpoint_test |& tee -a test.log
2. See the error:
2021-01-26 14:12:23 | INFO | fairseq_cli.train | task: translation (TranslationTask)
2021-01-26 14:12:23 | INFO | fairseq_cli.train | model: transformer_wmt_en_de_big (TransformerModel)
2021-01-26 14:12:23 | INFO | fairseq_cli.train | criterion: label_smoothed_cross_entropy (LabelSmoothedCrossEntropyCriterion)
2021-01-26 14:12:23 | INFO | fairseq_cli.train | num. model params: 259964928 (num. trained: 259964928)
2021-01-26 14:12:23 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.embed_tokens.weight
2021-01-26 14:12:23 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.output_projection.weight
2021-01-26 14:12:23 | INFO | fairseq.utils | ***********************CUDA enviroments for all 2 workers***********************
2021-01-26 14:12:23 | INFO | fairseq.utils | rank 0: capabilities = 8.6 ; total memory = 23.700 GB ; name = GeForce RTX 3090
2021-01-26 14:12:23 | INFO | fairseq.utils | rank 1: capabilities = 8.6 ; total memory = 23.697 GB ; name = GeForce RTX 3090
2021-01-26 14:12:23 | INFO | fairseq.utils | ***********************CUDA enviroments for all 2 workers***********************
2021-01-26 14:12:23 | INFO | fairseq_cli.train | training on 2 devices (GPUs/TPUs)
2021-01-26 14:12:23 | INFO | fairseq_cli.train | max tokens per GPU = 4000 and max sentences per GPU = None
2021-01-26 14:12:23 | INFO | fairseq.trainer | no existing checkpoint found checkpoint_test/checkpoint_last.pt
2021-01-26 14:12:23 | INFO | fairseq.trainer | loading train data for epoch 1
2021-01-26 14:12:23 | INFO | fairseq.data.data_utils | loaded 92733 examples from: article_tw2en_1203_zhen_join/train.zh-en.zh
2021-01-26 14:12:23 | INFO | fairseq.data.data_utils | loaded 92733 examples from: article_tw2en_1203_zhen_join/train.zh-en.en
2021-01-26 14:12:23 | INFO | fairseq.tasks.translation | article_tw2en_1203_zhen_join/ train zh-en 92733 examples
/usr/lib/python3/dist-packages/torch/nn/parallel/distributed.py:397: UserWarning: The `check_reduction` argument in `DistributedDataParallel` module is deprecated. Please avoid using it.
warnings.warn(
/usr/lib/python3/dist-packages/torch/nn/parallel/distributed.py:397: UserWarning: The `check_reduction` argument in `DistributedDataParallel` module is deprecated. Please avoid using it.
warnings.warn(
Traceback (most recent call last):
File "/home/blue90211/lambda-stack-with-tensorflow-pytorch/bin/fairseq-train", line 8, in <module>
sys.exit(cli_main())
File "/home/blue90211/lambda-stack-with-tensorflow-pytorch/lib/python3.8/site-packages/fairseq_cli/train.py", line 352, in cli_main
distributed_utils.call_main(args, main)
File "/home/blue90211/lambda-stack-with-tensorflow-pytorch/lib/python3.8/site-packages/fairseq/distributed_utils.py", line 283, in call_main
torch.multiprocessing.spawn(
File "/usr/lib/python3/dist-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/usr/lib/python3/dist-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "/usr/lib/python3/dist-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/blue90211/lambda-stack-with-tensorflow-pytorch/lib/python3.8/site-packages/fairseq/distributed_utils.py", line 270, in distributed_main
main(args, **kwargs)
File "/home/blue90211/lambda-stack-with-tensorflow-pytorch/lib/python3.8/site-packages/fairseq_cli/train.py", line 110, in main
extra_state, epoch_itr = checkpoint_utils.load_checkpoint(
File "/home/blue90211/lambda-stack-with-tensorflow-pytorch/lib/python3.8/site-packages/fairseq/checkpoint_utils.py", line 212, in load_checkpoint
epoch_itr = trainer.get_train_iterator(
File "/home/blue90211/lambda-stack-with-tensorflow-pytorch/lib/python3.8/site-packages/fairseq/trainer.py", line 382, in get_train_iterator
self.reset_dummy_batch(batch_iterator.first_batch)
File "/home/blue90211/lambda-stack-with-tensorflow-pytorch/lib/python3.8/site-packages/fairseq/data/iterators.py", line 288, in first_batch
return self.collate_fn([self.dataset[i] for i in self.frozen_batches[0]])
File "/home/blue90211/lambda-stack-with-tensorflow-pytorch/lib/python3.8/site-packages/fairseq/data/iterators.py", line 288, in <listcomp>
return self.collate_fn([self.dataset[i] for i in self.frozen_batches[0]])
File "/home/blue90211/lambda-stack-with-tensorflow-pytorch/lib/python3.8/site-packages/fairseq/data/language_pair_dataset.py", line 305, in __getitem__
tgt_item = self.tgt[index] if self.tgt is not None else None
File "/home/blue90211/lambda-stack-with-tensorflow-pytorch/lib/python3.8/site-packages/fairseq/data/indexed_dataset.py", line 234, in __getitem__
ptx = self.cache_index[i]
KeyError: 27707
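For context on where this error comes from: the failing line looks up an entry in a plain-dict cache that is only populated for prefetched indices, so an index that was never prefetched raises a bare `KeyError`. A minimal sketch of that failure mode (a hypothetical `CachedDataset` class for illustration, not fairseq's actual implementation):

```python
# Sketch of a cached dataset whose cache_index dict is filled by prefetch().
# Accessing an index that was never prefetched raises KeyError, which is
# the same shape of failure as "KeyError: 27707" in the traceback above.
class CachedDataset:
    def __init__(self, data):
        self.data = data
        self.cache_index = {}  # index -> cached item, filled by prefetch()

    def prefetch(self, indices):
        for i in indices:
            self.cache_index[i] = self.data[i]

    def __getitem__(self, i):
        # KeyError here if i was never prefetched
        return self.cache_index[i]


ds = CachedDataset(list(range(100)))
ds.prefetch(range(50))
print(ds[10])  # fine: index 10 was prefetched
try:
    ds[77]     # never prefetched
except KeyError as e:
    print("KeyError:", e)
```

In other words, the raw `KeyError: 27707` is a symptom that the dataset cache and the batch indices disagree, not a direct CUDA error, which is why the later comments point at the environment rather than the data.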
Environment
- fairseq Version: 0.10.2
- PyTorch Version: 1.7.1
- OS (e.g., Linux): Linux (Ubuntu 20.04 LTS)
- How you installed fairseq (pip, source): pip
- Build command you used (if compiling from source): pip install fairseq
- Python version: 3.8.5
- CUDA/cuDNN version: 11.2
- GPU models and configuration: 2× GeForce RTX 3090
- Any other relevant information: Transformer model, running on both GPUs
Can you try running the following command:
CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=0 fairseq-train article_tw2en_1203_zhen_join/ -a transformer_wmt_en_de_big --optimizer adam -s zh -t en --label-smoothing 0.1 --dropout 0.3 --max-tokens 4000 --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 --criterion label_smoothed_cross_entropy --max-update 2100000 --warmup-updates 10000 --warmup-init-lr '1e-7' --lr '0.001' --share-all-embeddings --adam-betas '(0.9, 0.98)' --adam-eps '1e-09' --clip-norm 25.0 --keep-last-epochs 2 --update-freq 1 --num-workers 0 --save-dir checkpoint_test |& tee -a test.log
Note the CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=0 environment variables, and the --num-workers 0 flag.
I tried the command, but got the same error!
CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=0 fairseq-train article_tw2en_1203_zhen_join/ -a transformer_wmt_en_de_big --optimizer adam -s zh -t en --label-smoothing 0.1 --dropout 0.3 --max-tokens 4000 --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 --criterion label_smoothed_cross_entropy --max-update 2100000 --warmup-updates 10000 --warmup-init-lr '1e-7' --lr '0.001' --share-all-embeddings --adam-betas '(0.9, 0.98)' --adam-eps '1e-09' --clip-norm 25.0 --keep-last-epochs 2 --update-freq 1 --num-workers 0 --save-dir checkpoint_test |& tee -a test.log
Hello, I am facing the same issue. Below are my training params:
fairseq-train data_bin --finetune-from-model ./418M_last_checkpoint.pt --save-dir ./checkpoint --task translation_multi_simple_epoch --encoder-normalize-before --langs 'af,am,ar,ast,az,ba,be,bg,bn,br,bs,ca,ceb,cs,cy,da,de,el,en,es,et,fa,ff,fi,fr,fy,ga,gd,gl,gu,ha,he,hi,hr,ht,hu,hy,id,ig,ilo,is,it,ja,jv,ka,kk,km,kn,ko,lb,lg,ln,lo,lt,lv,mg,mk,ml,mn,mr,ms,my,ne,nl,no,ns,oc,or,pa,pl,ps,pt,ro,ru,sd,si,sk,sl,so,sq,sr,ss,su,sv,sw,ta,th,tl,tn,tr,uk,ur,uz,vi,wo,xh,yi,yo,zh,zu' --lang-pairs 'en-fr,fr-en' --max-tokens 1200 --decoder-normalize-before --sampling-method temperature --sampling-temperature 1.5 --encoder-langtok src --decoder-langtok --criterion label_smoothed_cross_entropy --label-smoothing 0.2 --optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' --lr-scheduler inverse_sqrt --lr 3e-05 --warmup-updates 2500 --max-update 40000 --dropout 0.3 --attention-dropout 0.1 --weight-decay 0.0 --update-freq 2 --save-interval 1 --save-interval-updates 5000 --keep-interval-updates 10 --no-epoch-checkpoints --seed 222 --log-format simple --log-interval 2 --patience 10 --arch transformer_wmt_en_de_big --encoder-layers 12 --decoder-layers 12 --encoder-layerdrop 0.05 --decoder-layerdrop 0.05 --share-decoder-input-output-embed --share-all-embeddings --ddp-backend no_c10d
I tried the solution proposed by @lematt1991 but am still getting the same error.
Thanks a lot in advance.
In my experience, this issue is about the environment.
The PyTorch version has to match your CUDA version.
Follow the install commands here: https://pytorch.org/get-started/previous-versions/
Sometimes installing with pip produces other problems; in that case, try conda.
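To make that advice concrete: the RTX 3090 is an Ampere GPU (compute capability 8.6, as the log above shows), and PyTorch wheels built against CUDA 10.x cannot target it, while the older 2080 Ti (capability 7.5) works fine with CUDA 10.2 builds. A rough illustrative check of that mismatch (the version table is an assumption covering only the two GPUs mentioned in this thread, not an exhaustive list):

```python
# Minimum CUDA toolkit version a PyTorch build needs in order to run on a
# GPU of the given compute capability (illustrative subset only).
MIN_CUDA_FOR_CAPABILITY = {
    (7, 5): (10, 0),   # Turing, e.g. RTX 2080 Ti
    (8, 6): (11, 0),   # Ampere, e.g. RTX 3090: needs a CUDA 11.x build
}


def cuda_build_ok(capability, cuda_build):
    """Return True if a PyTorch build's CUDA version supports the GPU."""
    needed = MIN_CUDA_FOR_CAPABILITY.get(capability)
    return needed is not None and cuda_build >= needed


# The setup that worked: 2080 Ti on a CUDA 10.2 build.
print(cuda_build_ok((7, 5), (10, 2)))
# The failing setup: a 3090 with a build targeting CUDA < 11.
print(cuda_build_ok((8, 6), (10, 2)))
```

At runtime you can read the actual values with `torch.version.cuda` and `torch.cuda.get_device_capability(0)` and compare them against the build matrix on the previous-versions page linked above.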