
KeyError: 27707 in pytorch 1.7.1 (GeForce RTX 3090)

Open Daisy-123 opened this issue 3 years ago • 4 comments

🐛 Bug

I don't understand the meaning of "KeyError: 27707"; it seems something goes wrong with the multiple processes? I used the same command line with fairseq 0.9.0, PyTorch 1.5, 2× 2080 Ti, and CUDA 10.2, and it works. But it does not work on the RTX 3090.

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

  1. Run cmd CUDA_VISIBLE_DEVICES=0,1 fairseq-train article_tw2en_1203_zhen_join/ -a transformer_wmt_en_de_big --optimizer adam -s zh -t en --label-smoothing 0.1 --dropout 0.3 --max-tokens 4000 --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 --criterion label_smoothed_cross_entropy --max-update 2100000 --warmup-updates 10000 --warmup-init-lr '1e-7' --lr '0.001' --share-all-embeddings --adam-betas '(0.9, 0.98)' --adam-eps '1e-09' --clip-norm 25.0 --keep-last-epochs 2 --update-freq 1 --save-dir checkpoint_test |& tee -a test.log

  2. See error

2021-01-26 14:12:23 | INFO | fairseq_cli.train | task: translation (TranslationTask)
2021-01-26 14:12:23 | INFO | fairseq_cli.train | model: transformer_wmt_en_de_big (TransformerModel)
2021-01-26 14:12:23 | INFO | fairseq_cli.train | criterion: label_smoothed_cross_entropy (LabelSmoothedCrossEntropyCriterion)
2021-01-26 14:12:23 | INFO | fairseq_cli.train | num. model params: 259964928 (num. trained: 259964928)
2021-01-26 14:12:23 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.embed_tokens.weight
2021-01-26 14:12:23 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.output_projection.weight
2021-01-26 14:12:23 | INFO | fairseq.utils | ***********************CUDA enviroments for all 2 workers***********************
2021-01-26 14:12:23 | INFO | fairseq.utils | rank   0: capabilities =  8.6  ; total memory = 23.700 GB ; name = GeForce RTX 3090
2021-01-26 14:12:23 | INFO | fairseq.utils | rank   1: capabilities =  8.6  ; total memory = 23.697 GB ; name = GeForce RTX 3090
2021-01-26 14:12:23 | INFO | fairseq.utils | ***********************CUDA enviroments for all 2 workers***********************
2021-01-26 14:12:23 | INFO | fairseq_cli.train | training on 2 devices (GPUs/TPUs)
2021-01-26 14:12:23 | INFO | fairseq_cli.train | max tokens per GPU = 4000 and max sentences per GPU = None
2021-01-26 14:12:23 | INFO | fairseq.trainer | no existing checkpoint found checkpoint_test/checkpoint_last.pt
2021-01-26 14:12:23 | INFO | fairseq.trainer | loading train data for epoch 1
2021-01-26 14:12:23 | INFO | fairseq.data.data_utils | loaded 92733 examples from: article_tw2en_1203_zhen_join/train.zh-en.zh
2021-01-26 14:12:23 | INFO | fairseq.data.data_utils | loaded 92733 examples from: article_tw2en_1203_zhen_join/train.zh-en.en
2021-01-26 14:12:23 | INFO | fairseq.tasks.translation | article_tw2en_1203_zhen_join/ train zh-en 92733 examples
/usr/lib/python3/dist-packages/torch/nn/parallel/distributed.py:397: UserWarning: The `check_reduction` argument in `DistributedDataParallel` module is deprecated. Please avoid using it.
  warnings.warn(
/usr/lib/python3/dist-packages/torch/nn/parallel/distributed.py:397: UserWarning: The `check_reduction` argument in `DistributedDataParallel` module is deprecated. Please avoid using it.
  warnings.warn(
Traceback (most recent call last):
  File "/home/blue90211/lambda-stack-with-tensorflow-pytorch/bin/fairseq-train", line 8, in <module>
    sys.exit(cli_main())
  File "/home/blue90211/lambda-stack-with-tensorflow-pytorch/lib/python3.8/site-packages/fairseq_cli/train.py", line 352, in cli_main
    distributed_utils.call_main(args, main)
  File "/home/blue90211/lambda-stack-with-tensorflow-pytorch/lib/python3.8/site-packages/fairseq/distributed_utils.py", line 283, in call_main
    torch.multiprocessing.spawn(
  File "/usr/lib/python3/dist-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/lib/python3/dist-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/usr/lib/python3/dist-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/blue90211/lambda-stack-with-tensorflow-pytorch/lib/python3.8/site-packages/fairseq/distributed_utils.py", line 270, in distributed_main
    main(args, **kwargs)
  File "/home/blue90211/lambda-stack-with-tensorflow-pytorch/lib/python3.8/site-packages/fairseq_cli/train.py", line 110, in main
    extra_state, epoch_itr = checkpoint_utils.load_checkpoint(
  File "/home/blue90211/lambda-stack-with-tensorflow-pytorch/lib/python3.8/site-packages/fairseq/checkpoint_utils.py", line 212, in load_checkpoint
    epoch_itr = trainer.get_train_iterator(
  File "/home/blue90211/lambda-stack-with-tensorflow-pytorch/lib/python3.8/site-packages/fairseq/trainer.py", line 382, in get_train_iterator
    self.reset_dummy_batch(batch_iterator.first_batch)
  File "/home/blue90211/lambda-stack-with-tensorflow-pytorch/lib/python3.8/site-packages/fairseq/data/iterators.py", line 288, in first_batch
    return self.collate_fn([self.dataset[i] for i in self.frozen_batches[0]])
  File "/home/blue90211/lambda-stack-with-tensorflow-pytorch/lib/python3.8/site-packages/fairseq/data/iterators.py", line 288, in <listcomp>
    return self.collate_fn([self.dataset[i] for i in self.frozen_batches[0]])
  File "/home/blue90211/lambda-stack-with-tensorflow-pytorch/lib/python3.8/site-packages/fairseq/data/language_pair_dataset.py", line 305, in __getitem__
    tgt_item = self.tgt[index] if self.tgt is not None else None
  File "/home/blue90211/lambda-stack-with-tensorflow-pytorch/lib/python3.8/site-packages/fairseq/data/indexed_dataset.py", line 234, in __getitem__
    ptx = self.cache_index[i]
KeyError: 27707
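
For context on what the error means: the traceback ends in fairseq's IndexedCachedDataset, which only serves items that were previously prefetched into its cache, so asking for any other index fails with a bare KeyError on that index (here, sample 27707). Below is a minimal toy sketch of that mechanism; it is purely illustrative and not fairseq's actual code:

import numpy as np

class ToyCachedDataset:
    """Illustrative stand-in for a prefetch-based cached dataset."""

    def __init__(self, data):
        self.data = data          # list of numpy arrays (the "binarized" samples)
        self.cache_index = {}     # sample index -> cached item, filled by prefetch()

    def prefetch(self, indices):
        # Only prefetched indices become retrievable later.
        for i in indices:
            self.cache_index[i] = self.data[i]

    def __getitem__(self, i):
        # An index that was never prefetched (or that does not exist in the
        # binarized data) raises a bare KeyError, e.g. "KeyError: 27707".
        return self.cache_index[i]

ds = ToyCachedDataset([np.arange(3), np.arange(5)])
ds.prefetch([0])
print(ds[0])   # works
print(ds[1])   # raises KeyError: 1 -- the same failure mode as the traceback above

In the log above, the failing lookup comes from first_batch indexing the dataset directly, so index 27707 was not in the cache (or not present in the binarized data) at that point.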

Environment

  • fairseq Version 0.10.2
  • PyTorch Version 1.7.1
  • OS (e.g., Linux): Linux (Ubuntu 20.04 LTS)
  • How you installed fairseq (pip, source): pip
  • Build command you used (if compiling from source): pip install fairseq
  • Python version: 3.8.5
  • CUDA/cuDNN version: 11.2
  • GPU models and configuration: Transformer
  • Any other relevant information: Running on 2× GeForce RTX 3090

Daisy-123 avatar Jan 26 '21 10:01 Daisy-123

Can you try running the following command:

CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=0 fairseq-train article_tw2en_1203_zhen_join/ -a transformer_wmt_en_de_big --optimizer adam -s zh -t en --label-smoothing 0.1 --dropout 0.3 --max-tokens 4000 --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 --criterion label_smoothed_cross_entropy --max-update 2100000 --warmup-updates 10000 --warmup-init-lr '1e-7' --lr '0.001' --share-all-embeddings --adam-betas '(0.9, 0.98)' --adam-eps '1e-09' --clip-norm 25.0 --keep-last-epochs 2 --update-freq 1 --num-workers 0 --save-dir checkpoint_test |& tee -a test.log

Note the CUDA_LAUNCH_BLOCKING=1 and CUDA_VISIBLE_DEVICES=0 environment variables, and the --num-workers 0 flag.
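
The reason --num-workers 0 helps is that the dataset's __getitem__ then runs in the main process, so the original exception surfaces with a direct traceback instead of being re-raised from a DataLoader worker; CUDA_LAUNCH_BLOCKING=1 likewise makes CUDA calls synchronous so any CUDA error is reported at the call that caused it. A small generic PyTorch illustration of the first point (not fairseq code; BrokenDataset is made up):

import torch
from torch.utils.data import Dataset, DataLoader

class BrokenDataset(Dataset):
    """Mimics a dataset whose index and data are out of sync."""
    def __len__(self):
        return 4

    def __getitem__(self, i):
        if i == 2:
            raise KeyError(i)      # stands in for "KeyError: 27707"
        return torch.tensor([i])

# With num_workers=0 the KeyError is raised directly in the main process,
# which gives the cleanest traceback for debugging.
loader = DataLoader(BrokenDataset(), batch_size=2, num_workers=0)
try:
    for batch in loader:
        print(batch)
except KeyError as e:
    print("caught in main process:", e)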

lematt1991 avatar Jan 26 '21 14:01 lematt1991

I tried the command, but I get the same error:

CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=0 fairseq-train article_tw2en_1203_zhen_join/ -a transformer_wmt_en_de_big --optimizer adam -s zh -t en --label-smoothing 0.1 --dropout 0.3 --max-tokens 4000 --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 --criterion label_smoothed_cross_entropy --max-update 2100000 --warmup-updates 10000 --warmup-init-lr '1e-7' --lr '0.001' --share-all-embeddings --adam-betas '(0.9, 0.98)' --adam-eps '1e-09' --clip-norm 25.0 --keep-last-epochs 2 --update-freq 1 --num-workers 0 --save-dir checkpoint_test |& tee -a test.log

Daisy-123 avatar Jan 26 '21 15:01 Daisy-123

Hello, I am facing the same issue. Below are my training params:

fairseq-train data_bin --finetune-from-model ./418M_last_checkpoint.pt --save-dir ./checkpoint --task translation_multi_simple_epoch --encoder-normalize-before --langs 'af,am,ar,ast,az,ba,be,bg,bn,br,bs,ca,ceb,cs,cy,da,de,el,en,es,et,fa,ff,fi,fr,fy,ga,gd,gl,gu,ha,he,hi,hr,ht,hu,hy,id,ig,ilo,is,it,ja,jv,ka,kk,km,kn,ko,lb,lg,ln,lo,lt,lv,mg,mk,ml,mn,mr,ms,my,ne,nl,no,ns,oc,or,pa,pl,ps,pt,ro,ru,sd,si,sk,sl,so,sq,sr,ss,su,sv,sw,ta,th,tl,tn,tr,uk,ur,uz,vi,wo,xh,yi,yo,zh,zu' --lang-pairs 'en-fr,fr-en' --max-tokens 1200 --decoder-normalize-before --sampling-method temperature --sampling-temperature 1.5 --encoder-langtok src --decoder-langtok --criterion label_smoothed_cross_entropy --label-smoothing 0.2 --optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' --lr-scheduler inverse_sqrt --lr 3e-05 --warmup-updates 2500 --max-update 40000 --dropout 0.3 --attention-dropout 0.1 --weight-decay 0.0 --update-freq 2 --save-interval 1 --save-interval-updates 5000 --keep-interval-updates 10 --no-epoch-checkpoints --seed 222 --log-format simple --log-interval 2 --patience 10 --arch transformer_wmt_en_de_big --encoder-layers 12 --decoder-layers 12 --encoder-layerdrop 0.05 --decoder-layerdrop 0.05 --share-decoder-input-output-embed --share-all-embeddings --ddp-backend no_c10d

I tried the solution proposed by @lematt1991 but I am still getting the same error.

Thank you very much in advance.

AllaeddineD avatar Jul 06 '22 10:07 AllaeddineD

In my experience, this issue is about the environment: the PyTorch version has to match your CUDA version.
Follow the install commands at https://pytorch.org/get-started/previous-versions/. Sometimes installing with pip produces other problems; in that case, you can try conda.
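
To check whether an installed PyTorch build actually supports the RTX 3090 (compute capability 8.6, which needs a CUDA 11.x build of PyTorch), a quick sanity check with standard PyTorch calls looks like this:

import torch

print("torch:", torch.__version__)            # e.g. 1.7.1
print("built for CUDA:", torch.version.cuda)  # should be 11.x for an RTX 3090
print("cuda available:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    print("capability:", torch.cuda.get_device_capability(0))  # (8, 6) on a 3090
    # A tiny GPU op confirms the build ships kernels for this architecture;
    # a mismatched build typically warns or errors here.
    x = torch.randn(2, 2, device="cuda")
    print((x @ x).sum().item())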

Daisy-123 avatar Jul 06 '22 15:07 Daisy-123