
ZeroDivisionError: float division by zero

Open JaniceXiong opened this issue 3 years ago • 9 comments

I hit "ZeroDivisionError: float division by zero" when training the model with multiple GPUs. With only 1 GPU the problem disappears, but training is too slow... The detailed traceback is below; do you have any idea what causes it?

```
/home/xjw/code/guided_summarization/src/fairseq/fairseq/optim/adam.py:179: UserWarning: This overload of add_ is deprecated:
        add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
        add_(Tensor other, *, Number alpha) (Triggered internally at /pytorch/torch/csrc/utils/python_arg_parser.cpp:1025.)
  exp_avg.mul_(beta1).add_(1 - beta1, grad)
Traceback (most recent call last):
  File "/home/xjw/miniconda3/envs/gsum/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/home/xjw/code/guided_summarization/src/fairseq/fairseq_cli/train.py", line 316, in cli_main
    nprocs=args.distributed_world_size,
  File "/home/xjw/miniconda3/envs/gsum/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/xjw/miniconda3/envs/gsum/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/xjw/miniconda3/envs/gsum/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/xjw/miniconda3/envs/gsum/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/xjw/code/guided_summarization/src/fairseq/fairseq_cli/train.py", line 283, in distributed_main
    main(args, init_distributed=True)
  File "/home/xjw/code/guided_summarization/src/fairseq/fairseq_cli/train.py", line 102, in main
    train(args, trainer, task, epoch_itr)
  File "/home/xjw/miniconda3/envs/gsum/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/home/xjw/code/guided_summarization/src/fairseq/fairseq_cli/train.py", line 178, in train
    log_output = trainer.train_step(samples)
  File "/home/xjw/miniconda3/envs/gsum/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/home/xjw/code/guided_summarization/src/fairseq/fairseq/trainer.py", line 391, in train_step
    logging_output = self._reduce_and_log_stats(logging_outputs, sample_size)
  File "/home/xjw/code/guided_summarization/src/fairseq/fairseq/trainer.py", line 718, in _reduce_and_log_stats
    self.task.reduce_metrics(logging_outputs, self.get_criterion())
  File "/home/xjw/code/guided_summarization/src/fairseq/fairseq/tasks/guided_translation.py", line 307, in reduce_metrics
    super().reduce_metrics(logging_outputs, criterion)
  File "/home/xjw/code/guided_summarization/src/fairseq/fairseq/tasks/fairseq_task.py", line 406, in reduce_metrics
    criterion.__class__.reduce_metrics(logging_outputs)
  File "/home/xjw/code/guided_summarization/src/fairseq/fairseq/criterions/label_smoothed_cross_entropy.py", line 95, in reduce_metrics
    metrics.log_scalar('loss', loss_sum / sample_size / math.log(2), sample_size, round=3)
ZeroDivisionError: float division by zero
```

JaniceXiong avatar Sep 28 '21 13:09 JaniceXiong
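The crash comes from the last frame: `loss_sum / sample_size / math.log(2)` in `label_smoothed_cross_entropy.py` divides by zero when the aggregated `sample_size` is 0. A minimal sketch of a defensive guard around that expression (`safe_loss_metric` is a hypothetical helper for illustration, not fairseq API):

```python
import math

def safe_loss_metric(loss_sum, sample_size):
    # Mirrors the failing expression in
    # label_smoothed_cross_entropy.reduce_metrics:
    #   loss_sum / sample_size / math.log(2)
    # but returns None instead of raising when sample_size is 0
    # (e.g. when a worker received only skipped/empty batches).
    if sample_size == 0:
        return None
    return loss_sum / sample_size / math.log(2)
```

A guard like this hides the symptom but not the cause: the real question is why `sample_size` ends up 0 on one of the workers.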

Hi, could you pip install torch==1.4.0 and try again? I suspect this is a version mismatch problem.

zdou0830 avatar Sep 28 '21 16:09 zdou0830

> pip install torch==1.4.0

Thanks for the reply! I ran pip install torch==1.4.0, but the program got stuck at the very beginning. It seems that none of the 4 GPUs were being used correctly.


JaniceXiong avatar Sep 29 '21 08:09 JaniceXiong

And my CUDA version is 11.1.

JaniceXiong avatar Sep 29 '21 08:09 JaniceXiong
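This is the likely mismatch: official PyTorch 1.4.0 binaries were only built against older CUDA toolkits, while CUDA 11.1 builds appeared in much later releases. A small sketch of that compatibility check; the table below is illustrative only (an assumption based on the release archives), so consult pytorch.org's "previous versions" page for the authoritative matrix:

```python
# Illustrative (not exhaustive) subset of the CUDA versions that official
# PyTorch binary releases were built against. Assumption: reconstructed from
# the release archives; verify against pytorch.org before relying on it.
TORCH_CUDA_BUILDS = {
    "1.4.0": {"9.2", "10.0", "10.1"},
    "1.8.0": {"10.1", "10.2", "11.1"},
}

def binary_available(torch_version: str, cuda_version: str) -> bool:
    """True if an official binary of torch_version was built for cuda_version."""
    return cuda_version in TORCH_CUDA_BUILDS.get(torch_version, set())
```

Under this table, `binary_available("1.4.0", "11.1")` is false, which would explain the hang: torch 1.4.0 cannot drive a CUDA 11.1 runtime.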

Hi, what is your training command?

zdou0830 avatar Sep 29 '21 15:09 zdou0830

> Hi, what is your training command?

The training command is below. What CUDA version did you use with PyTorch 1.4.0? I found that CUDA 11.1 is not supported by PyTorch 1.4.0.

```sh
TOTAL_NUM_UPDATES=80000
WARMUP_UPDATES=500
LR=3e-05
MAX_TOKENS=1024
DEVICES=3,4,5,6

CUDA_VISIBLE_DEVICES=$DEVICES python train.py $DATA_BIN \
    --restore-file $BART_PATH \
    --max-tokens $MAX_TOKENS \
    --max-sentences 1 \
    --task guided_translation \
    --source-lang source --target-lang target \
    --truncate-source \
    --layernorm-embedding \
    --share-all-embeddings \
    --share-decoder-input-output-embed \
    --reset-optimizer --reset-dataloader --reset-meters \
    --required-batch-size-multiple 1 \
    --arch guided_bart_large \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.01 --optimizer adam --adam-betas "(0.9, 0.999)" --adam-eps 1e-08 \
    --clip-norm 0.1 \
    --lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
    --fp16 --update-freq $UPDATE_FREQ \
    --max-epoch 10 \
    --skip-invalid-size-inputs-valid-test \
    --ddp-backend=no_c10d \
    --save-dir $SAVE_DIR \
    --save-interval-updates 2500 \
    --find-unused-parameters;
```

JaniceXiong avatar Sep 30 '21 08:09 JaniceXiong

Maybe you can try removing the option `--max-sentences 1`?

zdou0830 avatar Sep 30 '21 17:09 zdou0830

Is this resolved? I hit the same issue, and I didn't set --max-sentences to 1. Somehow the values in logging_outputs are shifted and the value for sample_size becomes 0.

liaimi avatar Dec 09 '21 02:12 liaimi
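fairseq sums the per-worker `logging_outputs` dicts before computing metrics, so the division fails whenever the summed `sample_size` is 0, e.g. when every sample in a step was dropped by `--skip-invalid-size-inputs-valid-test`. A quick sketch of that aggregation, assuming `logging_outputs` is a list of dicts as in fairseq (the helper name is hypothetical):

```python
def total_sample_size(logging_outputs):
    # Sum the per-worker 'sample_size' entries the way fairseq's
    # reduce_metrics does. A total of 0 means every worker produced an
    # empty or skipped batch, and loss_sum / sample_size will raise
    # ZeroDivisionError. Missing keys (shifted outputs) count as 0 too,
    # which matches the symptom described above.
    return sum(log.get("sample_size", 0) for log in logging_outputs)
```

Printing this total (or the raw `logging_outputs`) just before `reduce_metrics` should show whether the batches are empty or the dict keys are genuinely shifted.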

Is this resolved? I met the same issue, ZeroDivisionError: float division by zero

ARDUJS avatar Jan 05 '22 04:01 ARDUJS

> Is this resolved? I met the same issue, ZeroDivisionError: float division by zero

Have you found any solution for this issue? Thanks.

nargesdel avatar Jan 16 '23 19:01 nargesdel