guided_summarization
ZeroDivisionError: float division by zero
I hit "ZeroDivisionError: float division by zero" when training the model with multiple GPUs. With only 1 GPU the problem disappears, but training is too slow... The detailed traceback is below; do you have any idea about it?
/home/xjw/code/guided_summarization/src/fairseq/fairseq/optim/adam.py:179: UserWarning: This overload of add_ is deprecated:
add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
add_(Tensor other, *, Number alpha) (Triggered internally at /pytorch/torch/csrc/utils/python_arg_parser.cpp:1025.)
exp_avg.mul_(beta1).add_(1 - beta1, grad)
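(Side note: the UserWarning above is just PyTorch deprecating the positional add_(Number, Tensor) overload. A minimal sketch of the old call in fairseq/optim/adam.py next to the non-deprecated form, using stand-in tensors; the arithmetic is identical:)

import torch

exp_avg, grad = torch.zeros(3), torch.ones(3)  # stand-ins for the Adam state
beta1 = 0.9

# Deprecated overload the warning flags: add_(Number alpha, Tensor other)
exp_avg.mul_(beta1).add_(1 - beta1, grad)

# Non-deprecated equivalent: add_(Tensor other, *, Number alpha)
exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)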
Traceback (most recent call last):
  File "/home/xjw/miniconda3/envs/gsum/bin/fairseq-train", line 33, in <module>

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/xjw/miniconda3/envs/gsum/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/xjw/code/guided_summarization/src/fairseq/fairseq_cli/train.py", line 283, in distributed_main
    main(args, init_distributed=True)
  File "/home/xjw/code/guided_summarization/src/fairseq/fairseq_cli/train.py", line 102, in main
    train(args, trainer, task, epoch_itr)
  File "/home/xjw/miniconda3/envs/gsum/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/home/xjw/code/guided_summarization/src/fairseq/fairseq_cli/train.py", line 178, in train
    log_output = trainer.train_step(samples)
  File "/home/xjw/miniconda3/envs/gsum/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/home/xjw/code/guided_summarization/src/fairseq/fairseq/trainer.py", line 391, in train_step
    logging_output = self._reduce_and_log_stats(logging_outputs, sample_size)
  File "/home/xjw/code/guided_summarization/src/fairseq/fairseq/trainer.py", line 718, in _reduce_and_log_stats
    self.task.reduce_metrics(logging_outputs, self.get_criterion())
  File "/home/xjw/code/guided_summarization/src/fairseq/fairseq/tasks/guided_translation.py", line 307, in reduce_metrics
    super().reduce_metrics(logging_outputs, criterion)
  File "/home/xjw/code/guided_summarization/src/fairseq/fairseq/tasks/fairseq_task.py", line 406, in reduce_metrics
    criterion.__class__.reduce_metrics(logging_outputs)
  File "/home/xjw/code/guided_summarization/src/fairseq/fairseq/criterions/label_smoothed_cross_entropy.py", line 95, in reduce_metrics
    metrics.log_scalar('loss', loss_sum / sample_size / math.log(2), sample_size, round=3)
ZeroDivisionError: float division by zero
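For context, the failing line in label_smoothed_cross_entropy.py divides the summed loss by the summed sample size across workers. Roughly (a simplified sketch of the aggregation, not fairseq's exact code):

import math

def reduce_metrics_sketch(logging_outputs):
    # Each distributed worker contributes one logging dict; fairseq sums them.
    loss_sum = sum(log.get('loss', 0) for log in logging_outputs)
    sample_size = sum(log.get('sample_size', 0) for log in logging_outputs)
    # The crash: if every worker reports sample_size == 0 (e.g. all samples
    # skipped, or the logging dicts mis-aligned across processes), this
    # divides by zero.
    return loss_sum / sample_size / math.log(2)

So the real bug is that sample_size sums to 0 under multi-GPU, not the division itself.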
Hi, could you pip install torch==1.4.0 and try again? I suspect this is a version mismatch problem.
Thanks for the reply! I installed torch==1.4.0 with pip, but the program got stuck at the very beginning. It seems that none of the 4 GPUs were being used correctly. My CUDA version is 11.1.
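As a sanity check, you can confirm which CUDA build PyTorch sees (standard PyTorch calls, nothing project-specific):

import torch

print(torch.__version__)          # installed PyTorch version, e.g. 1.4.0
print(torch.version.cuda)         # CUDA version this PyTorch build targets
print(torch.cuda.is_available())  # False on a build/driver mismatch
print(torch.cuda.device_count())  # should report every visible GPU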
Hi, what is your training command?
My training command is below. What is your CUDA version when using PyTorch 1.4.0? I found that CUDA 11.1 is not supported by PyTorch 1.4.0.
TOTAL_NUM_UPDATES=80000
WARMUP_UPDATES=500
LR=3e-05
MAX_TOKENS=1024
DEVICES=3,4,5,6
CUDA_VISIBLE_DEVICES=$DEVICES python train.py $DATA_BIN \
    --restore-file $BART_PATH \
    --max-tokens $MAX_TOKENS \
    --max-sentences 1 \
    --task guided_translation \
    --source-lang source --target-lang target \
    --truncate-source \
    --layernorm-embedding \
    --share-all-embeddings \
    --share-decoder-input-output-embed \
    --reset-optimizer --reset-dataloader --reset-meters \
    --required-batch-size-multiple 1 \
    --arch guided_bart_large \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.01 --optimizer adam --adam-betas "(0.9, 0.999)" --adam-eps 1e-08 \
    --clip-norm 0.1 \
    --lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
    --fp16 --update-freq $UPDATE_FREQ \
    --max-epoch 10 \
    --skip-invalid-size-inputs-valid-test \
    --ddp-backend=no_c10d \
    --save-dir $SAVE_DIR \
    --save-interval-updates 2500 \
    --find-unused-parameters;
Maybe you can try removing the option --max-sentences 1?
Is this resolved? I met the same issue, and I didn't set --max-sentences to 1. Somehow the values in logging_outputs are shifted, and the value for sample_size becomes 0.
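If it helps anyone debug this, here is a hypothetical hook (not part of fairseq; the placement is an assumption) to dump each worker's logging dict just before the division, e.g. inside Trainer._reduce_and_log_stats in fairseq/trainer.py:

# Hypothetical debugging hook: inspect the per-worker logging dicts
# before reduce_metrics divides by sample_size.
for i, log in enumerate(logging_outputs):
    print(f"worker log {i}: loss={log.get('loss')}, "
          f"sample_size={log.get('sample_size')}")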
Is this resolved? I met the same issue: ZeroDivisionError: float division by zero.
Have you found any solution for this issue? Thanks.