How to run validation when training an NMT model using srun with multiple GPUs

Open xqun3 opened this issue 2 years ago • 0 comments

I fine-tuned DeltaLM using srun with multiple GPUs. The training script, modified from the demo, is shown below:

python train.py $data_bin \
    --distributed-port 12345 \
    --no-save --disable-validation \
    --save-dir $save_dir \
    --arch deltalm_base \
    --pretrained-deltalm-checkpoint $PRETRAINED_MODEL \
    --share-all-embeddings \
    --max-source-positions 512 --max-target-positions 512 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr-scheduler inverse_sqrt \
    --lr 1e-4 \
    --warmup-init-lr 1e-07 \
    --stop-min-lr 1e-09 \
    --warmup-updates 4000 \
    --max-update 400000 \
    --max-epoch 100 \
    --max-tokens 1024 \
    --update-freq 1 \
    --seed 1 \
    --log-format simple \
    --skip-invalid-size-inputs-valid-test \
    --fp16 \
    --tensorboard-logdir logs2 \
    --eval-bleu \
    --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe=sentencepiece \
    --eval-bleu-print-samples \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric

However, it crashes during the validation step:

Traceback (most recent call last):
  File "/XXX/software/anaconda3/envs/common/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/XXX/project/translate/unilm/deltalm/fairseq/fairseq/distributed/utils.py", line 328, in distributed_main
    main(cfg, **kwargs)
  File "/XXX/project/translate/unilm/deltalm/fairseq/fairseq_cli/train.py", line 180, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/XXX/software/anaconda3/envs/common/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/XXX/project/translate/unilm/deltalm/fairseq/fairseq_cli/train.py", line 305, in train
    valid_losses, should_stop = validate_and_save(
  File "/XXX/project/translate/unilm/deltalm/fairseq/fairseq_cli/train.py", line 392, in validate_and_save
    valid_losses = validate(cfg, trainer, task, epoch_itr, valid_subsets)
  File "/XXX/project/translate/unilm/deltalm/fairseq/fairseq_cli/train.py", line 462, in validate
    trainer.valid_step(sample)
  File "/XXX/software/anaconda3/envs/common/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/XXX/project/translate/unilm/deltalm/fairseq/fairseq/trainer.py", line 1082, in valid_step
    logging_output = self._reduce_and_log_stats(logging_outputs, sample_size)
  File "/XXX/project/translate/unilm/deltalm/fairseq/fairseq/trainer.py", line 1452, in _reduce_and_log_stats
    logging_output = agg.get_smoothed_values()
  File "/XXX/project/translate/unilm/deltalm/fairseq/fairseq/logging/meters.py", line 302, in get_smoothed_values
    [
  File "/XXX/project/translate/unilm/deltalm/fairseq/fairseq/logging/meters.py", line 303, in <listcomp>
    (key, self.get_smoothed_value(key))
  File "/XXX/project/translate/unilm/deltalm/fairseq/fairseq/logging/meters.py", line 295, in get_smoothed_value
    return meter.fn(self)
  File "/XXX/project/translate/unilm/deltalm/fairseq/fairseq/tasks/translation.py", line 434, in compute_bleu
    bleu = comp_bleu(
  File "/XXX/.local/lib/python3.8/site-packages/sacrebleu/metrics/bleu.py", line 282, in compute_bleu
    return BLEUScore(score, correct, total, precisions, bp, sys_len, ref_len)
  File "/XXX/.local/lib/python3.8/site-packages/sacrebleu/metrics/bleu.py", line 103, in __init__
    self._verbose += f"ratio = {self.ratio:.3f} hyp_len = {self.sys_len:d} "
  File "/XXX/software/anaconda3/envs/common/lib/python3.8/site-packages/torch/_tensor.py", line 560, in __format__
    return self.item().__format__(format_spec)
ValueError: Unknown format code 'd' for object of type 'float'

/XXX/software/anaconda3/envs/common/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 16 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
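
Reading the traceback, the failure seems to come from the `{self.sys_len:d}` f-string in sacrebleu receiving a float value (here via a tensor whose `.item()` is a float) rather than an integer. The sketch below reproduces that formatting error in plain Python and shows one possible workaround, casting the count to `int` before it is formatted. The names `format_hyp_len` and `to_int_count` are hypothetical and are not part of fairseq or sacrebleu, and the idea that the length becomes a float during multi-GPU aggregation is an unverified assumption.

```python
# Pure-Python sketch of the formatting failure; torch and sacrebleu are not needed.
# Assumption: the hypothesis length reaches the f-string as a float, not an int.


def format_hyp_len(sys_len):
    """Mimic the f-string from the traceback; the 'd' code accepts integers only."""
    return f"hyp_len = {sys_len:d}"


def to_int_count(value):
    """Hypothetical helper: coerce an aggregated count to a plain Python int."""
    return int(round(float(value)))


# A token count that ended up as a float (assumed to happen during aggregation).
aggregated_sys_len = 1234.0

error_message = None
try:
    format_hyp_len(aggregated_sys_len)
except ValueError as err:
    # Same message as the last line of the traceback.
    error_message = str(err)
    print(error_message)

# Casting to int first lets the same format string succeed.
fixed = format_hyp_len(to_int_count(aggregated_sys_len))
print(fixed)
```

If this reading is right, applying a similar `int(...)` cast to the length values in `compute_bleu` in `fairseq/tasks/translation.py` before they are passed to sacrebleu might avoid the crash, but I have not confirmed this.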

My environment:

  • fairseq Version (e.g., 1.0 or main): 1.0.0a0+e3fafbd
  • PyTorch Version (e.g., 1.0): 1.9.0+cu111
  • OS (e.g., Linux): Linux
  • How you installed fairseq (pip, source): pip
  • Python version: 3.8.13
  • CUDA/cuDNN version: 11.10
  • GPU models and configuration: deltalm

xqun3, Jul 15 '22 06:07