fairseq
How to run validation when training an NMT model with srun on multiple GPUs
I fine-tuned DeltaLM using srun with multiple GPUs. The training script, modified from the demo, is shown below:
python train.py $data_bin \
--distributed-port 12345 \
--no-save --disable-validation \
--save-dir $save_dir \
--arch deltalm_base \
--pretrained-deltalm-checkpoint $PRETRAINED_MODEL \
--share-all-embeddings \
--max-source-positions 512 --max-target-positions 512 \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 \
--optimizer adam --adam-betas '(0.9, 0.98)' \
--lr-scheduler inverse_sqrt \
--lr 1e-4 \
--warmup-init-lr 1e-07 \
--stop-min-lr 1e-09 \
--warmup-updates 4000 \
--max-update 400000 \
--max-epoch 100 \
--max-tokens 1024 \
--update-freq 1 \
--seed 1 \
--log-format simple \
--skip-invalid-size-inputs-valid-test \
--fp16 \
--tensorboard-logdir logs2 \
--eval-bleu \
--eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
--eval-bleu-detok moses \
--eval-bleu-remove-bpe=sentencepiece \
--eval-bleu-print-samples \
--best-checkpoint-metric bleu --maximize-best-checkpoint-metric
However, it crashes during the validation step:
Traceback (most recent call last):
File "/XXX/software/anaconda3/envs/common/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/XXX/project/translate/unilm/deltalm/fairseq/fairseq/distributed/utils.py", line 328, in distributed_main
main(cfg, **kwargs)
File "/XXX/project/translate/unilm/deltalm/fairseq/fairseq_cli/train.py", line 180, in main
valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
File "/XXX/software/anaconda3/envs/common/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "/XXX/project/translate/unilm/deltalm/fairseq/fairseq_cli/train.py", line 305, in train
valid_losses, should_stop = validate_and_save(
File "/XXX/project/translate/unilm/deltalm/fairseq/fairseq_cli/train.py", line 392, in validate_and_save
valid_losses = validate(cfg, trainer, task, epoch_itr, valid_subsets)
File "/XXX/project/translate/unilm/deltalm/fairseq/fairseq_cli/train.py", line 462, in validate
trainer.valid_step(sample)
File "/XXX/software/anaconda3/envs/common/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "/XXX/project/translate/unilm/deltalm/fairseq/fairseq/trainer.py", line 1082, in valid_step
logging_output = self._reduce_and_log_stats(logging_outputs, sample_size)
File "/XXX/project/translate/unilm/deltalm/fairseq/fairseq/trainer.py", line 1452, in _reduce_and_log_stats
logging_output = agg.get_smoothed_values()
File "/XXX/project/translate/unilm/deltalm/fairseq/fairseq/logging/meters.py", line 302, in get_smoothed_values
[
File "/XXX/project/translate/unilm/deltalm/fairseq/fairseq/logging/meters.py", line 303, in <listcomp>
(key, self.get_smoothed_value(key))
File "/XXX/project/translate/unilm/deltalm/fairseq/fairseq/logging/meters.py", line 295, in get_smoothed_value
return meter.fn(self)
File "/XXX/project/translate/unilm/deltalm/fairseq/fairseq/tasks/translation.py", line 434, in compute_bleu
bleu = comp_bleu(
File "/XXX/.local/lib/python3.8/site-packages/sacrebleu/metrics/bleu.py", line 282, in compute_bleu
return BLEUScore(score, correct, total, precisions, bp, sys_len, ref_len)
File "/XXX/.local/lib/python3.8/site-packages/sacrebleu/metrics/bleu.py", line 103, in __init__
self._verbose += f"ratio = {self.ratio:.3f} hyp_len = {self.sys_len:d} "
File "/XXX/software/anaconda3/envs/common/lib/python3.8/site-packages/torch/_tensor.py", line 560, in __format__
return self.item().__format__(format_spec)
ValueError: Unknown format code 'd' for object of type 'float'
/XXX/software/anaconda3/envs/common/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 16 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
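The last frames of the traceback suggest that sacrebleu formats `sys_len` with the integer format code `d`, and that the value it receives is a tensor whose `.item()` is a float. The sketch below reproduces the same `ValueError` in plain Python, without fairseq, sacrebleu, or torch. The helpers `format_hyp_len` and `try_format` are hypothetical names used only for illustration.

```python
def format_hyp_len(sys_len):
    """Mimic the f-string from the traceback; `:d` accepts only integers."""
    return f"hyp_len = {sys_len:d}"


def try_format(sys_len):
    """Return the formatted string, or the error message if formatting fails."""
    try:
        return format_hyp_len(sys_len)
    except ValueError as err:
        return f"ValueError: {err}"


# A float length, standing in for what a float tensor's .item() returns.
print(try_format(1024.0))

# Casting to int first makes the same format string valid.
print(try_format(int(1024.0)))
```

If this is indeed the cause, casting the length statistics to `int` before they are passed to sacrebleu may be a workaround worth trying. That is an assumption based on the traceback alone and has not been verified against this setup.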
My environment:
- fairseq Version (e.g., 1.0 or main): 1.0.0a0+e3fafbd
- PyTorch Version (e.g., 1.0): 1.9.0+cu111
- OS (e.g., Linux): Linux
- How you installed fairseq (pip, source): pip
- Python version: 3.8.13
- CUDA/cuDNN version: 11.10
- GPU models and configuration: deltalm