fairseq
Hydra error in fairseq-generate cli (task: translation_multi_simple_epoch)
🐛 Bug
When running fairseq-generate for task translation_multi_simple_epoch, I get the hydra config error: hydra.errors.ConfigCompositionException: Error merging override distributed_training.ddp_backend='simple'
To Reproduce
Steps to reproduce the behavior (always include the command you ran):
Run command:
fairseq-generate \
data_bin \
--batch-size 1 \
--path 12b_last_chk_4_gpus \
--fixed-dictionary model_dict.128k.txt \
-s de -t fr \
--remove-bpe 'sentencepiece' \
--beam 5 \
--task translation_multi_simple_epoch \
--lang-pairs language_pairs.txt \
--decoder-langtok --encoder-langtok src \
--gen-subset test \
--fp16 \
--dataset-impl mmap \
--distributed-world-size 1 --distributed-no-spawn \
--pipeline-model-parallel \
--pipeline-chunks 1 \
--pipeline-encoder-balance '[1,15,10]' \
--pipeline-encoder-devices '[0,1,0]' \
--pipeline-decoder-balance '[3,11,11,1]' \
--pipeline-decoder-devices '[0,2,3,0]' > gen_out
Error:
Traceback (most recent call last):
  File "/raid/user-data/fbraune/anaconda3/envs/m2m/lib/python3.8/site-packages/hydra/_internal/config_loader_impl.py", line 513, in _apply_overrides_to_config
    OmegaConf.update(cfg, key, value, merge=True)
  File "/raid/user-data/fbraune/anaconda3/envs/m2m/lib/python3.8/site-packages/omegaconf/omegaconf.py", line 613, in update
    root.__setattr__(last_key, value)
  File "/raid/user-data/fbraune/anaconda3/envs/m2m/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 278, in __setattr__
    self._format_and_raise(key=key, value=value, cause=e)
  File "/raid/user-data/fbraune/anaconda3/envs/m2m/lib/python3.8/site-packages/omegaconf/base.py", line 95, in _format_and_raise
    format_and_raise(
  File "/raid/user-data/fbraune/anaconda3/envs/m2m/lib/python3.8/site-packages/omegaconf/_utils.py", line 694, in format_and_raise
    _raise(ex, cause)
  File "/raid/user-data/fbraune/anaconda3/envs/m2m/lib/python3.8/site-packages/omegaconf/_utils.py", line 610, in _raise
    raise ex  # set env OC_CAUSE=1 for full backtrace
omegaconf.errors.ValidationError: Invalid value 'simple', expected one of [c10d, no_c10d]
    full_key: distributed_training.ddp_backend
    reference_type=DistributedTrainingConfig
    object_type=DistributedTrainingConfig
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/raid/user-data/fbraune/anaconda3/envs/m2m/bin/fairseq-generate", line 33, in
Expected behavior
Translation generation
Environment
- fairseq Version : '1.0.0a0+108f720'
- PyTorch Version: 1.7
- OS: Linux
- How you installed fairseq: pip
- Build command you used : cloned current fairseq repo and then pip install --editable ./
- Python version: 3.8.5
- CUDA/cuDNN version: 10.1
- GPU models and configuration: used pre-trained model 12b_last_chk_4_gpus
- Any other relevant information:
Why is ddp_backend set to "simple" in that checkpoint? That is not one of the values we support (we only support c10d and no_c10d). Where did the checkpoint you are loading come from?
To work around this particular error, you can add "simple" as an option in fairseq/dataclass/constants.py in your checked-out copy (I can't guarantee it won't crash elsewhere later), or update the checkpoint and replace ddp_backend with one of those two values.
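The "update the checkpoint" route can be sketched roughly as follows. This is a minimal sketch: `patch_ddp_backend` is a hypothetical helper, the assumption that the config sits under a `"cfg"` or `"args"` key should be verified against your actual .pt file, and a real run would go through `torch.load`/`torch.save` rather than the dummy dicts below.

```python
# Hedged sketch: rewrite an unsupported ddp_backend value inside a
# checkpoint dict.  Assumes the training config is stored under "cfg"
# (newer fairseq) or "args" (older fairseq) -- check your actual file.
from types import SimpleNamespace

def patch_ddp_backend(state, backend="c10d"):
    """Replace ddp_backend in a loaded checkpoint dict, wherever it lives."""
    cfg = state.get("cfg")
    if cfg is not None and "distributed_training" in cfg:
        cfg["distributed_training"]["ddp_backend"] = backend
    args = state.get("args")
    if args is not None and hasattr(args, "ddp_backend"):
        args.ddp_backend = backend
    return state

# Dummy stand-ins for a checkpoint; a real one comes from torch.load(path):
new_style = {"cfg": {"distributed_training": {"ddp_backend": "simple"}}}
old_style = {"args": SimpleNamespace(ddp_backend="simple")}
print(patch_ddp_backend(new_style)["cfg"]["distributed_training"]["ddp_backend"])  # c10d
print(patch_ddp_backend(old_style)["args"].ddp_backend)  # c10d
```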
@shruti-bh do you know why ddp_backend is set to this value?
I downloaded it from here: https://dl.fbaipublicfiles.com/m2m_100/12b_last_chk_4_gpus.pt That was on the 27th of October, though.
I downloaded the model again but get the same error.
Same with https://dl.fbaipublicfiles.com/m2m_100/12b_last_chk_2_gpus.pt
Adding "simple" to the ChoiceEnum DDP_BACKEND_CHOICES in fairseq/dataclass/constants.py works as such, but just gets me to the next error:
hydra.errors.ConfigCompositionException: Error merging override distributed_training.pipeline_balance=[29, 22, 1]
@laeubli Sorry about this issue. Can you try adding these overrides to the generate.py command? A more long-term fix will be coming up soon.
--model-overrides '{"ddp_backend": "c10d", "pipeline_balance": "1, 15, 13, 11, 11, 1" , "pipeline_devices": "0, 1, 0, 2, 3, 0" }'
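For what it's worth, the override string above is a Python dict literal. A rough sketch of how it can be parsed and layered over a checkpoint's stored settings (the merge below is a simplified stand-in for what fairseq's checkpoint loading does with arg_overrides, not the actual implementation):

```python
# Hedged sketch: turn a --model-overrides string into a dict and layer it
# over the checkpoint's stored args.  ast.literal_eval safely evaluates
# the dict literal without executing arbitrary code.
import ast

def parse_overrides(override_str):
    overrides = ast.literal_eval(override_str)
    if not isinstance(overrides, dict):
        raise ValueError("--model-overrides must evaluate to a dict")
    return overrides

# Hypothetical checkpoint settings, with the problematic value:
checkpoint_args = {"ddp_backend": "simple", "pipeline_balance": None}
checkpoint_args.update(parse_overrides(
    '{"ddp_backend": "c10d", "pipeline_balance": "1, 15, 13, 11, 11, 1"}'
))
print(checkpoint_args["ddp_backend"])  # c10d
```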
Thanks @shruti-bh! Are those params specific to the 2 GPU version? I've tested it with the 8 GPU version for now, and ended up with:
RuntimeError: Error(s) in loading state_dict for PipelineParallelTransformerModel:
Missing key(s) in state_dict: "model.partitions.1.7.self_attn.k_proj.weight", [...]
It certainly got closer to generation than before though (VRAM actually started filling up).
@laeubli - these parameters are for the 4 GPU version, since that is what you were using in your first Run Command
Thanks @shruti-bh, I am getting another error. I am running on 4 GPUs (6, 5, 4, 3). Run command:
fairseq-generate \
data_bin \
--batch-size 1 \
--path 12b_last_chk_4_gpus \
--fixed-dictionary model_dict.128k.txt \
-s de -t fr \
--remove-bpe 'sentencepiece' \
--beam 5 \
--task translation_multi_simple_epoch \
--lang-pairs language_pairs.txt \
--decoder-langtok --encoder-langtok src \
--gen-subset test \
--fp16 \
--model-overrides '{"ddp_backend": "c10d", "pipeline_balance": "1, 15, 13, 11, 11, 1" , "pipeline_devices": "6, 5, 6, 4, 3, 6" }' \
--dataset-impl mmap \
--distributed-world-size 1 --distributed-no-spawn \
--pipeline-model-parallel \
--pipeline-chunks 1 \
--pipeline-encoder-balance '[1,15,10]' \
--pipeline-encoder-devices '[6,5,6]' \
--pipeline-decoder-balance '[3,11,11,1]' \
--pipeline-decoder-devices '[6,4,3,6]' > gen_out
Error:
Traceback (most recent call last):
File "/raid/user-data/fbraune/anaconda3/envs/m2m/bin/fairseq-generate", line 33, in
Is device 6 your default cuda device? If not, I suggest replacing "0" from my "pipeline-devices" with your default cuda device.
@shruti-bh Thank you so much! It is working now. Just changed the visible devices to my gpus and ran the command as you suggested. Run command:
`CUDA_VISIBLE_DEVICES=3,4,5,6 fairseq-generate \
data_bin \
--batch-size 1 \
--path 12b_last_chk_4_gpus \
--fixed-dictionary model_dict.128k.txt \
-s de -t fr \
--remove-bpe 'sentencepiece' \
--beam 5 \
--task translation_multi_simple_epoch \
--lang-pairs language_pairs.txt \
--decoder-langtok --encoder-langtok src \
--gen-subset test \
--fp16 \
--model-overrides '{"ddp_backend": "c10d", "pipeline_balance": "1, 15, 13, 11, 11, 1" , "pipeline_devices": "0, 1, 0, 2, 3, 0" }' \
--dataset-impl mmap \
--distributed-world-size 1 --distributed-no-spawn \
--pipeline-model-parallel \
--pipeline-chunks 1 \
--pipeline-encoder-balance '[1,15,10]' \
--pipeline-encoder-devices '[0,1,0]' \
--pipeline-decoder-balance '[3,11,11,1]' \
--pipeline-decoder-devices '[0,2,3,0]'`
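A note on why the original pipeline_devices "0, 1, 0, 2, 3, 0" work here: CUDA_VISIBLE_DEVICES=3,4,5,6 renumbers the visible GPUs, so logical cuda:0 is physical GPU 3, cuda:1 is GPU 4, and so on. A small sketch of that mapping (`logical_to_physical` is only an illustration, not a fairseq function):

```python
# Hedged sketch of CUDA's device renumbering under CUDA_VISIBLE_DEVICES:
# the visible physical ids are re-indexed from 0 in the order listed.
def logical_to_physical(visible, logical_ids):
    """Map logical cuda ids to physical GPU ids given CUDA_VISIBLE_DEVICES."""
    physical = [int(x) for x in visible.split(",")]
    return [physical[i] for i in logical_ids]

# The pipeline_devices from the command above, interpreted as logical ids:
print(logical_to_physical("3,4,5,6", [0, 1, 0, 2, 3, 0]))  # [3, 4, 3, 5, 6, 3]
```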
What parameters should I fill in for model-overrides with 12b_last_chk_8_gpus?
@shruti-bh What parameters for model-overrides with configuration 12b_last_chk_2_gpus?
Where can I find the specific parameter for 12b_last_chk_6_gpus or 12b_last_chk_8_gpus?
https://fairseq.readthedocs.io/en/latest/command_line_tools.html
Tried the following as model-overrides for configuration 12b_last_chk_2_gpus: --model-overrides '{"ddp_backend": "c10d", "pipeline_balance": "29, 22, 1" , "pipeline_devices": "0, 1, 0" }'
The process gets killed. No luck yet.
Hi @caijie1990, @tongmeihan1995, @shruti-bh, did you find the override parameters for 12b_last_chk_8_gpus? Have any of you successfully managed to run this configuration? My process gets killed due to OOM.
Thanks. PS: I use the provided generate command with the [...] parameters given for 8× 8GB GPUs.
Hi @shruti-bh, I am still getting the error hydra.errors.ConfigCompositionException: Error merging override distributed_training.ddp_backend='simple'. Can you please explain how I can solve it? Thanks.
After some trial and error, I've found the specific parameters for the 6 GPU model to be as follows:
--model-overrides '{"ddp_backend": "c10d", "pipeline_balance": "1, 9, 9, 10, 7, 7, 8, 1" , "pipeline_devices": "0, 1, 2, 0, 3, 4, 5, 0" }'
Based on that, if I had to guess, the parameters for the 8 GPU model might be the following (although I have no way of testing this):
--model-overrides '{"ddp_backend": "c10d", "pipeline_balance": "1, 6, 6, 6, 8, 6, 6, 6, 6, 1" , "pipeline_devices": "0, 4, 5, 1, 0, 2, 6, 7, 3, 0" }'
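One way to sanity-check a guessed configuration: every balance posted in this thread ("1, 15, 13, 11, 11, 1", "29, 22, 1", and the 6-GPU one above) sums to 52 partitions, and the devices list must pair one device per partition. A quick hypothetical checker (the total of 52 is inferred from the posted configs, not from official documentation):

```python
# Hedged sanity check for guessed pipeline configs: per-partition counts
# should sum to the model's total (52 for every balance in this thread)
# and there must be exactly one device entry per partition.
def check_pipeline_config(balance, devices, total=52):
    assert sum(balance) == total, f"balance sums to {sum(balance)}, expected {total}"
    assert len(balance) == len(devices), "need one device per partition"
    return True

# The 6-GPU parameters found above:
check_pipeline_config([1, 9, 9, 10, 7, 7, 8, 1], [0, 1, 2, 0, 3, 4, 5, 0])
# ...and the guessed 8-GPU ones:
check_pipeline_config([1, 6, 6, 6, 8, 6, 6, 6, 6, 1], [0, 4, 5, 1, 0, 2, 6, 7, 3, 0])
```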
Trying to generate with 4× RTX 3090:
fairseq-generate \
bin \
--batch-size 1 \
--path 12b_last_chk_4_gpus.pt \
--fixed-dictionary model_dict.128k.txt \
-s en -t zh \
--remove-bpe 'sentencepiece' \
--beam 5 \
--task translation_multi_simple_epoch \
--lang-pairs language_pairs.txt \
--decoder-langtok --encoder-langtok src \
--gen-subset test \
--fp16 \
--dataset-impl mmap \
--model-overrides '{"ddp_backend": "c10d", "pipeline_balance": "1, 15, 13, 11, 11, 1" , "pipeline_devices": "0, 1, 0, 2, 3, 0" }' \
--distributed-world-size 1 --distributed-no-spawn \
--pipeline-model-parallel \
--pipeline-chunks 1 \
--pipeline-encoder-balance '[1,15,10]' \
--pipeline-encoder-devices '[0,1,0]' \
--pipeline-decoder-balance '[3,11,11,1]' \
--pipeline-decoder-devices '[0,2,3,0]'
but getting this error:
Traceback (most recent call last):
File "/home/jack/miniconda3/envs/fairseq/bin/fairseq-generate", line 33, in <module>
sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-generate')())
File "/home/jack/fairseq/fairseq_cli/generate.py", line 404, in cli_main
main(args)
File "/home/jack/fairseq/fairseq_cli/generate.py", line 49, in main
return _main(cfg, sys.stdout)
File "/home/jack/fairseq/fairseq_cli/generate.py", line 96, in _main
models, saved_cfg = checkpoint_utils.load_model_ensemble(
File "/home/jack/fairseq/fairseq/checkpoint_utils.py", line 297, in load_model_ensemble
ensemble, args, _task = load_model_ensemble_and_task(
File "/home/jack/fairseq/fairseq/checkpoint_utils.py", line 355, in load_model_ensemble_and_task
model.load_state_dict(state["model"], strict=strict, model_cfg=cfg.model)
File "/home/jack/fairseq/fairseq/model_parallel/models/pipeline_parallel_transformer/model.py", line 334, in load_state_dict
return super().load_state_dict(state_dict, strict)
File "/home/jack/fairseq/fairseq/models/fairseq_model.py", line 115, in load_state_dict
return super().load_state_dict(new_state_dict, strict)
File "/home/jack/miniconda3/envs/fairseq/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1223, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PipelineParallelTransformerModel:
size mismatch for model.partitions.0.0.embed_tokens.0.weight: copying a param with shape torch.Size([128112, 2048]) from checkpoint, the shape in current model is torch.Size([128139, 2048]).
size mismatch for model.partitions.0.0.embed_tokens.1.weight: copying a param with shape torch.Size([128112, 2048]) from checkpoint, the shape in current model is torch.Size([128139, 2048]).
size mismatch for model.partitions.2.26.embed_tokens.0.weight: copying a param with shape torch.Size([128112, 2048]) from checkpoint, the shape in current model is torch.Size([128139, 2048]).
size mismatch for model.partitions.2.26.embed_tokens.1.weight: copying a param with shape torch.Size([128112, 2048]) from checkpoint, the shape in current model is torch.Size([128139, 2048]).
size mismatch for model.partitions.5.51.embed_tokens.0.weight: copying a param with shape torch.Size([128112, 2048]) from checkpoint, the shape in current model is torch.Size([128139, 2048]).
size mismatch for model.partitions.5.51.embed_tokens.1.weight: copying a param with shape torch.Size([128112, 2048]) from checkpoint, the shape in current model is torch.Size([128139, 2048]).
It seems to be a dictionary issue, but there are only data_dict.128k.txt and model_dict.128k.txt, and no combination of these during preprocessing and training appears to work.
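Worth noting: only the first embedding dimension differs in those errors, and the gap is exactly 27 rows. Embedding rows are dictionary size plus appended special/language tokens, so if (an assumption) each language listed in the --lang-pairs file contributes one appended token, the setup in use defines 27 more languages than the checkpoint was saved with. That points at the dictionary/lang-pairs configuration rather than the model weights, consistent with the language_pairs_small_models.txt fix suggested later in this thread.

```python
# Back-of-envelope check on the size mismatch: the difference between the
# current model's embedding rows and the checkpoint's counts the extra
# tokens the current dictionary/lang-pairs setup appends.
def embedding_gap(current_rows, checkpoint_rows):
    return current_rows - checkpoint_rows

print(embedding_gap(128139, 128112))  # 27 extra tokens
```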
Same for me on 4x TESLA T4...
@shruti-bh I run the M2M_100 12B model on 8 Tesla T4 GPUs, but it's very inefficient (0.34 sentences/s, 9.99 tokens/s). How can I make it more efficient? These are my parameters:
fairseq-generate \
en-zh/data_bin \
--batch-size 1 \
--path ../../12b_last_chk_8_gpus/12b_last_chk_8_gpus.pt \
--fixed-dictionary model_dict.128k.txt \
-s en -t zh \
--remove-bpe 'sentencepiece' \
--beam 5 \
--task translation_multi_simple_epoch \
--lang-pairs language_pairs.txt \
--decoder-langtok --encoder-langtok src \
--gen-subset test \
--fp16 \
--dataset-impl mmap \
--distributed-world-size 1 --distributed-no-spawn \
--pipeline-model-parallel \
--pipeline-chunks 1 \
--pipeline-encoder-balance '[1,6,6,6,7]' \
--pipeline-encoder-devices '[0,4,5,1,0]' \
--pipeline-decoder-balance '[1,6,6,6,6,1]' \
--pipeline-decoder-devices '[0,2,6,7,3,0]'
Same for me on 8× V100; is it solved?
This may be a mistake in the docs: using language_pairs_small_models.txt instead of language_pairs.txt works for me.
My setup is 2*V100 32G SXM2, detailed script below:
src=it
tgt=zh
fairseq-generate \
data-bin/m2m100.$src-$tgt \
--batch-size 1 \
--path 12b_last_chk_2_gpus.pt \
--fixed-dictionary model_dict.128k.txt \
-s $src -t $tgt \
--remove-bpe 'sentencepiece' \
--beam 5 \
--task translation_multi_simple_epoch \
--lang-pairs language_pairs_small_models.txt \
--decoder-langtok --encoder-langtok src \
--gen-subset test \
--fp16 \
--dataset-impl mmap \
--distributed-world-size 1 --distributed-no-spawn \
--model-overrides '{"ddp_backend": "c10d", "pipeline_balance": "29, 22, 1" , "pipeline_devices": "0, 1, 0" }' \
--pipeline-model-parallel \
--pipeline-chunks 1 \
--pipeline-encoder-balance '[26]' \
--pipeline-encoder-devices '[0]' \
--pipeline-decoder-balance '[3,22,1]' \
--pipeline-decoder-devices '[0,1,0]'
Generation is extremely slow, wps=17, takes 24G/32G memory per GPU.
After switching to fairseq==0.10.2 (python=3.7, pip install fairscale) instead of the current master version on GitHub, I am able to run it with the following command on 4× RTX 6000:
fairseq-generate data_bin \
--batch-size 1 \
--path 12b_last_chk_4_gpus.pt \
--fixed-dictionary model_dict.128k.txt \
-s de -t fr \
--remove-bpe 'sentencepiece' \
--beam 5 \
--task translation_multi_simple_epoch \
--lang-pairs language_pairs.txt \
--decoder-langtok --encoder-langtok src \
--gen-subset test \
--fp16 \
--model-overrides '{"ddp_backend": "c10d", "pipeline_balance": "1, 15, 13, 11, 11, 1" , "pipeline_devices": "0, 1, 0, 2, 3, 0" }' \
--dataset-impl mmap \
--distributed-world-size 1 --distributed-no-spawn \
--pipeline-model-parallel \
--pipeline-chunks 1 \
--pipeline-encoder-balance '[1,15,10]' \
--pipeline-encoder-devices '[0,1,0]' \
--pipeline-decoder-balance '[3,11,11,1]' \
--pipeline-decoder-devices '[0,2,3,0]'
output:
2021-11-29 13:57:11 | INFO | fairseq.data.multilingual.multilingual_data_manager | [de] dictionary: 128112 types
2021-11-29 13:57:11 | INFO | fairseq.data.multilingual.multilingual_data_manager | [fr] dictionary: 128112 types
2021-11-29 13:57:11 | INFO | fairseq.tasks.translation_multi_simple_epoch | loading data for test epoch=1/None
2021-11-29 13:57:11 | INFO | fairseq.tasks.translation_multi_simple_epoch | mem usage: N/A
2021-11-29 13:57:11 | INFO | fairseq.data.multilingual.multilingual_data_manager | langtoks settings: {'main': ('src', 'tgt')}
2021-11-29 13:57:11 | INFO | fairseq.data.multilingual.multilingual_data_manager | [test] num of shards: {'main:de-fr': 1}
2021-11-29 13:57:11 | INFO | fairseq.data.multilingual.multilingual_data_manager | main:de-fr src_langtok: 128020; tgt_langtok: 128028
2021-11-29 13:57:11 | INFO | fairseq.data.data_utils | loaded 20 examples from: data_bin/test.de-fr.de
2021-11-29 13:57:11 | INFO | fairseq.data.data_utils | loaded 20 examples from: data_bin/test.de-fr.fr
2021-11-29 13:57:11 | INFO | fairseq.data.multilingual.multilingual_data_manager | data_bin test de-fr 20 examples
2021-11-29 13:57:11 | INFO | fairseq_cli.generate | loading model(s) from 12b_last_chk_4_gpus.pt
2021-11-29 14:00:29 | INFO | fairseq.tasks.translation_multi_simple_epoch | start batch sampler: mem usage: N/A
2021-11-29 14:00:29 | INFO | fairseq.tasks.translation_multi_simple_epoch | [test] @batch_sampler order indices time: 0:00:00.002948
2021-11-29 14:00:29 | INFO | fairseq.tasks.translation_multi_simple_epoch | mem usage: N/A
2021-11-29 14:00:29 | INFO | fairseq.tasks.translation_multi_simple_epoch | [test] @batch_sampler filter_by_size time: 0:00:00.000244
2021-11-29 14:00:29 | INFO | fairseq.tasks.translation_multi_simple_epoch | mem usage: N/A
2021-11-29 14:00:29 | INFO | fairseq.tasks.translation_multi_simple_epoch | [test] @batch_sampler batch_by_size time: 0:00:00.003375
2021-11-29 14:00:29 | INFO | fairseq.tasks.translation_multi_simple_epoch | [test] per epoch batch_sampler set-up time: 0:00:00.007069
2021-11-29 14:00:29 | INFO | fairseq.tasks.translation_multi_simple_epoch | mem usage: N/A
0%| | 0/20 [00:00<?, ?it/s]
unfin_idx = idx // beam_size
S-9 __de__ Der Parteitag spendete ihr Genesungswünsche.
T-9 Le congrès du parti lui a souhaité un bon rétablissement.
H-9 -1.0958120822906494 Le congrès a fait des vœux de rétablissement.
D-9 -1.0958120822906494 Le congrès a fait des vœux de rétablissement.
P-9 -5.5446 -1.0863 -1.8819 -0.0843 -1.4484 -2.5844 -1.6294 -0.7081 -0.0993 -0.1612 -0.2383 -0.9995 -0.5417 -0.1964 -0.1850 -0.1443
@all, does generation only work on 32GB GPUs? What if I have 48GB GPUs?
I used m2m_100 successfully last year. Now I tried to generate ro-en translations with the current fairseq version and the m2m100 models for 6 and 8 GPUs. However, I get the same size-mismatch errors as described above: size mismatch for model.partitions.9.51.embed_tokens.1.weight: copying a param with shape torch.Size([128112, 2048]) from checkpoint, the shape in current model is torch.Size([128139, 2048]).
Then I reverted to an older version of the repo which I thought could work. There I got another error: fairseq/fairseq/data/multilingual/multilingual_data_manager.py", line 903, in get_split_data_param_list paths, epoch, shard_epoch, split_num_shards_dict[key] KeyError: 'main:ro-en'
Could you help me?
Hugging Face supports M2M now.
@edchengg, thanks a lot for your reply! I was trying something with a custom class built in a fork of fairseq, so for me it would have been better to keep working with fairseq. I will see if I can put my code somewhere inside a Hugging Face fork, or I will train my own ro-en model.
Hello. I am getting the same hydra error when fine-tuning the large (12B) model. I am using 2× 32GB GPUs and the 2-GPU pretrained checkpoint, with the pipeline parameters set as --pipeline-balance '[29,22,1]' --pipeline-devices '[0,1,0]'. The error I am getting is hydra.errors.ConfigCompositionException: Error merging override distributed_training.ddp_backend='simple'. As there is no --model-overrides option in train, any suggestion as to what I should do? Thank you.