
Hydra error in fairseq-generate cli (task: translation_multi_simple_epoch)

Open braunefe opened this issue 3 years ago • 30 comments

🐛 Bug

When running fairseq-generate for task translation_multi_simple_epoch, I get the hydra config error: hydra.errors.ConfigCompositionException: Error merging override distributed_training.ddp_backend='simple'

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

Run command:

fairseq-generate \
    data_bin \
    --batch-size 1 \
    --path 12b_last_chk_4_gpus \
    --fixed-dictionary model_dict.128k.txt \
    -s de -t fr \
    --remove-bpe 'sentencepiece' \
    --beam 5 \
    --task translation_multi_simple_epoch \
    --lang-pairs language_pairs.txt \
    --decoder-langtok --encoder-langtok src \
    --gen-subset test \
    --fp16 \
    --dataset-impl mmap \
    --distributed-world-size 1 --distributed-no-spawn \
    --pipeline-model-parallel \
    --pipeline-chunks 1 \
    --pipeline-encoder-balance '[1,15,10]' \
    --pipeline-encoder-devices '[0,1,0]' \
    --pipeline-decoder-balance '[3,11,11,1]' \
    --pipeline-decoder-devices '[0,2,3,0]' > gen_out

Error:

Traceback (most recent call last):
  File "/raid/user-data/fbraune/anaconda3/envs/m2m/lib/python3.8/site-packages/hydra/_internal/config_loader_impl.py", line 513, in _apply_overrides_to_config
    OmegaConf.update(cfg, key, value, merge=True)
  File "/raid/user-data/fbraune/anaconda3/envs/m2m/lib/python3.8/site-packages/omegaconf/omegaconf.py", line 613, in update
    root.__setattr__(last_key, value)
  File "/raid/user-data/fbraune/anaconda3/envs/m2m/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 278, in __setattr__
    self._format_and_raise(key=key, value=value, cause=e)
  File "/raid/user-data/fbraune/anaconda3/envs/m2m/lib/python3.8/site-packages/omegaconf/base.py", line 95, in _format_and_raise
    format_and_raise(
  File "/raid/user-data/fbraune/anaconda3/envs/m2m/lib/python3.8/site-packages/omegaconf/_utils.py", line 694, in format_and_raise
    _raise(ex, cause)
  File "/raid/user-data/fbraune/anaconda3/envs/m2m/lib/python3.8/site-packages/omegaconf/_utils.py", line 610, in _raise
    raise ex  # set end OC_CAUSE=1 for full backtrace
omegaconf.errors.ValidationError: Invalid value 'simple', expected one of [c10d, no_c10d]
    full_key: distributed_training.ddp_backend
    reference_type=DistributedTrainingConfig
    object_type=DistributedTrainingConfig

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/raid/user-data/fbraune/anaconda3/envs/m2m/bin/fairseq-generate", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-generate')())
  File "/raid/user-data/fbraune/MachineTranslation/fairseq/fairseq_cli/generate.py", line 389, in cli_main
    main(args)
  File "/raid/user-data/fbraune/MachineTranslation/fairseq/fairseq_cli/generate.py", line 50, in main
    return _main(cfg, sys.stdout)
  File "/raid/user-data/fbraune/MachineTranslation/fairseq/fairseq_cli/generate.py", line 97, in _main
    models, _model_args = checkpoint_utils.load_model_ensemble(
  File "/raid/user-data/fbraune/MachineTranslation/fairseq/fairseq/checkpoint_utils.py", line 257, in load_model_ensemble
    ensemble, args, _task = load_model_ensemble_and_task(
  File "/raid/user-data/fbraune/MachineTranslation/fairseq/fairseq/checkpoint_utils.py", line 287, in load_model_ensemble_and_task
    state = load_checkpoint_to_cpu(filename, arg_overrides)
  File "/raid/user-data/fbraune/MachineTranslation/fairseq/fairseq/checkpoint_utils.py", line 239, in load_checkpoint_to_cpu
    state = _upgrade_state_dict(state)
  File "/raid/user-data/fbraune/MachineTranslation/fairseq/fairseq/checkpoint_utils.py", line 460, in _upgrade_state_dict
    state["cfg"] = convert_namespace_to_omegaconf(state["args"])
  File "/raid/user-data/fbraune/MachineTranslation/fairseq/fairseq/dataclass/utils.py", line 295, in convert_namespace_to_omegaconf
    composed_cfg = compose("config", overrides=overrides, strict=False)
  File "/raid/user-data/fbraune/anaconda3/envs/m2m/lib/python3.8/site-packages/hydra/experimental/compose.py", line 31, in compose
    cfg = gh.hydra.compose_config(
  File "/raid/user-data/fbraune/anaconda3/envs/m2m/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 507, in compose_config
    cfg = self.config_loader.load_configuration(
  File "/raid/user-data/fbraune/anaconda3/envs/m2m/lib/python3.8/site-packages/hydra/_internal/config_loader_impl.py", line 151, in load_configuration
    return self._load_configuration(
  File "/raid/user-data/fbraune/anaconda3/envs/m2m/lib/python3.8/site-packages/hydra/_internal/config_loader_impl.py", line 277, in _load_configuration
    ConfigLoaderImpl._apply_overrides_to_config(config_overrides, cfg)
  File "/raid/user-data/fbraune/anaconda3/envs/m2m/lib/python3.8/site-packages/hydra/_internal/config_loader_impl.py", line 520, in _apply_overrides_to_config
    raise ConfigCompositionException(
hydra.errors.ConfigCompositionException: Error merging override distributed_training.ddp_backend='simple'

Expected behavior

Translation generation

Environment

  • fairseq Version: '1.0.0a0+108f720'
  • PyTorch Version: 1.7
  • OS: Linux
  • How you installed fairseq: pip
  • Build command you used: cloned the current fairseq repo, then pip install --editable ./
  • Python version: 3.8.5
  • CUDA/cuDNN version: 10.1
  • GPU models and configuration: used pre-trained model 12b_last_chk_4_gpus
  • Any other relevant information:

braunefe avatar Nov 11 '20 13:11 braunefe

Why is ddp_backend set to "simple" in that checkpoint? This is not one of the values we support (we only support c10d and no_c10d). Where did the checkpoint you are loading come from?

To work around this particular error you can add "simple" as an option to fairseq/dataclass/constants.py in your checked-out copy (I can't guarantee it won't crash elsewhere later), or update the checkpoint and replace ddp_backend with one of those two values.
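
A minimal sketch of the second workaround (editing the checkpoint). This is not code from the thread; it assumes the checkpoint keeps its options under state["args"] as an argparse Namespace, which is what the traceback above suggests, and the paths are placeholders:

import torch

# Load the checkpoint on CPU, patch the unsupported value, and save a copy.
path = "12b_last_chk_4_gpus.pt"
state = torch.load(path, map_location="cpu")

args = state.get("args")
if args is not None and getattr(args, "ddp_backend", None) == "simple":
    args.ddp_backend = "c10d"  # or "no_c10d"

torch.save(state, "12b_last_chk_4_gpus_fixed.pt")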

alexeib avatar Nov 13 '20 01:11 alexeib

@shruti-bh do you know why ddp_backend is set to this value?

alexeib avatar Nov 13 '20 01:11 alexeib

I downloaded it from here: https://dl.fbaipublicfiles.com/m2m_100/12b_last_chk_4_gpus.pt It was the 27th of October, though.

braunefe avatar Nov 13 '20 10:11 braunefe

I downloaded the model again but get the same error.

braunefe avatar Nov 13 '20 11:11 braunefe

I downloaded it from here: https://dl.fbaipublicfiles.com/m2m_100/12b_last_chk_4_gpus.pt It was the 27th of October, though.

Same with https://dl.fbaipublicfiles.com/m2m_100/12b_last_chk_2_gpus.pt

laeubli avatar Nov 13 '20 13:11 laeubli

To work around this particular error you can add "simple" as an option to fairseq/dataclass/constants.py in your checked-out copy (I can't guarantee it won't crash elsewhere later), or update the checkpoint and replace ddp_backend with one of those two values.

Adding "simple" to the ChoiceEnum DDP_BACKEND_CHOICES in fairseq/dataclass/constants.py works as such, but just gets me to the next error:

hydra.errors.ConfigCompositionException: Error merging override distributed_training.pipeline_balance=[29, 22, 1]
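
For reference, the constants.py tweak described above amounts to something like the line below (a sketch; the exact contents of fairseq/dataclass/constants.py depend on your checkout, which already defines ChoiceEnum in that file):

# fairseq/dataclass/constants.py (edited in place)
DDP_BACKEND_CHOICES = ChoiceEnum(["c10d", "no_c10d", "simple"])  # "simple" added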

laeubli avatar Nov 13 '20 13:11 laeubli

@laeubli Sorry about this issue. Can you try adding these overrides to the generate.py command? A more long-term fix will be coming up soon. --model-overrides '{"ddp_backend": "c10d", "pipeline_balance": "1, 15, 13, 11, 11, 1" , "pipeline_devices": "0, 1, 0, 2, 3, 0" }'
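
For context, --model-overrides is parsed into a Python dict and applied as argument overrides when the checkpoint is loaded, roughly like the sketch below (the function names come from the tracebacks in this thread; treat the exact call as an illustration rather than a guaranteed API):

import ast
from fairseq import checkpoint_utils

# The same string you would pass on the command line.
overrides = ast.literal_eval(
    '{"ddp_backend": "c10d", "pipeline_balance": "1, 15, 13, 11, 11, 1", '
    '"pipeline_devices": "0, 1, 0, 2, 3, 0"}'
)
models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(
    ["12b_last_chk_4_gpus.pt"], arg_overrides=overrides
)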

shruti-bh avatar Nov 13 '20 20:11 shruti-bh

@laeubli Sorry about this issue. Can you try adding these overrides to the generate.py command? A more long-term fix will be coming up soon. --model-overrides '{"ddp_backend": "c10d", "pipeline_balance": "1, 15, 13, 11, 11, 1" , "pipeline_devices": "0, 1, 0, 2, 3, 0" }'

Thanks @shruti-bh! Are those params specific to the 2 GPU version? I've tested it with the 8 GPU version for now, and ended up with:

RuntimeError: Error(s) in loading state_dict for PipelineParallelTransformerModel:
	Missing key(s) in state_dict: "model.partitions.1.7.self_attn.k_proj.weight", [...]

It certainly got closer to generation than before though (VRAM actually started filling up).

laeubli avatar Nov 14 '20 00:11 laeubli

@laeubli - these parameters are for the 4 GPU version, since that is what you were using in your first run command.

shruti-bh avatar Nov 16 '20 04:11 shruti-bh

Thanks @shruti-bh, but I am getting another error. I am running on 4 GPUs (6, 5, 4, 3). Run command:

fairseq-generate \
    data_bin \
    --batch-size 1 \
    --path 12b_last_chk_4_gpus \
    --fixed-dictionary model_dict.128k.txt \
    -s de -t fr \
    --remove-bpe 'sentencepiece' \
    --beam 5 \
    --task translation_multi_simple_epoch \
    --lang-pairs language_pairs.txt \
    --decoder-langtok --encoder-langtok src \
    --gen-subset test \
    --fp16 \
    --model-overrides '{"ddp_backend": "c10d", "pipeline_balance": "1, 15, 13, 11, 11, 1" , "pipeline_devices": "6, 5, 6, 4, 3, 6" }' \
    --dataset-impl mmap \
    --distributed-world-size 1 --distributed-no-spawn \
    --pipeline-model-parallel \
    --pipeline-chunks 1 \
    --pipeline-encoder-balance '[1,15,10]' \
    --pipeline-encoder-devices '[6,5,6]' \
    --pipeline-decoder-balance '[3,11,11,1]' \
    --pipeline-decoder-devices '[6,4,3,6]' > gen_out

Error:

Traceback (most recent call last):
  File "/raid/user-data/fbraune/anaconda3/envs/m2m/bin/fairseq-generate", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-generate')())
  File "/raid/user-data/fbraune/MachineTranslation/fairseq/fairseq_cli/generate.py", line 389, in cli_main
    main(args)
  File "/raid/user-data/fbraune/MachineTranslation/fairseq/fairseq_cli/generate.py", line 50, in main
    return _main(cfg, sys.stdout)
  File "/raid/user-data/fbraune/MachineTranslation/fairseq/fairseq_cli/generate.py", line 199, in _main
    hypos = task.inference_step(
  File "/raid/user-data/fbraune/MachineTranslation/fairseq/fairseq/tasks/translation_multi_simple_epoch.py", line 235, in inference_step
    return generator.generate(
  File "/raid/user-data/fbraune/anaconda3/envs/m2m/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/raid/user-data/fbraune/MachineTranslation/fairseq/fairseq/sequence_generator.py", line 177, in generate
    return self._generate(sample, **kwargs)
  File "/raid/user-data/fbraune/MachineTranslation/fairseq/fairseq/sequence_generator.py", line 242, in _generate
    encoder_outs = self.model.reorder_encoder_out(encoder_outs, new_order)
  File "/raid/user-data/fbraune/MachineTranslation/fairseq/fairseq/sequence_generator.py", line 888, in reorder_encoder_out
    model.encoder.reorder_encoder_out(encoder_outs[i], new_order)
  File "/raid/user-data/fbraune/MachineTranslation/fairseq/fairseq/model_parallel/models/pipeline_parallel_transformer/model.py", line 512, in reorder_encoder_out
    encoder_out=encoder_out.encoder_out.index_select(1, new_order)
RuntimeError: Input, output and indices must be on the current device

braunefe avatar Nov 17 '20 09:11 braunefe

Is device 6 your default CUDA device? If not, I suggest replacing the "0" entries in my "pipeline-devices" with your default CUDA device.
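
A quick way to confirm which GPU PyTorch treats as the default device, i.e. what "0" means after any CUDA_VISIBLE_DEVICES remapping (my own sketch, not from the thread):

import torch

# Index 0 after remapping is the "default" device the pipeline options refer to.
idx = torch.cuda.current_device()
print(idx, torch.cuda.get_device_name(idx))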

shruti-bh avatar Nov 17 '20 19:11 shruti-bh

@shruti-bh Thank you so much! It is working now. I just changed the visible devices to my GPUs and ran the command as you suggested. Run command:

CUDA_VISIBLE_DEVICES=3,4,5,6 fairseq-generate \
    data_bin \
    --batch-size 1 \
    --path 12b_last_chk_4_gpus \
    --fixed-dictionary model_dict.128k.txt \
    -s de -t fr \
    --remove-bpe 'sentencepiece' \
    --beam 5 \
    --task translation_multi_simple_epoch \
    --lang-pairs language_pairs.txt \
    --decoder-langtok --encoder-langtok src \
    --gen-subset test \
    --fp16 \
    --model-overrides '{"ddp_backend": "c10d", "pipeline_balance": "1, 15, 13, 11, 11, 1" , "pipeline_devices": "0, 1, 0, 2, 3, 0" }' \
    --dataset-impl mmap \
    --distributed-world-size 1 --distributed-no-spawn \
    --pipeline-model-parallel \
    --pipeline-chunks 1 \
    --pipeline-encoder-balance '[1,15,10]' \
    --pipeline-encoder-devices '[0,1,0]' \
    --pipeline-decoder-balance '[3,11,11,1]' \
    --pipeline-decoder-devices '[0,2,3,0]'

braunefe avatar Nov 18 '20 11:11 braunefe

What parameters should I use in --model-overrides for 12b_last_chk_8_gpus?

caijie1990 avatar Nov 23 '20 09:11 caijie1990

@shruti-bh What parameters should I use in --model-overrides for the 12b_last_chk_2_gpus configuration?

bhakthan avatar Nov 24 '20 10:11 bhakthan

@laeubli Sorry about this issue. Can you try adding these overrides to the generate.py command? A more long-term fix will be coming up soon. --model-overrides '{"ddp_backend": "c10d", "pipeline_balance": "1, 15, 13, 11, 11, 1" , "pipeline_devices": "0, 1, 0, 2, 3, 0" }'

Where can I find the specific parameters for 12b_last_chk_6_gpus or 12b_last_chk_8_gpus?

tongmeihan1995 avatar Nov 25 '20 03:11 tongmeihan1995

https://fairseq.readthedocs.io/en/latest/command_line_tools.html

Tried the following as --model-overrides for the 12b_last_chk_2_gpus configuration: --model-overrides '{"ddp_backend": "c10d", "pipeline_balance": "29, 22, 1" , "pipeline_devices": "0, 1, 0" }'

The process gets killed. No luck yet.

bhakthan avatar Nov 25 '20 05:11 bhakthan

Hi @caijie1990, @tongmeihan1995, @shruti-bh, did you find the --model-overrides parameters for 12b_last_chk_8_gpus? Has any of you successfully managed to run this configuration? My process gets killed due to OOM.

Thanks. P.S. I use the provided generate command with the [...] parameters given for 8 × 8 GB GPUs.

lauhaide avatar Nov 30 '20 14:11 lauhaide

Hi @shruti-bh, I am still getting the error hydra.errors.ConfigCompositionException: Error merging override distributed_training.ddp_backend='simple'. Can you please explain how I can solve it? Thanks

GoldMan6 avatar Jan 11 '21 10:01 GoldMan6

@laeubli Sorry about this issue. Can you try adding these overrides to the generate.py command? A more long-term fix will be coming up soon. --model-overrides '{"ddp_backend": "c10d", "pipeline_balance": "1, 15, 13, 11, 11, 1" , "pipeline_devices": "0, 1, 0, 2, 3, 0" }'

Where can I find the specific parameters for 12b_last_chk_6_gpus or 12b_last_chk_8_gpus?

After some trial and error, I've found the specific parameters for the 6 GPU model to be as follows:

--model-overrides '{"ddp_backend": "c10d", "pipeline_balance": "1, 9, 9, 10, 7, 7, 8, 1" , "pipeline_devices": "0, 1, 2, 0, 3, 4, 5, 0" }'

Based on that, if I had to guess, the parameters for the 8 GPU model might possibly be this (although I have no way of testing this):

--model-overrides '{"ddp_backend": "c10d", "pipeline_balance": "1, 6, 6, 6, 8, 6, 6, 6, 6, 1" , "pipeline_devices": "0, 4, 5, 1, 0, 2, 6, 7, 3, 0" }'
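
As a sanity check on such guesses (my own sketch with a hypothetical helper, not from the thread): every pipeline_balance layout for the same 12B checkpoint should sum to the same number of partitions, which is 52 for all layouts quoted in this thread.

def total(balance: str) -> int:
    # Accepts either the bracketed or the comma-separated form used above.
    return sum(int(x) for x in balance.strip("[]").split(","))

print(total("1, 15, 13, 11, 11, 1"))          # 4-GPU layout   -> 52
print(total("1, 9, 9, 10, 7, 7, 8, 1"))       # 6-GPU layout   -> 52
print(total("1, 6, 6, 6, 8, 6, 6, 6, 6, 1"))  # guessed 8-GPU  -> 52
print(total("29, 22, 1"))                      # 2-GPU layout   -> 52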

ansuckynoob avatar Jan 28 '21 17:01 ansuckynoob

Trying to generate with 4 RTX 3090s:

fairseq-generate \
    bin \
    --batch-size 1 \
    --path 12b_last_chk_4_gpus.pt \
    --fixed-dictionary model_dict.128k.txt \
    -s en -t zh \
    --remove-bpe 'sentencepiece' \
    --beam 5 \
    --task translation_multi_simple_epoch \
    --lang-pairs language_pairs.txt \
    --decoder-langtok --encoder-langtok src \
    --gen-subset test \
    --fp16 \
    --dataset-impl mmap \
    --model-overrides '{"ddp_backend": "c10d", "pipeline_balance": "1, 15, 13, 11, 11, 1" , "pipeline_devices": "0, 1, 0, 2, 3, 0" }' \
    --distributed-world-size 1 --distributed-no-spawn \
    --pipeline-model-parallel \
    --pipeline-chunks 1 \
    --pipeline-encoder-balance '[1,15,10]' \
    --pipeline-encoder-devices '[0,1,0]' \
    --pipeline-decoder-balance '[3,11,11,1]' \
    --pipeline-decoder-devices '[0,2,3,0]'

but getting this error:


Traceback (most recent call last):
  File "/home/jack/miniconda3/envs/fairseq/bin/fairseq-generate", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-generate')())
  File "/home/jack/fairseq/fairseq_cli/generate.py", line 404, in cli_main
    main(args)
  File "/home/jack/fairseq/fairseq_cli/generate.py", line 49, in main
    return _main(cfg, sys.stdout)
  File "/home/jack/fairseq/fairseq_cli/generate.py", line 96, in _main
    models, saved_cfg = checkpoint_utils.load_model_ensemble(
  File "/home/jack/fairseq/fairseq/checkpoint_utils.py", line 297, in load_model_ensemble
    ensemble, args, _task = load_model_ensemble_and_task(
  File "/home/jack/fairseq/fairseq/checkpoint_utils.py", line 355, in load_model_ensemble_and_task
    model.load_state_dict(state["model"], strict=strict, model_cfg=cfg.model)
  File "/home/jack/fairseq/fairseq/model_parallel/models/pipeline_parallel_transformer/model.py", line 334, in load_state_dict
    return super().load_state_dict(state_dict, strict)
  File "/home/jack/fairseq/fairseq/models/fairseq_model.py", line 115, in load_state_dict
    return super().load_state_dict(new_state_dict, strict)
  File "/home/jack/miniconda3/envs/fairseq/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1223, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PipelineParallelTransformerModel:
	size mismatch for model.partitions.0.0.embed_tokens.0.weight: copying a param with shape torch.Size([128112, 2048]) from checkpoint, the shape in current model is torch.Size([128139, 2048]).
	size mismatch for model.partitions.0.0.embed_tokens.1.weight: copying a param with shape torch.Size([128112, 2048]) from checkpoint, the shape in current model is torch.Size([128139, 2048]).
	size mismatch for model.partitions.2.26.embed_tokens.0.weight: copying a param with shape torch.Size([128112, 2048]) from checkpoint, the shape in current model is torch.Size([128139, 2048]).
	size mismatch for model.partitions.2.26.embed_tokens.1.weight: copying a param with shape torch.Size([128112, 2048]) from checkpoint, the shape in current model is torch.Size([128139, 2048]).
	size mismatch for model.partitions.5.51.embed_tokens.0.weight: copying a param with shape torch.Size([128112, 2048]) from checkpoint, the shape in current model is torch.Size([128139, 2048]).
	size mismatch for model.partitions.5.51.embed_tokens.1.weight: copying a param with shape torch.Size([128112, 2048]) from checkpoint, the shape in current model is torch.Size([128139, 2048]).

It seems to be a dictionary issue, but there are only data_dict.128k.txt and model_dict.128k.txt, and no combination of these during preprocessing and training appears to work.

bmtm avatar Feb 11 '21 06:02 bmtm

Trying to generate with 4 RTX 3090s: [full command and size-mismatch traceback quoted from bmtm's comment above]

Same for me on 4x TESLA T4...

jknafou avatar Mar 26 '21 14:03 jknafou

@shruti-bh run M2M_100 12B model, I use 8 Tesla T4 GPUs,but It's very inefficient. (0.34 sentences/s, 9.99 tokens/s). I want to be more efficient,help me. These are my parameters

fairseq-generate
en-zh/data_bin
--batch-size 1
--path ../../12b_last_chk_8_gpus/12b_last_chk_8_gpus.pt
--fixed-dictionary model_dict.128k.txt
-s en -t zh
--remove-bpe 'sentencepiece'
--beam 5
--task translation_multi_simple_epoch
--lang-pairs language_pairs.txt
--decoder-langtok --encoder-langtok src
--gen-subset test
--fp16
--dataset-impl mmap
--distributed-world-size 1 --distributed-no-spawn
--pipeline-model-parallel
--pipeline-chunks 1
--pipeline-encoder-balance '[1,6,6,6,7]'
--pipeline-encoder-devices '[0,4,5,1,0]'
--pipeline-decoder-balance '[1,6,6,6,6,1]'
--pipeline-decoder-devices '[0,2,6,7,3,0]' \

im-yangp avatar Apr 15 '21 10:04 im-yangp

Trying to generate with 4 RTX 3090s: [full command and size-mismatch traceback quoted from bmtm's comment above]

Same for me on 4x TESLA T4...

Same for me on 8× V100. Is it solved?

Hanlard avatar Jun 10 '21 08:06 Hanlard

Trying to generate with 4 RTX 3090s: [full command and size-mismatch traceback quoted from bmtm's comment above]

Same for me on 4x TESLA T4...

Same for me on 8× V100. Is it solved?

This may be a mistake in the docs. Using language_pairs_small_models.txt instead of language_pairs.txt works for me.

My setup is 2× V100 32GB SXM2; detailed script below:

src=it
tgt=zh
fairseq-generate \
    data-bin/m2m100.$src-$tgt \
    --batch-size 1 \
    --path 12b_last_chk_2_gpus.pt \
    --fixed-dictionary model_dict.128k.txt \
    -s $src -t $tgt \
    --remove-bpe 'sentencepiece' \
    --beam 5 \
    --task translation_multi_simple_epoch \
    --lang-pairs language_pairs_small_models.txt \
    --decoder-langtok --encoder-langtok src \
    --gen-subset test \
    --fp16 \
    --dataset-impl mmap \
    --distributed-world-size 1 --distributed-no-spawn \
    --model-overrides '{"ddp_backend": "c10d", "pipeline_balance": "29, 22, 1" , "pipeline_devices": "0, 1, 0" }' \
    --pipeline-model-parallel \
    --pipeline-chunks 1 \
    --pipeline-encoder-balance '[26]' \
    --pipeline-encoder-devices '[0]' \
    --pipeline-decoder-balance '[3,22,1]' \
    --pipeline-decoder-devices '[0,1,0]'

Generation is extremely slow (wps=17) and takes 24 GB of the 32 GB memory per GPU.

Maxwell-Lyu avatar Aug 21 '21 03:08 Maxwell-Lyu

Trying to generate with 4 RTX 3090s: [full command and size-mismatch traceback quoted from bmtm's comment above]

Same for me on 4x TESLA T4...

Same for me on 8× V100. Is it solved?

After switching to fairseq==0.10.2 (python=3.7, pip install fairscale) instead of the current master version on GitHub, I am able to run it with the following command on 4 × RTX 6000:

fairseq-generate data_bin \
    --batch-size 1 \
    --path 12b_last_chk_4_gpus.pt \
    --fixed-dictionary model_dict.128k.txt \
    -s de -t fr \
    --remove-bpe 'sentencepiece' \
    --beam 5 \
    --task translation_multi_simple_epoch \
    --lang-pairs language_pairs.txt \
    --decoder-langtok --encoder-langtok src \
    --gen-subset test \
    --fp16 \
    --model-overrides '{"ddp_backend": "c10d", "pipeline_balance": "1, 15, 13, 11, 11, 1" , "pipeline_devices": "0, 1, 0, 2, 3, 0" }' \
    --dataset-impl mmap \
    --distributed-world-size 1 --distributed-no-spawn \
    --pipeline-model-parallel \
    --pipeline-chunks 1 \
    --pipeline-encoder-balance '[1,15,10]' \
    --pipeline-encoder-devices '[0,1,0]' \
    --pipeline-decoder-balance '[3,11,11,1]' \
    --pipeline-decoder-devices '[0,2,3,0]'

output:

2021-11-29 13:57:11 | INFO | fairseq.data.multilingual.multilingual_data_manager | [de] dictionary: 128112 types
2021-11-29 13:57:11 | INFO | fairseq.data.multilingual.multilingual_data_manager | [fr] dictionary: 128112 types
2021-11-29 13:57:11 | INFO | fairseq.tasks.translation_multi_simple_epoch | loading data for test epoch=1/None
2021-11-29 13:57:11 | INFO | fairseq.tasks.translation_multi_simple_epoch | mem usage: N/A
2021-11-29 13:57:11 | INFO | fairseq.data.multilingual.multilingual_data_manager | langtoks settings: {'main': ('src', 'tgt')}
2021-11-29 13:57:11 | INFO | fairseq.data.multilingual.multilingual_data_manager | [test] num of shards: {'main:de-fr': 1}
2021-11-29 13:57:11 | INFO | fairseq.data.multilingual.multilingual_data_manager | main:de-fr src_langtok: 128020; tgt_langtok: 128028
2021-11-29 13:57:11 | INFO | fairseq.data.data_utils | loaded 20 examples from: data_bin/test.de-fr.de
2021-11-29 13:57:11 | INFO | fairseq.data.data_utils | loaded 20 examples from: data_bin/test.de-fr.fr
2021-11-29 13:57:11 | INFO | fairseq.data.multilingual.multilingual_data_manager | data_bin test de-fr 20 examples
2021-11-29 13:57:11 | INFO | fairseq_cli.generate | loading model(s) from 12b_last_chk_4_gpus.pt
2021-11-29 14:00:29 | INFO | fairseq.tasks.translation_multi_simple_epoch | start batch sampler: mem usage: N/A
2021-11-29 14:00:29 | INFO | fairseq.tasks.translation_multi_simple_epoch | [test] @batch_sampler order indices time: 0:00:00.002948
2021-11-29 14:00:29 | INFO | fairseq.tasks.translation_multi_simple_epoch | mem usage: N/A
2021-11-29 14:00:29 | INFO | fairseq.tasks.translation_multi_simple_epoch | [test] @batch_sampler filter_by_size time: 0:00:00.000244
2021-11-29 14:00:29 | INFO | fairseq.tasks.translation_multi_simple_epoch | mem usage: N/A
2021-11-29 14:00:29 | INFO | fairseq.tasks.translation_multi_simple_epoch | [test] @batch_sampler batch_by_size time: 0:00:00.003375
2021-11-29 14:00:29 | INFO | fairseq.tasks.translation_multi_simple_epoch | [test] per epoch batch_sampler set-up time: 0:00:00.007069
2021-11-29 14:00:29 | INFO | fairseq.tasks.translation_multi_simple_epoch | mem usage: N/A
  0%|          | 0/20 [00:00<?, ?it/s]
unfin_idx = idx // beam_size
S-9	__de__ Der Parteitag spendete ihr Genesungswünsche.
T-9	Le congrès du parti lui a souhaité un bon rétablissement.
H-9	-1.0958120822906494	Le congrès a fait des vœux de rétablissement.
D-9	-1.0958120822906494	Le congrès a fait des vœux de rétablissement.
P-9	-5.5446 -1.0863 -1.8819 -0.0843 -1.4484 -2.5844 -1.6294 -0.7081 -0.0993 -0.1612 -0.2383 -0.9995 -0.5417 -0.1964 -0.1850 -0.1443
  5%|█████▊

edchengg avatar Nov 29 '21 05:11 edchengg

@all, does generation only work on 32 GB GPUs? What if I have 48 GB GPUs?

nikhiljaiswal avatar Dec 04 '21 10:12 nikhiljaiswal

I already used m2m_100 successfully last year. Now I tried to generate ro-en translations with the current fairseq version and the m2m100 models for 6 and 8 GPUs. However, I get the same size-mismatch errors as the ones described above: size mismatch for model.partitions.9.51.embed_tokens.1.weight: copying a param with shape torch.Size([128112, 2048]) from checkpoint, the shape in current model is torch.Size([128139, 2048]).

Then I reverted to an older version of the repo that I thought could work. There I got another error: fairseq/fairseq/data/multilingual/multilingual_data_manager.py", line 903, in get_split_data_param_list paths, epoch, shard_epoch, split_num_shards_dict[key] KeyError: 'main:ro-en'

Could you help me?

damyana79 avatar Apr 03 '22 19:04 damyana79

[quotes damyana79's question above about the size-mismatch and KeyError: 'main:ro-en' errors]

Hugging Face supports M2M-100 now.
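
A rough sketch of that route, using the smaller 418M M2M-100 checkpoint as an example (the model id and API names are from the transformers library as I recall them, so double-check against its documentation):

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

# ro -> en, matching the use case above; the input sentence is just an example.
tokenizer.src_lang = "ro"
inputs = tokenizer("Acesta este un exemplu.", return_tensors="pt")
generated = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("en"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))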

edchengg avatar Apr 03 '22 20:04 edchengg

@edchengg, thanks a lot for your reply! I was trying something with a custom class built in a fork of fairseq, so for me it would have been better to keep working with fairseq. I will see if I can also put my code into a Hugging Face fork, or train my own ro-en model.

damyana79 avatar Apr 10 '22 11:04 damyana79

Hello. I am getting the same Hydra error when fine-tuning the large (12B) model. I am using 2 × 32 GB GPUs and the 2 GPU pretrained checkpoint with the pipeline parameters set as --pipeline-balance '[29,22,1]' --pipeline-devices '[0,1,0]'. The error I am getting is hydra.errors.ConfigCompositionException: Error merging override distributed_training.ddp_backend='simple'. As there is no --model-overrides option in train, do you have any suggestion for what I should do? Thank you.

abdulrafae avatar Jul 27 '22 13:07 abdulrafae