NeMo
[NMT] Unable to run `megatron_nmt_training.py`
Describe the bug
Trying to run megatron_nmt_training.py with conf/aayn_base_megatron.yml, initialization crashes with the following trace:
Traceback (most recent call last):
File "examples/nlp/machine_translation/megatron_nmt_training.py", line 95, in <module>
main()
File "/workspace/nemo/nemo/core/config/hydra_runner.py", line 104, in wrapper
_run_hydra(
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
run_and_report(
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
raise ex
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
return func()
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
lambda: hydra.run(
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
_ = ret.return_value
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
raise self._return_value
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
ret.return_value = task_function(task_cfg)
File "examples/nlp/machine_translation/megatron_nmt_training.py", line 68, in main
MTDataPreproc(cfg=cfg.model, trainer=trainer)
File "/workspace/nemo/nemo/collections/nlp/data/machine_translation/preproc_mt_data.py", line 172, in __init__
encoder_model_name=cfg.encoder.get('model_name'),
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 357, in __getattr__
self._format_and_raise(key=key, value=None, cause=e)
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/omegaconf/base.py", line 190, in _format_and_raise
format_and_raise(
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/omegaconf/_utils.py", line 738, in format_and_raise
_raise(ex, cause)
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/omegaconf/_utils.py", line 716, in _raise
raise ex.with_traceback(sys.exc_info()[2]) # set end OC_CAUSE=1 for full backtrace
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 351, in __getattr__
return self._get_impl(key=key, default_value=_DEFAULT_MARKER_)
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 438, in _get_impl
node = self._get_node(key=key, throw_on_missing_key=True)
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 465, in _get_node
self._validate_get(key)
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 166, in _validate_get
self._format_and_raise(
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/omegaconf/base.py", line 190, in _format_and_raise
format_and_raise(
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/omegaconf/_utils.py", line 818, in format_and_raise
_raise(ex, cause)
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/omegaconf/_utils.py", line 716, in _raise
raise ex.with_traceback(sys.exc_info()[2]) # set end OC_CAUSE=1 for full backtrace
omegaconf.errors.ConfigAttributeError: Key 'encoder' is not in struct
full_key: model.encoder
object_type=dict
@MaximumEntropy Tracking down the source, the offending line seems to be https://github.com/NVIDIA/NeMo/blob/09be885e3f96ccded893dabe5985ffa22736f111/nemo/collections/nlp/data/machine_translation/preproc_mt_data.py#L172
as the encoder property is defined in all other NMT YAML configs, but not in aayn_base_megatron.yml. The same is true for decoder. Hence the following lines will lead to a crash (a standalone reproduction follows the links below):
https://github.com/NVIDIA/NeMo/blob/09be885e3f96ccded893dabe5985ffa22736f111/nemo/collections/nlp/data/machine_translation/preproc_mt_data.py#L172
https://github.com/NVIDIA/NeMo/blob/09be885e3f96ccded893dabe5985ffa22736f111/nemo/collections/nlp/data/machine_translation/preproc_mt_data.py#L227
https://github.com/NVIDIA/NeMo/blob/09be885e3f96ccded893dabe5985ffa22736f111/nemo/collections/nlp/data/machine_translation/preproc_mt_data.py#L177
https://github.com/NVIDIA/NeMo/blob/09be885e3f96ccded893dabe5985ffa22736f111/nemo/collections/nlp/data/machine_translation/preproc_mt_data.py#L232
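For reference, the failure reproduces outside NeMo with just omegaconf; the config below is a made-up stand-in for aayn_base_megatron.yml in which only the missing encoder/decoder sections matter:

from omegaconf import OmegaConf
from omegaconf.errors import ConfigAttributeError

# Stand-in for the loaded config: no encoder/decoder sections under model
# (the train_ds contents are placeholders).
cfg = OmegaConf.create({"model": {"train_ds": {"dataset_type": "tarred"}}})
OmegaConf.set_struct(cfg, True)  # struct mode, as implied by the "not in struct" error
model_cfg = cfg.model

# The failing pattern from preproc_mt_data.py:
try:
    model_cfg.encoder.get("model_name")
except ConfigAttributeError as err:
    print(err)  # Key 'encoder' is not in struct

# Guarded access; the explicit default keeps it safe across omegaconf versions:
encoder_model_name = model_cfg.encoder.get("model_name") if model_cfg.get("encoder", None) else None
print(encoder_model_name)  # None

With the guard, the call site degrades to None instead of raising, which is what the workaround further below does.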
Environment details
nemo:1.9.0 built on pytorch:22.05-py3 with ./reinstall.sh
By changing the previously mentioned lines to something like:
decoder_model_name=cfg.decoder.get('model_name') if cfg.get('decoder') else None,
the preprocessing step works fine, but then training crashes with the following trace:
Traceback (most recent call last):
File "examples/nlp/machine_translation/megatron_nmt_training.py", line 95, in <module>
main()
File "/workspace/nemo/nemo/core/config/hydra_runner.py", line 104, in wrapper
_run_hydra(
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
run_and_report(
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
raise ex
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
return func()
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
lambda: hydra.run(
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
_ = ret.return_value
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
raise self._return_value
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
ret.return_value = task_function(task_cfg)
File "examples/nlp/machine_translation/megatron_nmt_training.py", line 88, in main
trainer.fit(model)
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 768, in fit
self._call_and_handle_interrupt(
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1234, in _run
results = self._run_stage()
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1321, in _run_stage
return self._run_train()
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1343, in _run_train
self._run_sanity_check()
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1411, in _run_sanity_check
val_loop.run()
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 153, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 127, in advance
output = self._evaluation_step(**kwargs)
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 222, in _evaluation_step
output = self.trainer._call_strategy_hook("validation_step", *kwargs.values())
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1763, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 347, in validation_step
return self.model(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1129, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 989, in forward
output = self.module(*inputs, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1129, in _call_impl
return forward_call(*input, **kwargs)
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/overrides/base.py", line 93, in forward
return self.module.validation_step(*inputs, **kwargs)
File "/workspace/nemo/nemo/collections/nlp/models/machine_translation/megatron_nmt_model.py", line 220, in validation_step
return self.eval_step(batch, batch_idx, dataloader_idx)
File "/workspace/nemo/nemo/collections/nlp/models/machine_translation/megatron_nmt_model.py", line 158, in eval_step
reduced_loss = super().validation_step(batch, batch_idx)
File "/workspace/nemo/nemo/collections/nlp/models/language_modeling/megatron_lm_encoder_decoder_model.py", line 464, in validation_step
batch_for_pipeline = self.process_global_batch(batch)
File "/workspace/nemo/nemo/collections/nlp/models/language_modeling/megatron_lm_encoder_decoder_model.py", line 575, in process_global_batch
global_batch["text_enc"],
TypeError: list indices must be integers or slices, not str
I was unable to track down the source of it.
@MaximumEntropy @michalivne Did anyone have a chance to look into this?
Apologies, I forgot to hit submit on a draft I had earlier.
Sorry, you're seeing this issue because Megatron-based NMT is not supported properly with tarred datasets, even though tarred is still the default in the YAML config. It only works with dataset_type=text_memmap or dataset_type=bin_memmap. We will change these defaults and remove tarred datasets for Megatron NMT in one of the coming releases. For now, please use either text_memmap or bin_memmap.
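For context on the earlier TypeError: process_global_batch indexes the batch like a dict (global_batch["text_enc"]), while the tarred-dataset path apparently hands it a plain list, so the string index fails. A minimal illustration, not NeMo code (field names other than text_enc are placeholders):

# Dict-style batch, as the memmap dataset paths presumably provide:
dict_batch = {"text_enc": [[3, 7, 2]], "text_dec": [[3, 9, 2]]}
print(dict_batch["text_enc"])  # works

# List-style batch, as seen on the tarred-dataset path:
list_batch = [[[3, 7, 2]], [[3, 9, 2]]]
try:
    list_batch["text_enc"]
except TypeError as err:
    print(err)  # list indices must be integers or slices, not str

With text_memmap, for example: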
python megatron_nmt_training.py -cn aayn_base_megatron
....
model.train_ds.dataset_type=text_memmap \
model.train_ds.src_file_name=train.en \
model.train_ds.tgt_file_name=train.de \
If you want to use binarized memmap (slightly faster and doesn't tokenize on-the-fly), you need to first run preprocess_data_for_megatron.py (https://github.com/NVIDIA/NeMo/blob/main/scripts/nlp_language_modeling/preprocess_data_for_megatron.py)
So
python preprocess_data_for_megatron.py --text_file --input train.en --tokenizer-library sentencepiece --tokenizer-model /path/to/your_tokenizer.model --output_prefix train-en-de.en
python preprocess_data_for_megatron.py --text_file --input train.de --tokenizer-library sentencepiece --tokenizer-model /path/to/your_tokenizer.model --output_prefix train-en-de.de
Then train with
python megatron_nmt_training.py -cn aayn_base_megatron
....
model.train_ds.dataset_type=bin_memmap \
model.train_ds.src_file_name=train-en-de.en_text_document \
model.train_ds.tgt_file_name=train-en-de.de_text_document \
@MaximumEntropy Sorry if this is taking long; I'm waiting for compute resources to free up so I can run a test. I have built the bin memmaps, I just need to see whether training starts as it should. I'll report back as soon as I'm able to run.
@MaximumEntropy Using nvidia/pytorch:22.07-py3 and nemo:1.11.rc0 at ce16320c8, running on 1 node with 4x A100 40GB, training fails with the following trace:
Traceback (most recent call last):
File "examples/nlp/machine_translation/megatron_nmt_training.py", line 179, in <module>
main()
File "/workspace/nemo/nemo/core/config/hydra_runner.py", line 104, in wrapper
_run_hydra(
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
run_and_report(
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
raise ex
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
return func()
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
lambda: hydra.run(
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
_ = ret.return_value
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
raise self._return_value
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
ret.return_value = task_function(task_cfg)
File "examples/nlp/machine_translation/megatron_nmt_training.py", line 170, in main
model = MegatronNMTModel(cfg.model, trainer)
File "/workspace/nemo/nemo/collections/nlp/models/machine_translation/megatron_nmt_model.py", line 90, in __init__
super().__init__(cfg, trainer=trainer)
File "/workspace/nemo/nemo/collections/nlp/models/language_modeling/megatron_lm_encoder_decoder_model.py", line 94, in __init__
self.enc_dec_model = build_model(
File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/apex/transformer/pipeline_parallel/schedules/common.py", line 103, in build_model
model = model_provider_func(*cur_args, **cur_kwargs)
File "/workspace/nemo/nemo/collections/nlp/models/language_modeling/megatron_lm_encoder_decoder_model.py", line 137, in model_provider_func
model = MegatronTokenLevelEncoderDecoderModule(
File "/workspace/nemo/nemo/collections/nlp/modules/common/megatron/token_level_encoder_decoder.py", line 185, in __init__
encoder = get_encoder_model(
File "/workspace/nemo/nemo/collections/nlp/modules/common/megatron/megatron_encoders.py", line 99, in get_encoder_model
encoder = MegatronTransformerEncoderModule(
File "/workspace/nemo/nemo/collections/nlp/modules/common/megatron/megatron_transformer_encoder.py", line 97, in __init__
self.model = ParallelTransformer(
File "/workspace/nemo/nemo/collections/nlp/modules/common/megatron/transformer.py", line 1664, in __init__
self.layers = torch.nn.ModuleList([build_layer(i + 1 + offset) for i in range(self.num_layers)])
File "/workspace/nemo/nemo/collections/nlp/modules/common/megatron/transformer.py", line 1664, in <listcomp>
self.layers = torch.nn.ModuleList([build_layer(i + 1 + offset) for i in range(self.num_layers)])
File "/workspace/nemo/nemo/collections/nlp/modules/common/megatron/transformer.py", line 1595, in build_layer
return ParallelTransformerLayer(
File "/workspace/nemo/nemo/collections/nlp/modules/common/megatron/transformer.py", line 1433, in __init__
super(ParallelTransformerLayer, self).__init__(**kwargs)
File "/workspace/nemo/nemo/collections/nlp/modules/common/megatron/transformer.py", line 1097, in __init__
self.input_layernorm = get_layer_norm(
File "/workspace/nemo/nemo/collections/nlp/modules/common/megatron/fused_layer_norm.py", line 62, in get_layer_norm
return MixedFusedLayerNorm(hidden_size, eps, sequence_parallel_enbaled=sequence_parallel)
NameError: name 'MixedFusedLayerNorm' is not defined
FWIW, I'm training with the following command:
python examples/nlp/machine_translation/megatron_nmt_training.py \
--config-path=conf \
--config-name=aayn_base_megatron \
trainer.devices=-1 \
exp_manager.name=megatron_en-sl \
+exp_manager.exp_dir=/experiments \
+exp_manager.version=20220801-1941 \
model.train_ds.dataset_type=bin_memmap \
model.train_ds.src_file_name=/data/cjvt/v1.2.6/mmap/train.en_text_document \
model.train_ds.tgt_file_name=/data/cjvt/v1.2.6/mmap/train.sl_text_document \
model.train_ds.pin_memory=true \
model.train_ds.tokens_in_batch=512 \
model.validation_ds.src_file_name=/data/cjvt/v1.2.6/validation.en \
model.validation_ds.tgt_file_name=/data/cjvt/v1.2.6/validation.sl \
model.validation_ds.pin_memory=true \
model.validation_ds.tokens_in_batch=512 \
model.test_ds.src_file_name=/data/cjvt/v1.2.6/test.en \
model.test_ds.tgt_file_name=/data/cjvt/v1.2.6/test.sl \
model.test_ds.pin_memory=true \
model.test_ds.tokens_in_batch=512 \
model.shared_tokenizer=false \
model.encoder_tokenizer.model=/data/cjvt/v1.2.6/tokenizer/en_tokenizer.64000.BPE.model \
model.decoder_tokenizer.model=/data/cjvt/v1.2.6/tokenizer/sl_tokenizer.64000.BPE.model \
model.src_language=en \
model.tgt_language=sl \
trainer.precision=bf16
Yeah, this will happen for all Megatron-based models in the 22.07 container. You need to install apex at the commit specified in our README on top of the 22.07 container.
Specifically,
git clone https://github.com/NVIDIA/apex
cd apex
git checkout 3c19f1061879394f28272a99a7ea26d58f72dace
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_layer_norm" ./
There's a bug in apex: https://github.com/NVIDIA/apex/blob/3c19f1061879394f28272a99a7ea26d58f72dace/apex/transformer/__init__.py does not import layers, hence https://github.com/NVIDIA/NeMo/blob/5c8fe3a443ce4bcde67560393882bc5f2c0601ea/nemo/collections/nlp/modules/common/megatron/fused_layer_norm.py#L18 will fail. I've opened a PR on apex, https://github.com/NVIDIA/apex/pull/1442, with a fix.
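For what it's worth, the failure surfaces as a NameError deep inside model construction (rather than an ImportError at startup) presumably because of the guarded-import pattern used around the optional apex pieces. A simplified sketch with placeholder names, not the actual NeMo or apex code:

# Illustrative only; the module name is a placeholder for the apex import that
# fails when the layers subpackage is not importable.
try:
    from some_optional_backend import MixedFusedLayerNorm  # hypothetical module
    HAVE_FUSED_LN = True
except (ImportError, ModuleNotFoundError):
    HAVE_FUSED_LN = False

def get_layer_norm(hidden_size: int, eps: float = 1e-5):
    # Failing loudly here would give a clear message instead of the bare
    # "NameError: name 'MixedFusedLayerNorm' is not defined" seen in the trace.
    if not HAVE_FUSED_LN:
        raise ImportError(
            "MixedFusedLayerNorm is unavailable; rebuild apex with --fast_layer_norm "
            "at the commit pinned in the NeMo README."
        )
    return MixedFusedLayerNorm(hidden_size, eps)

Until the apex PR is merged, patching apex/transformer/__init__.py locally to import layers should avoid the NameError.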
This issue is stale because it has been open for 60 days with no activity.
The apex PR is still open.
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.