[NMT] Unable to run `megatron_nmt_training.py`

Open itzsimpl opened this issue 3 years ago • 8 comments

Describe the bug

Trying to run megatron_nmt_training.py with conf/aayn_base_megatron.yml, initialization crashes with the following trace:

Traceback (most recent call last):
  File "examples/nlp/machine_translation/megatron_nmt_training.py", line 95, in <module>
    main()
  File "/workspace/nemo/nemo/core/config/hydra_runner.py", line 104, in wrapper
    _run_hydra(
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
    run_and_report(
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
    lambda: hydra.run(
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
    _ = ret.return_value
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "examples/nlp/machine_translation/megatron_nmt_training.py", line 68, in main
    MTDataPreproc(cfg=cfg.model, trainer=trainer)
  File "/workspace/nemo/nemo/collections/nlp/data/machine_translation/preproc_mt_data.py", line 172, in __init__
    encoder_model_name=cfg.encoder.get('model_name'),
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 357, in __getattr__
    self._format_and_raise(key=key, value=None, cause=e)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/omegaconf/base.py", line 190, in _format_and_raise
    format_and_raise(
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/omegaconf/_utils.py", line 738, in format_and_raise
    _raise(ex, cause)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/omegaconf/_utils.py", line 716, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set end OC_CAUSE=1 for full backtrace
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 351, in __getattr__
    return self._get_impl(key=key, default_value=_DEFAULT_MARKER_)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 438, in _get_impl
    node = self._get_node(key=key, throw_on_missing_key=True)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 465, in _get_node
    self._validate_get(key)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 166, in _validate_get
    self._format_and_raise(
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/omegaconf/base.py", line 190, in _format_and_raise
    format_and_raise(
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/omegaconf/_utils.py", line 818, in format_and_raise
    _raise(ex, cause)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/omegaconf/_utils.py", line 716, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set end OC_CAUSE=1 for full backtrace
omegaconf.errors.ConfigAttributeError: Key 'encoder' is not in struct
    full_key: model.encoder
    object_type=dict

@MaximumEntropy Tracking down the source, the offending line seems to be https://github.com/NVIDIA/NeMo/blob/09be885e3f96ccded893dabe5985ffa22736f111/nemo/collections/nlp/data/machine_translation/preproc_mt_data.py#L172

as the encoder property is defined in all other NMT YAML files, but not in aayn_base_megatron.yml. The same is true for decoder. Hence the following lines will lead to a crash: https://github.com/NVIDIA/NeMo/blob/09be885e3f96ccded893dabe5985ffa22736f111/nemo/collections/nlp/data/machine_translation/preproc_mt_data.py#L172 https://github.com/NVIDIA/NeMo/blob/09be885e3f96ccded893dabe5985ffa22736f111/nemo/collections/nlp/data/machine_translation/preproc_mt_data.py#L227 https://github.com/NVIDIA/NeMo/blob/09be885e3f96ccded893dabe5985ffa22736f111/nemo/collections/nlp/data/machine_translation/preproc_mt_data.py#L177 https://github.com/NVIDIA/NeMo/blob/09be885e3f96ccded893dabe5985ffa22736f111/nemo/collections/nlp/data/machine_translation/preproc_mt_data.py#L232
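
The missing-section failure and a guarded lookup can be sketched with plain dictionaries standing in for the config. This is only an illustration: `get_nested` is a hypothetical helper (not part of NeMo or OmegaConf), the dictionary contents are made up, and a plain dict raises `KeyError` where the struct-mode config above raises `ConfigAttributeError`. The shape of the guard is the same.

```python
from collections.abc import Mapping
from typing import Any


def get_nested(cfg: Mapping, *keys: str, default: Any = None) -> Any:
    """Walk nested mappings, returning `default` as soon as a key is missing."""
    node: Any = cfg
    for key in keys:
        if not isinstance(node, Mapping) or key not in node:
            return default
        node = node[key]
    return node


# Stand-in for aayn_base_megatron.yml: no `encoder` / `decoder` sections.
megatron_cfg = {"train_ds": {"dataset_type": "text_memmap"}}
# Stand-in for an NMT config that defines both sections (values are illustrative).
other_cfg = {"encoder": {"model_name": None}, "decoder": {"model_name": "example-decoder"}}

print(get_nested(megatron_cfg, "encoder", "model_name"))
print(get_nested(other_cfg, "decoder", "model_name"))
```

With a lookup like this, a config without an `encoder` or `decoder` section yields `None` instead of raising.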

Environment details

  • nemo:1.9.0 built on pytorch:22.05-py3 with ./reinstall.sh

itzsimpl avatar Jun 26 '22 08:06 itzsimpl

By changing the previously mentioned lines to something like:

    decoder_model_name=cfg.decoder.get('model_name') if cfg.get('decoder') else None,

the preprocessing step works fine, but then training crashes with the following trace:

Traceback (most recent call last):
  File "examples/nlp/machine_translation/megatron_nmt_training.py", line 95, in <module>
    main()
  File "/workspace/nemo/nemo/core/config/hydra_runner.py", line 104, in wrapper
    _run_hydra(
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
    run_and_report(
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
    lambda: hydra.run(
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
    _ = ret.return_value
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "examples/nlp/machine_translation/megatron_nmt_training.py", line 88, in main
    trainer.fit(model)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 768, in fit
    self._call_and_handle_interrupt(
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1234, in _run
    results = self._run_stage()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1321, in _run_stage
    return self._run_train()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1343, in _run_train
    self._run_sanity_check()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1411, in _run_sanity_check
    val_loop.run()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 153, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 127, in advance
    output = self._evaluation_step(**kwargs)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 222, in _evaluation_step
    output = self.trainer._call_strategy_hook("validation_step", *kwargs.values())
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1763, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 347, in validation_step
    return self.model(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1129, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 989, in forward
    output = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1129, in _call_impl
    return forward_call(*input, **kwargs)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/overrides/base.py", line 93, in forward
    return self.module.validation_step(*inputs, **kwargs)
  File "/workspace/nemo/nemo/collections/nlp/models/machine_translation/megatron_nmt_model.py", line 220, in validation_step
    return self.eval_step(batch, batch_idx, dataloader_idx)
  File "/workspace/nemo/nemo/collections/nlp/models/machine_translation/megatron_nmt_model.py", line 158, in eval_step
    reduced_loss = super().validation_step(batch, batch_idx)
  File "/workspace/nemo/nemo/collections/nlp/models/language_modeling/megatron_lm_encoder_decoder_model.py", line 464, in validation_step
    batch_for_pipeline = self.process_global_batch(batch)
  File "/workspace/nemo/nemo/collections/nlp/models/language_modeling/megatron_lm_encoder_decoder_model.py", line 575, in process_global_batch
    global_batch["text_enc"],
TypeError: list indices must be integers or slices, not str

I was unable to track down the source of it.
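
For what it's worth, the last frame indexes `global_batch` with the string key "text_enc", and the TypeError says the object was a list at that point. A minimal sketch of that mismatch, with made-up batch contents and a hypothetical `describe_batch` helper (the "text_dec" key is an assumption used only for illustration):

```python
def describe_batch(global_batch) -> str:
    """Report whether a batch can be indexed by string keys such as "text_enc"."""
    if isinstance(global_batch, dict):
        return "dict with keys: " + ", ".join(sorted(global_batch))
    return f"{type(global_batch).__name__} of length {len(global_batch)} (no string keys)"


# Illustrative batches only; real batches hold tensors.
dict_batch = {"text_enc": [[1, 2, 3]], "text_dec": [[4, 5]]}
list_batch = [[[1, 2, 3]], [[4, 5]]]

print(describe_batch(dict_batch))
print(describe_batch(list_batch))

try:
    list_batch["text_enc"]
except TypeError as err:
    # Same message as the last line of the trace above.
    print(err)
```

So the dataset in use appears to yield list-style batches, while `process_global_batch` expects a dict keyed by name.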

itzsimpl avatar Jun 26 '22 10:06 itzsimpl

@MaximumEntropy @michalivne Did anyone have the chance to look into this?

itzsimpl avatar Jul 15 '22 10:07 itzsimpl

Apologies, I forgot to hit submit on a draft I had earlier.

Sorry, you're seeing this issue because megatron-based NMT is not properly supported with tarred datasets, even though that is still the default in the YAML config. It only works with dataset_type=text_memmap or dataset_type=bin_memmap. We will change these defaults and remove tarred datasets for Megatron NMT in one of the coming releases. For now, please use either text_memmap or bin_memmap.

python megatron_nmt_training.py -cn aayn_base_megatron
....
model.train_ds.dataset_type=text_memmap \
model.train_ds.src_file_name=train.en \
model.train_ds.tgt_file_name=train.de \
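
Until the defaults change, a launcher script could reject unsupported dataset types before training starts. The following is a minimal sketch under that assumption; `check_dataset_type` is a hypothetical helper, not a NeMo function, and the string "tarred" is used only as an example of an unsupported value.

```python
SUPPORTED_DATASET_TYPES = ("text_memmap", "bin_memmap")


def check_dataset_type(dataset_type: str) -> str:
    """Return `dataset_type` if Megatron NMT can use it, otherwise raise ValueError."""
    if dataset_type not in SUPPORTED_DATASET_TYPES:
        supported = ", ".join(SUPPORTED_DATASET_TYPES)
        raise ValueError(
            f"dataset_type={dataset_type!r} is not supported for Megatron NMT; "
            f"use one of: {supported}"
        )
    return dataset_type


print(check_dataset_type("text_memmap"))

try:
    check_dataset_type("tarred")  # example of an unsupported value
except ValueError as err:
    print(err)
```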

If you want to use binarized memmap (slightly faster, and it doesn't tokenize on the fly), you need to first run preprocess_data_for_megatron.py (https://github.com/NVIDIA/NeMo/blob/main/scripts/nlp_language_modeling/preprocess_data_for_megatron.py).

So

python preprocess_data_for_megatron.py --text_file --input train.en --tokenizer-library sentencepiece --tokenizer-model /path/to/your_tokenizer.model --output_prefix train-en-de.en

python preprocess_data_for_megatron.py --text_file --input train.de --tokenizer-library sentencepiece --tokenizer-model /path/to/your_tokenizer.model --output_prefix train-en-de.de

Then train with

python megatron_nmt_training.py -cn aayn_base_megatron
....
model.train_ds.dataset_type=bin_memmap \
model.train_ds.src_file_name=train-en-de.en_text_document \
model.train_ds.tgt_file_name=train-en-de.de_text_document \
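
The naming convention in the commands above (the output prefix plus a `_text_document` suffix) can be sketched as a small helper that builds the training overrides. `bin_memmap_overrides` is a hypothetical function written for illustration, and it assumes the target-side file is passed as `model.train_ds.tgt_file_name`.

```python
from typing import List


def bin_memmap_overrides(output_prefix: str, src_lang: str, tgt_lang: str) -> List[str]:
    """Build the Hydra overrides for a pair of preprocessed bin_memmap files."""

    def document_name(lang: str) -> str:
        # `--output_prefix train-en-de.en` is referenced as `train-en-de.en_text_document`.
        return f"{output_prefix}.{lang}_text_document"

    return [
        "model.train_ds.dataset_type=bin_memmap",
        f"model.train_ds.src_file_name={document_name(src_lang)}",
        f"model.train_ds.tgt_file_name={document_name(tgt_lang)}",
    ]


for override in bin_memmap_overrides("train-en-de", "en", "de"):
    print(override)
```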

MaximumEntropy avatar Jul 16 '22 03:07 MaximumEntropy

@MaximumEntropy sorry that this is taking so long; I'm waiting for resources to free up so that I can run a test. I have built the bin memmaps, and I just need to see whether training starts as it should. I'll report back as soon as I'm able to run it.

itzsimpl avatar Jul 21 '22 13:07 itzsimpl

@MaximumEntropy using nvidia/pytorch:22.07-py3, nemo:1.11.rc0 at ce16320c8, running on 1 node with 4x A100 40GB, training fails with the following trace:

Traceback (most recent call last):
  File "examples/nlp/machine_translation/megatron_nmt_training.py", line 179, in <module>
    main()
  File "/workspace/nemo/nemo/core/config/hydra_runner.py", line 104, in wrapper
    _run_hydra(
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
    run_and_report(
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
    lambda: hydra.run(
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
    _ = ret.return_value
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "examples/nlp/machine_translation/megatron_nmt_training.py", line 170, in main
    model = MegatronNMTModel(cfg.model, trainer)
  File "/workspace/nemo/nemo/collections/nlp/models/machine_translation/megatron_nmt_model.py", line 90, in __init__
    super().__init__(cfg, trainer=trainer)
  File "/workspace/nemo/nemo/collections/nlp/models/language_modeling/megatron_lm_encoder_decoder_model.py", line 94, in __init__
    self.enc_dec_model = build_model(
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/apex/transformer/pipeline_parallel/schedules/common.py", line 103, in build_model
    model = model_provider_func(*cur_args, **cur_kwargs)
  File "/workspace/nemo/nemo/collections/nlp/models/language_modeling/megatron_lm_encoder_decoder_model.py", line 137, in model_provider_func
    model = MegatronTokenLevelEncoderDecoderModule(
  File "/workspace/nemo/nemo/collections/nlp/modules/common/megatron/token_level_encoder_decoder.py", line 185, in __init__
    encoder = get_encoder_model(
  File "/workspace/nemo/nemo/collections/nlp/modules/common/megatron/megatron_encoders.py", line 99, in get_encoder_model
    encoder = MegatronTransformerEncoderModule(
  File "/workspace/nemo/nemo/collections/nlp/modules/common/megatron/megatron_transformer_encoder.py", line 97, in __init__
    self.model = ParallelTransformer(
  File "/workspace/nemo/nemo/collections/nlp/modules/common/megatron/transformer.py", line 1664, in __init__
    self.layers = torch.nn.ModuleList([build_layer(i + 1 + offset) for i in range(self.num_layers)])
  File "/workspace/nemo/nemo/collections/nlp/modules/common/megatron/transformer.py", line 1664, in <listcomp>
    self.layers = torch.nn.ModuleList([build_layer(i + 1 + offset) for i in range(self.num_layers)])
  File "/workspace/nemo/nemo/collections/nlp/modules/common/megatron/transformer.py", line 1595, in build_layer
    return ParallelTransformerLayer(
  File "/workspace/nemo/nemo/collections/nlp/modules/common/megatron/transformer.py", line 1433, in __init__
    super(ParallelTransformerLayer, self).__init__(**kwargs)
  File "/workspace/nemo/nemo/collections/nlp/modules/common/megatron/transformer.py", line 1097, in __init__
    self.input_layernorm = get_layer_norm(
  File "/workspace/nemo/nemo/collections/nlp/modules/common/megatron/fused_layer_norm.py", line 62, in get_layer_norm
    return MixedFusedLayerNorm(hidden_size, eps, sequence_parallel_enbaled=sequence_parallel)
NameError: name 'MixedFusedLayerNorm' is not defined

FWIW, I'm training with the following command:

python examples/nlp/machine_translation/megatron_nmt_training.py \
  --config-path=conf \
  --config-name=aayn_base_megatron \
  trainer.devices=-1 \
  exp_manager.name=megatron_en-sl \
  +exp_manager.exp_dir=/experiments \
  +exp_manager.version=20220801-1941 \
  model.train_ds.dataset_type=bin_memmap \
  model.train_ds.src_file_name=/data/cjvt/v1.2.6/mmap/train.en_text_document \
  model.train_ds.tgt_file_name=/data/cjvt/v1.2.6/mmap/train.sl_text_document \
  model.train_ds.pin_memory=true \
  model.train_ds.tokens_in_batch=512 \
  model.validation_ds.src_file_name=/data/cjvt/v1.2.6/validation.en \
  model.validation_ds.tgt_file_name=/data/cjvt/v1.2.6/validation.sl \
  model.validation_ds.pin_memory=true \
  model.validation_ds.tokens_in_batch=512 \
  model.test_ds.src_file_name=/data/cjvt/v1.2.6/test.en \
  model.test_ds.tgt_file_name=/data/cjvt/v1.2.6/test.sl \
  model.test_ds.pin_memory=true \
  model.test_ds.tokens_in_batch=512 \
  model.shared_tokenizer=false \
  model.encoder_tokenizer.model=/data/cjvt/v1.2.6/tokenizer/en_tokenizer.64000.BPE.model \
  model.decoder_tokenizer.model=/data/cjvt/v1.2.6/tokenizer/sl_tokenizer.64000.BPE.model \
  model.src_language=en \
  model.tgt_language=sl \
  trainer.precision=bf16

itzsimpl avatar Aug 02 '22 13:08 itzsimpl

Yeah this should happen for all megatron-based models in the 22.07 container. You need to install apex from the commit specified in our README on top of the 22.07 container.

MaximumEntropy avatar Aug 02 '22 16:08 MaximumEntropy

Specifically,

git clone https://github.com/NVIDIA/apex
cd apex
git checkout 3c19f1061879394f28272a99a7ea26d58f72dace
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_layer_norm" ./

MaximumEntropy avatar Aug 02 '22 17:08 MaximumEntropy

There's a bug in apex: https://github.com/NVIDIA/apex/blob/3c19f1061879394f28272a99a7ea26d58f72dace/apex/transformer/__init__.py does not import layers, hence the import at https://github.com/NVIDIA/NeMo/blob/5c8fe3a443ce4bcde67560393882bc5f2c0601ea/nemo/collections/nlp/modules/common/megatron/fused_layer_norm.py#L18 will fail. I've opened a PR on apex with a fix: https://github.com/NVIDIA/apex/pull/1442
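
A NameError at call time, rather than an ImportError at import time, is what one would expect if the import is wrapped in a try/except guard. The sketch below shows that general pattern only; it is not NeMo's actual file. The module path `hypothetical_apex_stub` is made up so that the import is guaranteed to fail, and both `get_layer_norm_*` functions are illustrative.

```python
try:
    # Made-up module path, standing in for an apex import that fails.
    from hypothetical_apex_stub.layers import MixedFusedLayerNorm

    HAVE_FUSED_LAYER_NORM = True
except ImportError:
    # The failure is swallowed here, so `MixedFusedLayerNorm` is never bound.
    HAVE_FUSED_LAYER_NORM = False


def get_layer_norm_unchecked(hidden_size: int):
    # Raises NameError when the guarded import above has failed.
    return MixedFusedLayerNorm(hidden_size)


def get_layer_norm_checked(hidden_size: int):
    # Fails with a message that points at the real cause instead.
    if not HAVE_FUSED_LAYER_NORM:
        raise ImportError("fused layer norm is unavailable; check the apex installation")
    return MixedFusedLayerNorm(hidden_size)


for factory in (get_layer_norm_unchecked, get_layer_norm_checked):
    try:
        factory(512)
    except (NameError, ImportError) as err:
        print(f"{factory.__name__}: {type(err).__name__}: {err}")
```

Checking the flag before use surfaces the missing dependency directly, which is easier to diagnose than the NameError in the trace above.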

itzsimpl avatar Aug 03 '22 00:08 itzsimpl

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] avatar Oct 02 '22 02:10 github-actions[bot]

The apex PR is still open.

itzsimpl avatar Oct 02 '22 09:10 itzsimpl

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Nov 02 '22 02:11 github-actions[bot]

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] avatar Nov 10 '22 02:11 github-actions[bot]