Megatron: NameError: name 'ensure_divisibility' is not defined

Open MatejUlcar opened this issue 3 years ago • 1 comments

Describe the bug

I am trying to run megatron_gpt_pretraining.py. During model initialization, a NameError is raised. Traceback below:

Traceback (most recent call last):
  File "/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py", line 92, in <module>
    main()
  File "/NeMo/nemo/core/config/hydra_runner.py", line 104, in wrapper
    _run_hydra(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
    run_and_report(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
    _ = ret.return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py", line 86, in main
    model = MegatronGPTModel(cfg.model, trainer)
  File "/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 79, in __init__
    super().__init__(cfg, trainer=trainer, no_lm_init=True)
  File "/NeMo/nemo/collections/nlp/models/language_modeling/megatron_base_model.py", line 73, in __init__
    initialize_model_parallel_for_nemo(
  File "/NeMo/nemo/collections/nlp/modules/common/megatron/megatron_init.py", line 68, in initialize_model_parallel_for_nemo
    ) = fake_initialize_model_parallel(
  File "/NeMo/nemo/collections/nlp/modules/common/megatron/megatron_init.py", line 181, in fake_initialize_model_parallel
    ensure_divisibility(world_size, tensor_model_parallel_size * pipeline_model_parallel_size)
NameError: name 'ensure_divisibility' is not defined

Expected behavior

Manually importing ensure_divisibility from apex.transformer.utils works fine. Running the code should properly import it as well.
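The workaround above can be sketched as a guarded import. This is a minimal sketch, not NeMo's actual code: the import path `apex.transformer.utils` is the one reported as working in this issue, while the local fallback function and the example sizes are assumptions added for illustration. One possible explanation for the NameError (not confirmed in this thread) is that an apex import fails and is caught silently, so the name is never bound.

```python
# Sketch: make sure `ensure_divisibility` is always bound before it is called.
try:
    # Import path reported as working in this issue.
    from apex.transformer.utils import ensure_divisibility
except ImportError:
    # Hypothetical local stand-in, used only when apex cannot be imported.
    def ensure_divisibility(numerator, denominator):
        """Raise AssertionError if numerator is not a multiple of denominator."""
        assert numerator % denominator == 0, (
            f"{numerator} is not divisible by {denominator}"
        )

# Example values (assumptions), mirroring the call at the bottom of the traceback.
world_size = 8
tensor_model_parallel_size = 2
pipeline_model_parallel_size = 2

model_parallel_size = tensor_model_parallel_size * pipeline_model_parallel_size
ensure_divisibility(world_size, model_parallel_size)

# Remaining ranks per model-parallel group.
data_parallel_size = world_size // model_parallel_size
```

With this pattern a failed apex import no longer leaves the name undefined, although a missing apex installation would still need to be fixed for real Megatron training.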

Environment overview (please complete the following information)

Singularity image based on the nvcr.io/nvidia/pytorch:22.06-py3 Docker image, with NeMo installed as follows:

git clone https://github.com/NVIDIA/NeMo
cd NeMo
python -m pip install -r requirements/requirements.txt \
            -r requirements/requirements_common.txt \
            -r requirements/requirements_lightning.txt \
            -r requirements/requirements_nlp.txt \
            -r requirements/requirements_test.txt
./reinstall.sh

Update: there is no issue when using the 22.05 image AND NeMo release v1.10.0. I have not yet checked which of the two changes is responsible for the issue.
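To narrow down which of the two changes matters, one option is to check in each container whether the relevant apex modules import cleanly. The following is a rough diagnostic sketch, not part of NeMo: `probe_imports` is a hypothetical helper, and the module list is an assumption based on the import path mentioned in this issue.

```python
import importlib


def probe_imports(module_names):
    """Return a dict mapping each module name to None (import ok) or an error string."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as exc:  # report any failure, not only ImportError
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


# Module names are assumptions taken from the import path in this issue.
for name, error in probe_imports(
    ["apex", "apex.transformer", "apex.transformer.utils"]
).items():
    print(f"{name}: {'ok' if error is None else error}")
```

Running this in both the 22.05 and 22.06 containers would show whether the apex import itself behaves differently between the two images.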

MatejUlcar, Jul 08 '22 11:07

So, the 22.05 container with r1.10.0 works, but the 22.06 container with the current main branch does not?

MaximumEntropy, Jul 16 '22 03:07

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot], Sep 28 '22 02:09

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot], Oct 07 '22 02:10