NeMo
Megatron: NameError: name 'ensure_divisibility' is not defined
Describe the bug
I am trying to run megatron_gpt_pretraining.py. During model initialization a NameError is raised. Traceback below:
Traceback (most recent call last):
  File "/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py", line 92, in <module>
    main()
  File "/NeMo/nemo/core/config/hydra_runner.py", line 104, in wrapper
    _run_hydra(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
    run_and_report(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
    _ = ret.return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py", line 86, in main
    model = MegatronGPTModel(cfg.model, trainer)
  File "/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 79, in __init__
    super().__init__(cfg, trainer=trainer, no_lm_init=True)
  File "/NeMo/nemo/collections/nlp/models/language_modeling/megatron_base_model.py", line 73, in __init__
    initialize_model_parallel_for_nemo(
  File "/NeMo/nemo/collections/nlp/modules/common/megatron/megatron_init.py", line 68, in initialize_model_parallel_for_nemo
    ) = fake_initialize_model_parallel(
  File "/NeMo/nemo/collections/nlp/modules/common/megatron/megatron_init.py", line 181, in fake_initialize_model_parallel
    ensure_divisibility(world_size, tensor_model_parallel_size * pipeline_model_parallel_size)
NameError: name 'ensure_divisibility' is not defined
Expected behavior
Manually importing ensure_divisibility from apex.transformer.utils works fine. Running the code should properly import it as well.
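One possible explanation, which this thread does not confirm, is that the import of ensure_divisibility sits inside a try/except block together with other imports, so that a failure of an earlier import silently leaves the later name unbound. The sketch below reproduces that mechanism with standard-library modules only. The names hypothetical_missing_module_xyz, HAVE_DEPS, call_unbound and require_deps are made up for illustration and are not NeMo or apex code.

```python
# Illustration only: a grouped, guarded import in which one failure
# prevents every later import in the same block from running.
try:
    import math                             # succeeds
    import hypothetical_missing_module_xyz  # hypothetical name, raises ModuleNotFoundError
    from json import dumps                  # never executed, so 'dumps' stays unbound
    HAVE_DEPS = True
except ImportError:
    # The error is swallowed here; nothing tells the user what went wrong.
    HAVE_DEPS = False


def call_unbound():
    """Use the name that was never imported; raises NameError at call time."""
    return dumps({})


def require_deps():
    """Fail early with a readable message instead of a later NameError."""
    if not HAVE_DEPS:
        raise ImportError("optional dependencies failed to import; check the installation")


if __name__ == "__main__":
    try:
        call_unbound()
    except NameError as err:
        print(f"NameError: {err}")
```

If this is indeed what happens, the manual import of ensure_divisibility would still succeed on its own, because the failure would come from a different import in the same guarded block.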
Environment overview (please complete the following information)
Singularity image based on the nvcr.io/nvidia/pytorch:22.06-py3 Docker image. NeMo was installed with:
git clone https://github.com/NVIDIA/NeMo
cd NeMo
python -m pip install -r requirements/requirements.txt \
-r requirements/requirements_common.txt \
-r requirements/requirements_lightning.txt \
-r requirements/requirements_nlp.txt \
-r requirements/requirements_test.txt
./reinstall.sh
Update: there is no issue when using the 22.05 image AND NeMo release v1.10.0. I have not yet checked which of the two changes is responsible for the issue.
So the 22.05 container with r1.10.0 works, but 22.06 with the current main branch does not?
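To help narrow down which of the two changes is responsible, a small probe can be run inside each container to see which modules import cleanly. This is a minimal sketch: the module list is an assumption based on the apex.transformer.utils path mentioned above, and probe_imports is a hypothetical helper, not part of NeMo or apex.

```python
import importlib

# Assumed candidates, taken from the module path named in this issue.
CANDIDATES = ["apex", "apex.transformer", "apex.transformer.utils"]


def probe_imports(names):
    """Try to import each module and record 'ok' or the error that was raised."""
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = "ok"
        except Exception as exc:  # broad on purpose: imports can fail in many ways
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


if __name__ == "__main__":
    for name, status in probe_imports(CANDIDATES).items():
        print(f"{name}: {status}")
```

Running it in both the 22.05 and 22.06 images, and comparing the output, would show whether the difference lies in the container or in the NeMo branch.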
This issue is stale because it has been open for 60 days with no activity.
This issue was closed because it has been inactive for 7 days since being marked as stale.