Megatron-LM
[BUG] arguments of get_cpu_offload_context() in transformer_engine.py for different versions of Transformer Engine
Describe the bug
When I use nvcr.io/nvidia/pytorch:24.07 to run run_simple_mcore_train_loop.py at commit 094d66b (currently the newest), get_cpu_offload_context() in megatron/core/transformer/custom_layers/transformer_engine.py appears to be broken for this version of transformer-engine. The transformer-engine version shipped in nvcr.io/nvidia/pytorch:24.07 is 1.8.0+37280ec.
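For reference, the mismatch can be confirmed inside the container by printing the signature that this transformer-engine build actually exposes. This is a minimal check; the import path follows the module layout Megatron-LM's wrapper assumes, so treat it as an assumption for other TE builds:

```python
# Print the version and signature of TE's get_cpu_offload_context()
# to confirm how many positional arguments this build accepts.
import inspect

import transformer_engine as te
from transformer_engine.pytorch.cpu_offload import get_cpu_offload_context

print("transformer-engine:", getattr(te, "__version__", "unknown"))
print("signature:", inspect.signature(get_cpu_offload_context))
# On TE 1.8.0 this shows a 4-parameter signature, while the Megatron-LM wrapper
# at commit 094d66b passes 5 positional arguments (see the stack trace below).
```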
To Reproduce
PYTHONPATH=$PYTHON_PATH:./megatron torchrun --nproc-per-node 2 examples/run_simple_mcore_train_loop.py
Stack trace/logs
[rank1]: Traceback (most recent call last):
[rank1]: File "/workspace/megatron/examples/run_simple_mcore_train_loop.py", line 121, in <module>
[rank1]: gpt_model = model_provider()
[rank1]: File "/workspace/megatron/examples/run_simple_mcore_train_loop.py", line 47, in model_provider
[rank1]: gpt_model = GPTModel(
[rank1]: File "/workspace/megatron/megatron/core/models/gpt/gpt_model.py", line 101, in __init__
[rank1]: self.decoder = TransformerBlock(
[rank1]: File "/workspace/megatron/megatron/core/transformer/transformer_block.py", line 148, in __init__
[rank1]: get_cpu_offload_context(
[rank1]: File "/workspace/megatron/megatron/core/transformer/custom_layers/transformer_engine.py", line 898, in get_cpu_offload_context
[rank1]: context, sync_func = _get_cpu_offload_context(
[rank1]: **TypeError: get_cpu_offload_context() takes from 0 to 4 positional arguments but 5 were given**
Environment (please complete the following information):
- Container: nvcr.io/nvidia/pytorch:24.07
- transformer-engine: 1.8.0+37280ec
Proposed fix
The problem is in get_cpu_offload_context() in megatron/core/transformer/custom_layers/transformer_engine.py: https://github.com/NVIDIA/Megatron-LM/blob/094d66b488514beaac2106c3e0f9581d27ea9533/megatron/core/transformer/custom_layers/transformer_engine.py#L890-L904. The wrapper passes five positional arguments, but get_cpu_offload_context() in transformer-engine 1.8.0 accepts at most four, so the call needs to account for the installed transformer-engine version. A fix is proposed in PR https://github.com/NVIDIA/Megatron-LM/pull/996.
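For illustration, here is a minimal sketch of the kind of version guard the call needs. The parameter names, the "1.10.0.dev0" cutoff, and the version lookup are assumptions for this sketch, not the exact code from PR 996:

```python
# Sketch: gate the extra `model_layers` argument on the installed TE version.
# The cutoff "1.10.0.dev0" and the parameter names are assumptions for illustration.
import transformer_engine as te
from packaging.version import Version
from transformer_engine.pytorch.cpu_offload import (
    get_cpu_offload_context as _get_cpu_offload_context,
)


def _te_version() -> Version:
    # Prefer te.__version__, fall back to package metadata if it is missing.
    if hasattr(te, "__version__"):
        return Version(str(te.__version__))
    from importlib.metadata import version

    return Version(version("transformer-engine"))


def get_cpu_offload_context(
    enabled, num_layers, model_layers, activation_offloading, weight_offloading
):
    if _te_version() >= Version("1.10.0.dev0"):
        # Newer TE builds accept the extra model_layers argument.
        context, sync_func = _get_cpu_offload_context(
            enabled, num_layers, model_layers, activation_offloading, weight_offloading
        )
    else:
        # TE 1.8.0 takes at most four positional arguments, so drop model_layers.
        context, sync_func = _get_cpu_offload_context(
            enabled, num_layers, activation_offloading, weight_offloading
        )
    return context, sync_func
```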