Megatron training adaptation issue: CUDA_HOME is obtained as None
🐛 Describe the bug
While adapting Megatron training, CUDA_HOME from torch.utils.cpp_extension is obtained as None, so loading the fused kernels fails.
# When running Megatron training, loading the fused kernels fails; the core code is as follows
import subprocess
from torch.utils import cpp_extension

# cuda_dir is derived from cpp_extension.CUDA_HOME, but its value is None
cuda_dir = cpp_extension.CUDA_HOME
raw_output = subprocess.check_output(
    [cuda_dir + "/bin/nvcc", "-V"], universal_newlines=True
)
Versions
# The failing code is in the torch.utils.cpp_extension module; the way it obtains CUDA_HOME is incorrect
# Because the ROCM_HOME variable is never None on a ROCm build, the IS_HIP_EXTENSION variable is always True
IS_HIP_EXTENSION = True if ((ROCM_HOME is not None) and (torch.version.hip is not None)) else False
# As a result, the CUDA_HOME variable always remains None
CUDA_HOME = (
    _find_cuda_home() if ((not IS_HIP_EXTENSION) and (torch.cuda._is_compiled())) else None
)
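A quick way to confirm these values locally is the diagnostic sketch below (the commented values are what one would typically expect on a standard ROCm install, not verified output from this system):

import torch
from torch.utils import cpp_extension

print("torch.version.hip:", torch.version.hip)                  # set on ROCm builds, None on CUDA builds
print("ROCM_HOME:", cpp_extension.ROCM_HOME)                    # typically /opt/rocm
print("CUDA_HOME:", cpp_extension.CUDA_HOME)                    # None on ROCm builds
print("IS_HIP_EXTENSION:", cpp_extension.IS_HIP_EXTENSION)      # True on ROCm builds
print("torch.cuda._is_compiled():", torch.cuda._is_compiled())  # False on ROCm builds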
Hi @JoJo-Lorray. An internal ticket has been created to investigate your issue. Thanks!
Hi @JoJo-Lorray, the latest PyTorch logic is now correct (https://github.com/ROCm/pytorch/blob/0a6e1d6b9bf78d690a812e4334939e7701bfa794/torch/utils/cpp_extension.py#L243C1-L244C1): it uses torch.cuda._is_compiled() to set CUDA_HOME and no longer incorrectly relies on IS_HIP_EXTENSION.
On ROCm, torch.cuda._is_compiled() returns False, so CUDA_HOME = None (expected behavior).
The crash you're seeing is in Megatron, which assumes CUDA_HOME is always valid and tries to run:
subprocess.check_output([CUDA_HOME + "/bin/nvcc", "-V"])
...without checking whether CUDA_HOME is None, which raises a TypeError on ROCm systems.
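For illustration, the failure reduces to concatenating None with a string (a minimal hypothetical repro, not Megatron's actual code path):

CUDA_HOME = None
CUDA_HOME + "/bin/nvcc"  # TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'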
As a workaround, guard the check in your training script / Megatron code:
import subprocess
from torch.utils.cpp_extension import CUDA_HOME

if CUDA_HOME is not None:
    subprocess.check_output([CUDA_HOME + "/bin/nvcc", "-V"], universal_newlines=True)
else:
    print("Skipping CUDA logic on ROCm")