
Megatron training adaptation issue: CUDA_HOME is obtained as None

Open · JoJo-Lorray opened this issue 10 months ago · 1 comment

🐛 Describe the bug

Megatron training adaptation issue: CUDA_HOME is obtained as None

# While running Megatron training, loading the fused kernels raises an error; the core code is as follows
import subprocess

from torch.utils import cpp_extension

cuda_dir = cpp_extension.CUDA_HOME  # on this system the value is None
raw_output = subprocess.check_output(
        # cuda_dir is derived from cpp_extension.CUDA_HOME, so building the
        # command path fails when it is None
        [cuda_dir + "/bin/nvcc", "-V"], universal_newlines=True
    )
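A minimal guard for that call could look like the sketch below (the helper name is hypothetical, not part of Megatron): it returns None instead of crashing when no CUDA toolkit path is available.

```python
import subprocess


def nvcc_version(cuda_dir):
    """Return the output of `nvcc -V`, or None when cuda_dir is None.

    Hypothetical helper: guards the call shown above so a None
    CUDA_HOME (e.g. on a ROCm build) does not raise a TypeError.
    """
    if cuda_dir is None:
        return None
    return subprocess.check_output(
        [cuda_dir + "/bin/nvcc", "-V"], universal_newlines=True
    )


# On a system where CUDA_HOME is None, this degrades gracefully:
print(nvcc_version(None))  # -> None
```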

Versions

# The error originates in the torch.utils.cpp_extension module: the way CUDA_HOME is obtained is incorrect
# Because the ROCM_HOME variable is never None on this system, the IS_HIP_EXTENSION variable is always True
IS_HIP_EXTENSION = True if ((ROCM_HOME is not None) and (torch.version.hip is not None)) else False
# As a result, the CUDA_HOME variable always remains None
CUDA_HOME = (
    _find_cuda_home() if ((not IS_HIP_EXTENSION) and (torch.cuda._is_compiled())) else None
)
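The selection logic quoted above can be modeled as a plain function to show why a set ROCM_HOME plus a HIP build forces CUDA_HOME to None. This is a simplified sketch with hypothetical parameter names; the real module reads module-level globals instead of arguments.

```python
def resolve_cuda_home(rocm_home, hip_version, cuda_compiled,
                      find_cuda_home=lambda: "/usr/local/cuda"):
    # Mirrors the old torch.utils.cpp_extension selection logic quoted above
    is_hip_extension = (rocm_home is not None) and (hip_version is not None)
    if (not is_hip_extension) and cuda_compiled:
        return find_cuda_home()
    return None


# ROCm build: ROCM_HOME and torch.version.hip are both set -> CUDA_HOME is None
assert resolve_cuda_home("/opt/rocm", "6.0.0", cuda_compiled=False) is None
# CUDA build: no HIP, compiled with CUDA -> CUDA_HOME is found
assert resolve_cuda_home(None, None, cuda_compiled=True) == "/usr/local/cuda"
```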

JoJo-Lorray avatar Feb 27 '25 07:02 JoJo-Lorray

Hi @JoJo-Lorray. Internal ticket has been created to investigate your issue. Thanks!

ppanchad-amd avatar Feb 27 '25 16:02 ppanchad-amd

Hi @JoJo-Lorray, the latest PyTorch logic is now correct (https://github.com/ROCm/pytorch/blob/0a6e1d6b9bf78d690a812e4334939e7701bfa794/torch/utils/cpp_extension.py#L243C1-L244C1): it uses torch.cuda._is_compiled() to set CUDA_HOME, and no longer incorrectly relies on IS_HIP_EXTENSION.

On ROCm, torch.cuda._is_compiled() returns False, so CUDA_HOME = None (expected behavior).

The crash you're seeing is in Megatron, which assumes CUDA_HOME is always valid and tries to run:

subprocess.check_output([CUDA_HOME + "/bin/nvcc", "-V"])

...without checking whether CUDA_HOME is None. This causes a TypeError on ROCm systems.

In your training script/Megatron, guard the call:

from torch.utils.cpp_extension import CUDA_HOME

if CUDA_HOME is not None:
    subprocess.check_output([CUDA_HOME + "/bin/nvcc", "-V"])
else:
    print("Skipping CUDA logic on ROCm")
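On a ROCm system you may want to query the HIP compiler instead of skipping entirely. The sketch below is one way to dispatch; the helper name is hypothetical, and the hipcc location under ROCM_HOME/bin is an assumption about the local install.

```python
import subprocess


def compiler_version(cuda_home, rocm_home):
    """Hypothetical dispatch: query nvcc on CUDA builds, hipcc on ROCm builds.

    Returns the compiler's version banner, or None when neither toolkit
    root is known. Assumes hipcc lives at <ROCM_HOME>/bin/hipcc.
    """
    if cuda_home is not None:
        cmd = [cuda_home + "/bin/nvcc", "-V"]
    elif rocm_home is not None:
        cmd = [rocm_home + "/bin/hipcc", "--version"]
    else:
        return None
    return subprocess.check_output(cmd, universal_newlines=True)


# With neither toolkit root available, the helper degrades gracefully:
assert compiler_version(None, None) is None
```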

adityas-amd avatar Jul 16 '25 22:07 adityas-amd