
RuntimeError: (true) == (fp != nullptr) INTERNAL ASSERT FAILED at "/home/xxx/tutel/tutel/custom/custom_kernel.cpp":33, please report a bug to PyTorch. CHECK_EQ fails.

Open numliu opened this issue 9 months ago • 5 comments

After packaging Tutel with PyInstaller, I get this error at runtime:

RuntimeError: (true) == (fp != nullptr) INTERNAL ASSERT FAILED at "/home/xxx/tutel/tutel/custom/custom_kernel.cpp":33, please report a bug to PyTorch. CHECK_EQ fails.

I printed fp and found that it was null. The path comes from custom_kernel.cpp line 102: std::string fatbin_path = code_path + std::string(".fatbin"); The program cannot open the generated xxx.cu.fatbin file, even though I can see it with the ls command.
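The assertion only reports that fp was null, not why the open failed. A minimal sketch of how the reason could be surfaced, assuming the file is opened with fopen (which the fp != nullptr check suggests but the thread does not show); describe_open_failure is a hypothetical helper, not part of Tutel:

```cpp
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical helper, not part of Tutel: tries to open `path` for reading.
// Returns an empty string on success, or a message that includes the errno
// text (for example "No such file or directory" or "Permission denied").
std::string describe_open_failure(const std::string& path) {
  errno = 0;
  FILE* fp = std::fopen(path.c_str(), "rb");
  if (fp != nullptr) {
    std::fclose(fp);
    return "";
  }
  const int err = errno;  // save errno before another call can overwrite it
  return "cannot open '" + path + "': " + std::strerror(err);
}
```

Logging this message for fatbin_path just before the failing check would show whether the process is looking at a different path than the one listed with ls, or is being denied access to it.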

numliu avatar Mar 27 '25 03:03 numliu

Is this from the latest Tutel version? Line 33 of the current file does not seem to match the error above:

https://github.com/microsoft/Tutel/blob/main/tutel/custom/custom_kernel.cpp#L33

Another question: are you using the CUDA backend or the ROCm backend?

ghostplant avatar Mar 27 '25 12:03 ghostplant

Tutel 0.2.x, using CUDA.

numliu avatar Mar 28 '25 03:03 numliu

I used PyInstaller to package libraries such as Tutel for training, and traced the problem to this code:

    pid_t pid = fork();
    if (pid == 0) {
    #if !defined(__HIP_PLATFORM_HCC__) && !defined(__HIP_PLATFORM_AMD__)
      CHECK_EQ(-1, execl(entry.c_str(), entry.c_str(), code_path, "-o", fatbin_path.c_str(), "--fatbin", "-O4", "-gencode", ("arch=compute_" + arch + ",code=sm_" + arch).c_str(), (char *)NULL));
    #else
      CHECK_EQ(-1, execl(entry.c_str(), entry.c_str(), code_path, "-o", fatbin_path.c_str(), "--genco", "-O4", "-w", ("--amdgpu-target=" + arch).c_str(), (char *)NULL));
    #endif
      exit(1);
    } else {
      wait(NULL);
    }

The parent process cannot access the .fatbin file created by the child process.
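Note that wait(NULL) in the quoted code discards the child's exit status, so the parent cannot tell whether the compiler actually ran and succeeded before it tries to read the .fatbin file. A minimal sketch of a variant that keeps the status, for debugging only; run_and_wait is a hypothetical helper, not Tutel code:

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <string>
#include <vector>

// Hypothetical helper, not Tutel code: runs args[0] with the given arguments
// and returns the child's exit code, or -1 if the child could not be started
// or did not exit normally (for example, it was killed by a signal).
int run_and_wait(const std::vector<std::string>& args) {
  if (args.empty()) return -1;

  // Build a NULL-terminated argv array, as execv requires.
  std::vector<char*> argv;
  for (const std::string& a : args) argv.push_back(const_cast<char*>(a.c_str()));
  argv.push_back(nullptr);

  pid_t pid = fork();
  if (pid < 0) return -1;          // fork itself failed
  if (pid == 0) {
    execv(argv[0], argv.data());   // returns only if the exec failed
    _exit(127);                    // conventional "could not execute" code
  }

  int status = 0;
  if (waitpid(pid, &status, 0) != pid) return -1;
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Called with the same entry path and arguments as the execl call above, a return value of 127 would mean the compiler binary could not be executed at all, and any other non-zero value would mean it ran but failed. Either case would point to a failed compilation rather than to a file the parent cannot see.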

numliu avatar Mar 28 '25 07:03 numliu

Can you run export USE_NVRTC=1 and try again?

ghostplant avatar Mar 28 '25 12:03 ghostplant