INTERNAL ASSERT FAILED

Open Qicheng-WANG opened this issue 2 years ago • 5 comments

Hi there. When I ran a quick test, "python3 -m tutel.examples.helloworld --batch_size=16", it failed with the following error: RuntimeError: (true) == (fp != nullptr)INTERNAL ASSERT FAILED at "/ssdisk2/tutel/tutel/custom/custom_kernel.cpp":46, please report a bug to PyTorch. CHECK_EQ fails. Could you help me fix it? Thanks.

Qicheng-WANG avatar May 02 '23 06:05 Qicheng-WANG

It also showed [image]. I am using an NVIDIA 3090 with CUDA 11.3.

Qicheng-WANG avatar May 02 '23 07:05 Qicheng-WANG

  1. Does print(torch.cuda.get_arch_list()) include sm_86?
  2. Can you try export USE_NVRTC=1 before running the example?
  3. Are you sure there is no other old CUDA installed so that an old nvcc command was wrongly called for this compilation?

ghostplant avatar May 02 '23 16:05 ghostplant
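
The three checks above can be gathered into one small script. This is only a minimal sketch: the helper names (`arch_listed`, `nvrtc_env`, `report`) are hypothetical and not part of tutel or PyTorch, and the architecture check is a literal string match that mirrors question 1 rather than modelling any compatibility between architectures.

```python
import os
import shutil


def arch_listed(arch_list, capability):
    """True if the (major, minor) capability appears as 'sm_XY' in arch_list.

    Literal match only, mirroring question 1; it does not model any
    compatibility between different architectures.
    """
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def nvrtc_env(base=None):
    """Return a copy of the environment with USE_NVRTC=1 set (question 2)."""
    env = dict(os.environ if base is None else base)
    env["USE_NVRTC"] = "1"
    return env


def report():
    """Print what the shell and PyTorch see on this machine."""
    # Question 3: the nvcc that a build step would pick up from PATH.
    print("nvcc on PATH:", shutil.which("nvcc"))
    try:
        import torch  # lazy import so the helpers above work without PyTorch
    except ImportError:
        print("PyTorch is not installed")
        return
    print("PyTorch built against CUDA:", torch.version.cuda)
    if not torch.cuda.is_available():
        print("No CUDA device visible to PyTorch")
        return
    archs = torch.cuda.get_arch_list()
    cap = torch.cuda.get_device_capability(0)
    print("arch list:", archs)
    print(f"device 0 is sm_{cap[0]}{cap[1]}; listed: {arch_listed(archs, cap)}")


if __name__ == "__main__":
    report()
```

To rerun the example with the variable set from Python instead of the shell, something like `subprocess.run([sys.executable, "-m", "tutel.examples.helloworld", "--batch_size=16"], env=nvrtc_env())` should be equivalent to exporting `USE_NVRTC=1` first.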

Hi! I am running tutel on a Jetson Nano B01 (4GB version) and I hit the same problem: "RuntimeError: (true) == (fp != nullptr)INTERNAL ASSERT FAILED at "/ssdisk2/tutel/tutel/custom/custom_kernel.cpp".

On the Nano:

  1. print(torch.cuda.get_arch_list()) gives ['sm_53', 'sm_62', 'sm_72'].
  2. I set export USE_NVRTC=1, but a different error occurred.
  3. My nvcc version is 10.2.3.

monster119120 avatar Aug 11 '23 08:08 monster119120

This is a PyTorch + CUDA problem, not a tutel one. You need a PyTorch build based on at least cu117/cu118 so that torch.cuda.get_arch_list() includes sm_86. You also need to update your CUDA SDK (e.g. to 12.0), since NVIDIA's newer GPUs are not compatible with older NVCC SDKs.

ghostplant avatar Aug 12 '23 06:08 ghostplant

CUDA 10.2.3 is too old and cannot support any GPU newer than V100 (sm_7x). CUDA 11 should support A100-class GPUs and CUDA 12 should support H100-class GPUs. After upgrading the CUDA SDK, please also reinstall a PyTorch build based on at least cu118.

ghostplant avatar Aug 12 '23 06:08 ghostplant
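
After an upgrade, it may help to confirm that both the nvcc on PATH and the installed PyTorch build meet the suggested version. The following is a rough sketch, not an official check: the helper names are hypothetical, the (11, 8) default is taken from the "at least cu118" advice in this thread, and the parser assumes the usual "release X.Y" wording in `nvcc --version` output.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(version_text):
    """Extract (major, minor) from `nvcc --version` output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", version_text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def parse_cuda_version(text):
    """Turn a string such as '11.8' (the form of torch.version.cuda) into (11, 8)."""
    if not text:
        return None  # CPU-only PyTorch builds report no CUDA version
    parts = text.split(".")
    return (int(parts[0]), int(parts[1]) if len(parts) > 1 else 0)


def meets_minimum(version, minimum=(11, 8)):
    """True if a parsed (major, minor) version is at least `minimum`."""
    return version is not None and version >= minimum


def main():
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        print("nvcc not found on PATH")
    else:
        out = subprocess.run([nvcc, "--version"], capture_output=True, text=True).stdout
        release = parse_nvcc_release(out)
        print("nvcc release:", release, "ok:", meets_minimum(release))
    try:
        import torch  # lazy import so the parsers work without PyTorch
    except ImportError:
        print("PyTorch is not installed")
        return
    built = parse_cuda_version(torch.version.cuda)
    print("PyTorch CUDA build:", built, "ok:", meets_minimum(built))


if __name__ == "__main__":
    main()
```

If the two versions disagree, or the nvcc path points into an older CUDA directory, that would be consistent with the "old nvcc wrongly called" scenario raised earlier in the thread.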