
[BUG]: stable diffusion 2.0 finetune error

Open Tengxu-Sun opened this issue 2 years ago • 3 comments

🐛 Describe the bug

I adapted the example by replacing `export MODEL_NAME="CompVis/stable-diffusion-v1-4"` with `export MODEL_NAME="stabilityai/stable-diffusion-2"`, then ran the script and got the following error:
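Concretely, the only change made to the example's launch script was the model id (everything else as in the original example):

```shell
# Original setting from the ColossalAI stable-diffusion fine-tune example
export MODEL_NAME="CompVis/stable-diffusion-v1-4"

# Replaced with the Stable Diffusion 2.0 checkpoint that triggers the error
export MODEL_NAME="stabilityai/stable-diffusion-2"
```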

```
RuntimeError: false INTERNAL ASSERT FAILED at "../c10/cuda/CUDAGraphsC10Utils.h":73, please report a bug to PyTorch. Unknown CUDA graph CaptureStatus32521
```

The failure seems to happen at `optimizer.backward(loss)`. Can anyone help?

Environment

torch==1.13.1, colossalai==0.2.5

— Tengxu-Sun, Apr 23 '23

Any help? Thanks a lot! @kurisusnowdeng @JThh @binmakeswell

— Tengxu-Sun, Apr 24 '23

Hi, can you try a preliminary fix by upgrading colossalai to the latest version? `pip install --upgrade colossalai` should do.

— JThh, Apr 24 '23

After that, can you run `colossalai check -i` and show us the message?

— JThh, Apr 24 '23

> Hi, can you try some preliminary fixes by upgrading colossalai to the latest version? `pip install colossalai` should do.

I have upgraded colossalai to the newest version, 0.2.8. It gives the same error.
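For reference, the active version can also be confirmed programmatically, which rules out a stale install shadowing the upgrade (a minimal stdlib sketch; `installed_version` is a hypothetical helper, not part of colossalai):

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

def installed_version(pkg: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(pkg)
    except PackageNotFoundError:
        return None

# After the upgrade, this should report "0.2.8" for "colossalai"
print(installed_version("colossalai"))
```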

— Tengxu-Sun, Apr 25 '23

> After that, can you run `colossalai check -i` and show us the message?

OK

```
#### Installation Report ####

------------ Environment ------------
Colossal-AI version: 0.2.8
PyTorch version: 1.13.1
System CUDA version: 11.7
CUDA version required by PyTorch: 11.7

Note:
1. The table above checks the versions of the libraries/tools in the current environment
2. If the System CUDA version is N/A, you can set the CUDA_HOME environment variable to locate it
3. If the CUDA version required by PyTorch is N/A, you probably did not install a CUDA-compatible PyTorch. This value is given by torch.version.cuda and you can go to https://pytorch.org/get-started/locally/ to download the correct version.

------------ CUDA Extensions AOT Compilation ------------
Found AOT CUDA Extension: ✓
PyTorch version used for AOT compilation: N/A
CUDA version used for AOT compilation: N/A

Note:
1. AOT (ahead-of-time) compilation of the CUDA kernels occurs during installation when the environment variable CUDA_EXT=1 is set
2. If AOT compilation is not enabled, stay calm as the CUDA kernels can still be built during runtime

------------ Compatibility ------------
PyTorch version match: N/A
System and PyTorch CUDA version match: ✓
System and Colossal-AI CUDA version match: N/A

Note:
1. The table above checks the version compatibility of the libraries/tools in the current environment
   - PyTorch version match: whether the PyTorch version in the current environment is compatible with the PyTorch version used for AOT compilation
   - System and PyTorch CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version required by PyTorch
   - System and Colossal-AI CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version used for AOT compilation
```
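The compatibility rows in the report boil down to comparing version strings; a minimal stdlib sketch of such a check (a hypothetical helper for illustration, not the actual `colossalai check` implementation):

```python
def versions_compatible(system_cuda: str, torch_cuda: str) -> bool:
    """Treat versions as compatible when their major.minor components match,
    e.g. '11.7' (system) vs '11.7' (required by PyTorch) in the report above."""
    def major_minor(v: str):
        parts = v.split(".")
        return int(parts[0]), int(parts[1])
    return major_minor(system_cuda) == major_minor(torch_cuda)

print(versions_compatible("11.7", "11.7"))  # the report's ✓ case
```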

— Tengxu-Sun, Apr 25 '23

It seems to be an ongoing unresolved issue mentioned here. You might want to take some reference from it.

— JThh, Apr 25 '23

Does it work when you run our default example?

— JThh, Apr 25 '23

> Does it work when you run our default example?

Yeah, the default example works fine.

— Tengxu-Sun, Apr 25 '23

> It seems to be an ongoing unresolved issue mentioned here. You might want to take some reference from it.

I have read the related issues in that link, but none of them offers a solution.

— Tengxu-Sun, Apr 25 '23

@kurisusnowdeng @JThh @binmakeswell Hi, I have reopened this issue since it is still blocking me. Please take a look.

— Tengxu-Sun, May 05 '23