ColossalAI
[BUG]: stable diffusion 2.0 finetune error
🐛 Describe the bug
I adapted the example by replacing `export MODEL_NAME="CompVis/stable-diffusion-v1-4"` with `export MODEL_NAME="stabilityai/stable-diffusion-2"`, then ran the script and got the following error:
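For reference, a minimal sketch of the change to the launch script (only the model id is swapped; every other line of the stock example is assumed unchanged):

```shell
# Point the example's fine-tuning script at Stable Diffusion 2.0
# instead of the default SD 1.4 checkpoint.
export MODEL_NAME="stabilityai/stable-diffusion-2"

echo "fine-tuning base model: ${MODEL_NAME}"
```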
RuntimeError: false INTERNAL ASSERT FAILED at "../c10/cuda/CUDAGraphsC10Utils.h":73, please report a bug to PyTorch. Unknown CUDA graph CaptureStatus32521
The failure seems to occur at `optimizer.backward(loss)`. Any help would be appreciated.
Environment
torch==1.13.1 ColossalAI==2.5.0
Any help? Thanks a lot! @kurisusnowdeng @JThh @binmakeswell
Hi, can you try some preliminary fixes by upgrading colossalai to the latest version? pip install colossalai should do.
After that, can you run colossalai check -i and show us the message?
> Hi, can you try some preliminary fixes by upgrading colossalai to the latest version? pip install colossalai should do.
I have upgraded colossalai to the latest version, 0.2.8. The same error occurs.
> After that, can you run colossalai check -i and show us the message?
OK
#### Installation Report ####
------------ Environment ------------
Colossal-AI version: 0.2.8
PyTorch version: 1.13.1
System CUDA version: 11.7
CUDA version required by PyTorch: 11.7
Note:
1. The table above checks the versions of the libraries/tools in the current environment
2. If the System CUDA version is N/A, you can set the CUDA_HOME environment variable to locate it
3. If the CUDA version required by PyTorch is N/A, you probably did not install a CUDA-compatible PyTorch. This value is given by torch.version.cuda and you can go to https://pytorch.org/get-started/locally/ to download the correct version.
------------ CUDA Extensions AOT Compilation ------------
Found AOT CUDA Extension: ✓
PyTorch version used for AOT compilation: N/A
CUDA version used for AOT compilation: N/A
Note:
1. AOT (ahead-of-time) compilation of the CUDA kernels occurs during installation when the environment variable CUDA_EXT=1 is set
2. If AOT compilation is not enabled, stay calm as the CUDA kernels can still be built during runtime
------------ Compatibility ------------
PyTorch version match: N/A
System and PyTorch CUDA version match: ✓
System and Colossal-AI CUDA version match: N/A
Note:
1. The table above checks the version compatibility of the libraries/tools in the current environment
- PyTorch version match: whether the PyTorch version in the current environment is compatible with the PyTorch version used for AOT compilation
- System and PyTorch CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version required by PyTorch
- System and Colossal-AI CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version used for AOT compilation
It seems to be an ongoing unresolved issue mentioned here. You might want to take some reference from it.
Does it work when you run our default example?
> Does it work when you run our default example?
Yeah, default example is ok.
> It seems to be an ongoing unresolved issue mentioned here. You might want to take some reference from it.
I have seen some related issues inside this link, but there is no solution.
@kurisusnowdeng @JThh @binmakeswell Hi, I have reopened this issue since I am still stuck on it. Please have a look.