[BUG]: Hybrid Parallel Plugin,zero_stage=1,zero_cpu_offload=true,terminate called after throwing an instance of 'c10::Error' what() Cuda error: unspecified launch failure cuda kernel errors might be asynchronously reported at some other API call
Is there an existing issue for this bug?
- [x] I have searched the existing issues
The bug has not been fixed in the latest main branch
- [x] I have checked the latest main branch
Do you feel comfortable sharing a concise (minimal) script that reproduces the error? :)
Yes, I will share a minimal reproducible script.
🐛 Describe the bug
colossalai==0.4.9, Hybrid Parallel Plugin, zero_stage=1, zero_cpu_offload=true. I am training QwQ-32B on eight A100 GPUs. With pp=2 and tp=4 the program runs normally, but GPU memory usage is very low: each 80 GB card uses only about 20 GB, while CPU memory usage is high and fills the server's host memory. After increasing max_length, the run fails with: terminate called after throwing an instance of 'c10::Error' what(): CUDA error: unspecified launch failure. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. How can I increase GPU memory usage and reduce CPU memory usage?
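For context, a minimal sketch of the kind of plugin setup described above (this is not the reporter's actual script; parameter names assume the `HybridParallelPlugin` constructor of colossalai 0.4.x, and the values are illustrative). Since ZeRO CPU offload moves optimizer states to host memory, setting `cpu_offload=False` is the usual way to trade the observed high CPU memory usage for higher GPU memory usage:

```python
# Illustrative config sketch, assuming the colossalai 0.4.x HybridParallelPlugin API.
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin

plugin = HybridParallelPlugin(
    tp_size=4,          # tensor parallel degree (tp=4 in the report)
    pp_size=2,          # pipeline parallel degree (pp=2 in the report)
    zero_stage=1,       # ZeRO stage 1 as in the report
    cpu_offload=False,  # keep optimizer states on the GPU: more device memory
                        # used, far less host (CPU) memory than cpu_offload=True
    precision="bf16",
    microbatch_size=1,  # illustrative; tune for the pipeline schedule
)
booster = Booster(plugin=plugin)
# model, optimizer, ... = booster.boost(model, optimizer, ...)
```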
Environment
No response
Hi @happynaruto, would it be possible to provide the script you ran?
Assuming that you are doing some fine-tuning, I tested using examples/language/llama/benchmark.py by adding the following config:
"qwen": Qwen2Config(
hidden_act="silu",
hidden_size=5120,
intermediate_size=27648,
max_position_embeddings=40960,
max_window_layers=4,
num_attention_heads=40,
num_hidden_layers=64,
num_key_value_heads=8,
sliding_window=32768,
use_sliding_window=False,
), # from https://huggingface.co/Qwen/QwQ-32B/blob/main/config.json
When testing with a 16384 sequence length:

```shell
colossalai run --nproc_per_node 8 examples/language/llama/benchmark.py -c qwen -p 3d --pp 2 --tp 4 --zero 0 -g -x -b 16 -l 16384
```

```
Booster init max device memory: 24270.23 MB
...
Max device memory usage: 77474.94 MB
```