
[BUG]: Hybrid Parallel Plugin, zero_stage=1, zero_cpu_offload=true: terminate called after throwing an instance of 'c10::Error' what(): CUDA error: unspecified launch failure. CUDA kernel errors might be asynchronously reported at some other API call

Open twenty-one-boy opened this issue 8 months ago • 1 comments

Is there an existing issue for this bug?

  • [x] I have searched the existing issues

The bug has not been fixed in the latest main branch

  • [x] I have checked the latest main branch

Do you feel comfortable sharing a concise (minimal) script that reproduces the error? :)

Yes, I will share a minimal reproducible script.

🐛 Describe the bug

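colossal=0.4.9, Hybrid Parallel Plugin, zero_stage=1, zero_cpu_offload=true. I am training QWQ32B on eight A100 GPUs. With pp=2 and tp=4 the program runs normally, but GPU memory usage is very low: only 20G of each 80G card is used, while CPU memory usage is high and fills the server's CPU memory. After increasing max_length, the run fails with: terminate called after throwing an instance of 'c10::Error' what(): CUDA error: unspecified launch failure. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. How can I increase GPU memory usage and reduce CPU memory usage?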

Environment

No response

twenty-one-boy avatar Apr 16 '25 15:04 twenty-one-boy

Hi @happynaruto, would it be possible to provide the script you ran?

Assuming that you are doing some fine-tuning, I tested with examples/language/llama/benchmark.py after adding the following config:

"qwen": Qwen2Config(
    hidden_act="silu",
    hidden_size=5120,
    intermediate_size=27648,
    max_position_embeddings=40960,
    max_window_layers=4,
    num_attention_heads=40,
    num_hidden_layers=64,
    num_key_value_heads=8,
    sliding_window=32768,
    use_sliding_window=False,
), # from https://huggingface.co/Qwen/QwQ-32B/blob/main/config.json
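For a sense of scale, the config above can be turned into a rough parameter count. This is only a sketch: the helper name `qwen2_param_count` is made up for illustration, the layer layout (biases on q/k/v only, a three-matrix gated MLP, two RMSNorm weights per layer) is my understanding of Qwen2, and `vocab_size=151936` with untied embeddings is an assumption, since the snippet above does not set either value.

```python
def qwen2_param_count(hidden_size, intermediate_size, num_hidden_layers,
                      num_attention_heads, num_key_value_heads,
                      vocab_size, tie_word_embeddings=False):
    """Rough parameter count for a Qwen2-style decoder (illustrative helper)."""
    head_dim = hidden_size // num_attention_heads
    kv_dim = num_key_value_heads * head_dim  # grouped-query attention width

    # Attention: q/k/v projections with bias, output projection without bias.
    attn = (hidden_size * hidden_size + hidden_size)   # q_proj
    attn += 2 * (hidden_size * kv_dim + kv_dim)        # k_proj, v_proj
    attn += hidden_size * hidden_size                  # o_proj

    # Gated MLP: gate, up, and down projections, no bias.
    mlp = 3 * hidden_size * intermediate_size

    # Two RMSNorm weight vectors per layer.
    norms = 2 * hidden_size

    per_layer = attn + mlp + norms
    embed = vocab_size * hidden_size
    lm_head = 0 if tie_word_embeddings else vocab_size * hidden_size
    final_norm = hidden_size
    return num_hidden_layers * per_layer + embed + lm_head + final_norm


# Values from the config above; vocab_size is an assumption (not in the snippet).
total = qwen2_param_count(
    hidden_size=5120,
    intermediate_size=27648,
    num_hidden_layers=64,
    num_attention_heads=40,
    num_key_value_heads=8,
    vocab_size=151936,
)
print(f"~{total / 1e9:.1f}B parameters")
```

Under these assumptions the count lands a little under 33 billion, which is consistent with the "32B" in the model name.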

When testing with a sequence length of 16384:

colossalai run --nproc_per_node 8 examples/language/llama/benchmark.py -c qwen -p 3d --pp 2 --tp 4 --zero 0 -g -x -b 16 -l 16384

Booster init max device memory: 24270.23 MB
...
Max device memory usage: 77474.94 MB
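As a back-of-envelope check on why CPU offload can fill host memory while leaving the GPUs mostly idle, here is a rough estimate of model-state memory. It is a sketch under stated assumptions, not a description of what ColossalAI does internally: the function name is hypothetical, the parameter count is rounded to 32e9, weights and gradients are taken as 2 bytes each (bf16/fp16), the optimizer state as 12 bytes per parameter (fp32 master copy plus two Adam moments), parameters are assumed to split evenly across tp x pp ranks, and offload is assumed to move all optimizer state to the host. Activations, communication buffers, and pinned-memory overhead are ignored. Note that with 8 GPUs and tp=4, pp=2, the data-parallel size is 1, so ZeRO stage 1 has no other ranks to shard optimizer state across.

```python
GIB = 1024 ** 3


def estimate_model_state_gib(n_params, tp_size, pp_size, cpu_offload,
                             bytes_weight=2, bytes_grad=2, bytes_optim=12):
    """Rough model-state memory estimate (hypothetical helper, see assumptions above)."""
    n_ranks = tp_size * pp_size
    params_per_gpu = n_params / n_ranks          # assumes an even split

    weights = params_per_gpu * bytes_weight / GIB
    grads = params_per_gpu * bytes_grad / GIB
    optim = params_per_gpu * bytes_optim / GIB   # per-rank optimizer state

    gpu_per_device = weights + grads + (0.0 if cpu_offload else optim)
    # With offload, every rank's optimizer state lives in the same host memory.
    cpu_total = optim * n_ranks if cpu_offload else 0.0
    return {"gpu_per_device_gib": gpu_per_device, "cpu_total_gib": cpu_total}


with_offload = estimate_model_state_gib(32e9, tp_size=4, pp_size=2, cpu_offload=True)
without_offload = estimate_model_state_gib(32e9, tp_size=4, pp_size=2, cpu_offload=False)

for name, est in (("cpu_offload=True", with_offload), ("cpu_offload=False", without_offload)):
    print(f"{name}: {est['gpu_per_device_gib']:.1f} GiB per GPU, "
          f"{est['cpu_total_gib']:.1f} GiB host memory")
```

Under these assumptions, offloading puts several hundred GiB of optimizer state in host memory, while without offload the model states alone would take roughly 60 GiB per GPU before any activations. So turning offload off should shift memory from CPU to GPU, but it may leave little headroom for long sequences.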

botbw avatar Aug 13 '25 03:08 botbw