[BUG]: Hybrid Parallel Plugin,zero_stage=1,zero_cpu_offload=true,terminate called after throwing an instance of 'c10::Error' what() Cuda error: unspecified launch failure cuda kernel errors might be asynchronously reported at some other API call
Is there an existing issue for this bug?
- [x] I have searched the existing issues
The bug has not been fixed in the latest main branch
- [x] I have checked the latest main branch
Do you feel comfortable sharing a concise (minimal) script that reproduces the error? :)
Yes, I will share a minimal reproducible script.
🐛 Describe the bug
colossalai==0.4.9, Hybrid Parallel Plugin, zero_stage=1, zero_cpu_offload=true. I am training QwQ-32B on eight A100 GPUs. With pp=2 and tp=4 the program runs normally, but GPU memory usage is very low: each 80 GB card uses only about 20 GB, while CPU memory usage is high and fills the server's host memory. After increasing max_length, the run fails with: terminate called after throwing an instance of 'c10::Error' what(): CUDA error: unspecified launch failure. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. How can I increase GPU memory usage and reduce CPU memory usage?
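For context, a minimal sketch of the kind of plugin setup described above (this is not the reporter's actual script; parameter names assume the `HybridParallelPlugin` constructor of colossalai 0.4.x, and the values are illustrative). Since ZeRO CPU offload moves optimizer states to host memory, setting `cpu_offload=False` is the usual way to trade the observed high CPU memory usage for higher GPU memory usage:

```python
# Illustrative config sketch, assuming the colossalai 0.4.x HybridParallelPlugin API.
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin

plugin = HybridParallelPlugin(
    tp_size=4,          # tensor parallel degree (tp=4 in the report)
    pp_size=2,          # pipeline parallel degree (pp=2 in the report)
    zero_stage=1,       # ZeRO stage 1 as in the report
    cpu_offload=False,  # keep optimizer states on the GPU: more device memory
                        # used, far less host (CPU) memory than cpu_offload=True
    precision="bf16",
    microbatch_size=1,  # illustrative; tune for the pipeline schedule
)
booster = Booster(plugin=plugin)
# model, optimizer, ... = booster.boost(model, optimizer, ...)
```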
Environment
No response
Hi @happynaruto, would it be possible to provide the script you ran?
Assuming that you are doing some fine-tuning, I tested using examples/language/llama/benchmark.py by adding the following config:
"qwen": Qwen2Config(
hidden_act="silu",
hidden_size=5120,
intermediate_size=27648,
max_position_embeddings=40960,
max_window_layers=4,
num_attention_heads=40,
num_hidden_layers=64,
num_key_value_heads=8,
sliding_window=32768,
use_sliding_window=False,
), # from https://huggingface.co/Qwen/QwQ-32B/blob/main/config.json
When testing with a 16384 sequence length:

```shell
colossalai run --nproc_per_node 8 examples/language/llama/benchmark.py -c qwen -p 3d --pp 2 --tp 4 --zero 0 -g -x -b 16 -l 16384
```

```
Booster init max device memory: 24270.23 MB
...
Max device memory usage: 77474.94 MB
```