[BUG]: weird hang while training
Is there an existing issue for this bug?
- [X] I have searched the existing issues
🐛 Describe the bug
When training a language model with the GeminiPlugin, I encountered an issue where the process got stuck during the forward step. I was saving a checkpoint every 3000 steps, and when it got stuck, I had to kill the process and resume from the latest checkpoint.
Steps at which the hang occurred:
| start step | stuck step | steps until hang |
|---|---|---|
| 225000 | 271464 | 46464 |
| 180000 | 226463 | 46463 |
| 135000 | 181463 | 46463 |
| 90000 | 136463 | 46463 |
| 45000 | 91463 | 46463 |
| 0 | 46465 | 46465 |
Do you have any idea how to find out why? Thanks a lot.
Environment
- CUDA: 12.1
- NCCL: 2.18
- PyTorch: 2.1.2
- Python: 3.8
- ColossalAI: 0.4.2
Can you share any relevant messages and the stack trace on hang or exit?
> Can you share any relevant messages and the stack trace on hang or exit?
I didn’t receive any useful information or logs. All nodes seem to be functioning correctly. The only option I have is to kill the training process and resume it.
When I add more logging, I can see that the process gets stuck at the forward step.
Could you share the stack trace when you kill it with Ctrl-C, and a reproducible script?
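If Ctrl-C doesn't print anything useful, a minimal sketch for pulling a Python stack trace out of a stuck rank (assuming a Unix system and that you can edit the training script) is to register a faulthandler signal handler early on:

```python
# Hypothetical addition near the top of the training script: on `kill -USR1 <pid>`,
# every thread of that rank dumps its Python stack to stderr without being killed.
import faulthandler
import signal
import sys

faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)
```

Alternatively, `py-spy dump --pid <pid>` on the stuck rank gives the same kind of trace without modifying the script.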
> Could you share the stack trace when you kill it with Ctrl-C, and a reproducible script?
Could it be caused by the weird behavior described in https://github.com/hpcaitech/ColossalAI/issues/6111?
You can probably test the behavior of all_gather_object and see if it spawns multiple processes.
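A minimal sketch of such a test (the file name and launch command are just examples): gather an object that contains a CUDA tensor and then watch `nvidia-smi` for extra PIDs.

```python
# probe_all_gather_object.py
# Launch with: torchrun --nproc_per_node=2 probe_all_gather_object.py
import os
import time

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes an object wrapping a tensor on its own GPU.
    payload = {"rank": dist.get_rank(), "state": torch.ones(4, device="cuda")}
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, payload)

    # While this sleeps, check `nvidia-smi`: if every rank's PID now shows up on
    # every GPU, the gathered CUDA tensors were materialized on the senders' devices.
    print(f"rank {dist.get_rank()} gathered devices:",
          [obj["state"].device for obj in gathered], flush=True)
    time.sleep(60)
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```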
What happens with `booster.save_optimizer(optimizer, path_optimizer, shard=True, size_per_shard=2048)` is that it calls into `save_sharded_optimizer`, which all-gathers the states. You can try removing some barriers along this call stack and ping other members with your findings (whether it fixes the hang).
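To narrow down which rank stalls inside that call, one option (a hedged sketch; `booster`, `optimizer`, and `path_optimizer` are the objects from your own training loop) is to wrap the save with per-rank logging:

```python
import torch.distributed as dist


def save_optimizer_with_logging(step, booster, optimizer, path_optimizer):
    # Same save call as in the training script, wrapped with per-rank prints so
    # the last message tells you which rank never returned from the save.
    rank = dist.get_rank()
    print(f"[rank {rank}] entering save_optimizer at step {step}", flush=True)
    booster.save_optimizer(optimizer, path_optimizer, shard=True, size_per_shard=2048)
    print(f"[rank {rank}] finished save_optimizer at step {step}", flush=True)
```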
I observed that, after this line: https://github.com/hpcaitech/ColossalAI/blob/2f583c154944b1f00d7a4fb1ce529db9d8595056/colossalai/zero/gemini/gemini_optimizer.py#L525, the PIDs of the other ranks start appearing on rank 0.
Furthermore, at this line: https://github.com/hpcaitech/ColossalAI/blob/2f583c154944b1f00d7a4fb1ce529db9d8595056/colossalai/zero/gemini/gemini_optimizer.py#L593, if `device` is replaced with `torch.device(f"cuda:{torch.cuda.current_device()}")`, each rank retains only one PID, just as at the start:
```python
# Modified allocation at gemini_optimizer.py#L593: pin the buffer to the
# local device instead of the passed-in `device`.
compacted_states = torch.zeros(
    compacted_size,
    dtype=dtype,
    device=torch.device(f"cuda:{torch.cuda.current_device()}"),
    requires_grad=False,
)
```
However, after reaching this line: https://github.com/hpcaitech/ColossalAI/blob/2f583c154944b1f00d7a4fb1ce529db9d8595056/colossalai/zero/gemini/gemini_optimizer.py#L532, the PIDs of the other ranks still start appearing on each rank.
Hi @ver217, could you take a look? Thanks very much.
> However, after reaching this line: https://github.com/hpcaitech/ColossalAI/blob/2f583c154944b1f00d7a4fb1ce529db9d8595056/colossalai/zero/gemini/gemini_optimizer.py#L532, the PIDs of the other ranks still start appearing on each rank.
This might just be the default behavior: all_gather by definition collects tensor-based objects from the other ranks, see https://discuss.pytorch.org/t/distributed-all-gather-object-produces-multiple-additional-processes/164991. As for the hang, please try removing the `dist.barrier` call.
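If removing the barrier is inconvenient to test, another way to see which collective is hanging (a sketch, assuming you can reach the place where the process group is created, e.g. inside or before the colossalai launch) is to shorten the collective timeout and enable async error handling, so the stuck collective aborts with an error instead of blocking forever:

```python
# Hedged sketch: make a hanging collective fail with an error instead of blocking,
# so the resulting traceback points at the offending barrier/all_gather.
import os
from datetime import timedelta

import torch.distributed as dist

# Ask the NCCL watchdog to tear the process down when a collective times out;
# the flag spelling varies across PyTorch versions, so set both to be safe.
os.environ.setdefault("TORCH_NCCL_ASYNC_ERROR_HANDLING", "1")
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")

# A shorter timeout surfaces the hang quickly (the default is much longer).
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=10))
```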