[BUG]: weird hang while training
Is there an existing issue for this bug?
- [X] I have searched the existing issues
🐛 Describe the bug
When training a language model with the GeminiPlugin, I encountered an issue where the process got stuck during the forward step. I was saving a checkpoint every 3000 steps, and when it got stuck, I had to kill the process and resume from the latest checkpoint.
Steps at which the hang occurred:
| start step | stuck step | steps until hang |
|---|---|---|
| 225000 | 271464 | 46464 |
| 180000 | 226463 | 46463 |
| 135000 | 181463 | 46463 |
| 90000 | 136463 | 46463 |
| 45000 | 91463 | 46463 |
| 0 | 46465 | 46465 |
Do you have any idea how to find out why? Thanks a lot.
Environment
- CUDA: 12.1
- NCCL: 2.18
- PyTorch: 2.1.2
- Python: 3.8
- ColossalAI: 0.4.2
Can you share any relevant messages and the stack trace on hang or exit?
> Can you share any relevant messages and the stack trace on hang or exit?
I didn’t receive any useful information or logs. All nodes seem to be functioning correctly. The only option I have is to kill the training process and resume it.
When I add more logging, I can see that the process gets stuck at the forward step.
Could you share the stack trace when you kill it with Ctrl-C, and a reproducible script?
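If Ctrl-C doesn't print anything useful, a minimal sketch for pulling a Python stack trace out of a stuck rank (assuming a Unix system and that you can edit the training script) is to register a faulthandler signal handler early on:

```python
# Hypothetical addition near the top of the training script: on `kill -USR1 <pid>`,
# every thread of that rank dumps its Python stack to stderr without being killed.
import faulthandler
import signal
import sys

faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)
```

Alternatively, `py-spy dump --pid <pid>` on the stuck rank gives the same kind of trace without modifying the script.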
> Could you share the stack trace when you kill it with Ctrl-C, and a reproducible script?
Could it be caused by the weird behavior described in https://github.com/hpcaitech/ColossalAI/issues/6111?
You can probably test the behavior of all_gather_object and see if it spawns multiple processes.
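A minimal sketch of such a test (the file name and launch command are just examples): gather an object that contains a CUDA tensor and then watch `nvidia-smi` for extra PIDs.

```python
# probe_all_gather_object.py
# Launch with: torchrun --nproc_per_node=2 probe_all_gather_object.py
import os
import time

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes an object wrapping a tensor on its own GPU.
    payload = {"rank": dist.get_rank(), "state": torch.ones(4, device="cuda")}
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, payload)

    # While this sleeps, check `nvidia-smi`: if every rank's PID now shows up on
    # every GPU, the gathered CUDA tensors were materialized on the senders' devices.
    print(f"rank {dist.get_rank()} gathered devices:",
          [obj["state"].device for obj in gathered], flush=True)
    time.sleep(60)
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```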
What happens with `booster.save_optimizer(optimizer, path_optimizer, shard=True, size_per_shard=2048)` is that it calls into `save_sharded_optimizer`, which all-gathers the states. You can try removing some barriers along this call stack and ping other members with your findings (whether it fixes the hang).
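To narrow down which rank stalls inside that call, one option (a hedged sketch; `booster`, `optimizer`, and `path_optimizer` are the objects from your own training loop) is to wrap the save with per-rank logging:

```python
import torch.distributed as dist


def save_optimizer_with_logging(step, booster, optimizer, path_optimizer):
    # Same save call as in the training script, wrapped with per-rank prints so
    # the last message tells you which rank never returned from the save.
    rank = dist.get_rank()
    print(f"[rank {rank}] entering save_optimizer at step {step}", flush=True)
    booster.save_optimizer(optimizer, path_optimizer, shard=True, size_per_shard=2048)
    print(f"[rank {rank}] finished save_optimizer at step {step}", flush=True)
```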
I observed that, after this line: https://github.com/hpcaitech/ColossalAI/blob/2f583c154944b1f00d7a4fb1ce529db9d8595056/colossalai/zero/gemini/gemini_optimizer.py#L525, the PIDs of the other ranks start appearing on rank 0.
Furthermore, at this line: https://github.com/hpcaitech/ColossalAI/blob/2f583c154944b1f00d7a4fb1ce529db9d8595056/colossalai/zero/gemini/gemini_optimizer.py#L593, if `device` is replaced with `torch.device(f"cuda:{torch.cuda.current_device()}")`, each rank retains only one PID, just as at the start:
```python
# Modified allocation at gemini_optimizer.py#L593: pin the buffer to the
# local device instead of the passed-in `device`.
compacted_states = torch.zeros(
    compacted_size,
    dtype=dtype,
    device=torch.device(f"cuda:{torch.cuda.current_device()}"),
    requires_grad=False,
)
```
However, after reaching this line: https://github.com/hpcaitech/ColossalAI/blob/2f583c154944b1f00d7a4fb1ce529db9d8595056/colossalai/zero/gemini/gemini_optimizer.py#L532, the PIDs of the other ranks still start appearing on each rank.
Hi @ver217, could you take a look? Thanks very much.
> However, after reaching this line: https://github.com/hpcaitech/ColossalAI/blob/2f583c154944b1f00d7a4fb1ce529db9d8595056/colossalai/zero/gemini/gemini_optimizer.py#L532, the PIDs of the other ranks still start appearing on each rank.
This might just be the default behavior: all_gather by definition collects tensor-based objects from the other ranks, see https://discuss.pytorch.org/t/distributed-all-gather-object-produces-multiple-additional-processes/164991. As for the hang, please try removing the `dist.barrier` call.
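If removing the barrier is inconvenient to test, another way to see which collective is hanging (a sketch, assuming you can reach the place where the process group is created, e.g. inside or before the colossalai launch) is to shorten the collective timeout and enable async error handling, so the stuck collective aborts with an error instead of blocking forever:

```python
# Hedged sketch: make a hanging collective fail with an error instead of blocking,
# so the resulting traceback points at the offending barrier/all_gather.
import os
from datetime import timedelta

import torch.distributed as dist

# Ask the NCCL watchdog to tear the process down when a collective times out;
# the flag spelling varies across PyTorch versions, so set both to be safe.
os.environ.setdefault("TORCH_NCCL_ASYNC_ERROR_HANDLING", "1")
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")

# A shorter timeout surfaces the hang quickly (the default is much longer).
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=10))
```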