[BUG]: why does a duplicate PID appear on rank 0
Is there an existing issue for this bug?
- [X] I have searched the existing issues
🐛 Describe the bug
When using the GeminiPlugin to train a model, everything runs normally at the start. However, once a sharded checkpoint is saved, a duplicate PID appears on rank 0.
Start:
After saving a checkpoint:
Why does this happen, and how can it be avoided? Thanks a lot.
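For context, here is a minimal sketch of the kind of setup that hits this; the model, optimizer choice, launch call, and paths below are placeholders rather than my actual training code:

```python
# Minimal sketch (placeholders, not the actual training script).
# Launch with: torchrun --nproc_per_node=<num_gpus> repro.py
import torch.nn as nn
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin
from colossalai.nn.optimizer import HybridAdam

colossalai.launch_from_torch()  # signature may differ across ColossalAI versions
model = nn.Linear(4096, 4096)   # placeholder model
optimizer = HybridAdam(model.parameters(), lr=1e-3)

booster = Booster(plugin=GeminiPlugin())
model, optimizer, *_ = booster.boost(model, optimizer)

# ... training loop: nvidia-smi shows exactly one PID per GPU here ...

# after this call, rank 0's GPU shows extra PIDs from the other ranks
booster.save_optimizer(optimizer, "ckpt/optimizer", shard=True, size_per_shard=2048)
```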
Environment
Torch: 2.1.2
ColossalAI: 0.4.2
Python: 3.8
CUDA: 12.1.0
While digging into this, I found that the PIDs from other ranks appear on rank 0 when saving the optimizer:
```python
torch.cuda.empty_cache()
booster.save_optimizer(optimizer, path_optimizer, shard=True, size_per_shard=2048)
```
I observed that after this line runs: https://github.com/hpcaitech/ColossalAI/blob/2f583c154944b1f00d7a4fb1ce529db9d8595056/colossalai/zero/gemini/gemini_optimizer.py#L525, the PIDs of the other ranks start appearing on rank 0.
Furthermore, at this line: https://github.com/hpcaitech/ColossalAI/blob/2f583c154944b1f00d7a4fb1ce529db9d8595056/colossalai/zero/gemini/gemini_optimizer.py#L593, if device is replaced with torch.device(f"cuda:{torch.cuda.current_device()}"), each rank retains only one PID, just as at the start:
```python
compacted_states = torch.zeros(
    compacted_size,
    dtype=dtype,
    device=torch.device(f"cuda:{torch.cuda.current_device()}"),
    requires_grad=False,
)
```
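My understanding (an assumption on my part) is that the duplicate PID is simply each rank's process creating a CUDA context on GPU 0 when it allocates a tensor there. A standalone sketch, unrelated to ColossalAI internals, that reproduces the same nvidia-smi symptom:

```python
# Standalone sketch: run with torchrun --nproc_per_node=2.
# Allocating a tensor on cuda:0 from the rank-1 process creates a CUDA context
# on GPU 0, so nvidia-smi then lists rank 1's PID under GPU 0 as well.
import os
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

x = torch.zeros(1, device=f"cuda:{local_rank}")  # one PID per GPU so far

if local_rank != 0:
    # this single allocation is enough to add rank 1's PID to GPU 0 in nvidia-smi
    y = torch.zeros(1, device="cuda:0")

dist.barrier()
time.sleep(60)  # keep the processes alive long enough to inspect nvidia-smi
```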
And even with that change, after reaching this line: https://github.com/hpcaitech/ColossalAI/blob/2f583c154944b1f00d7a4fb1ce529db9d8595056/colossalai/zero/gemini/gemini_optimizer.py#L532, the PIDs from other ranks still start appearing on each rank.
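A small helper along these lines might help confirm which GPUs each rank's process has allocated on (this assumes the stray context comes from a PyTorch allocation, so it shows up in the caching allocator's per-device stats):

```python
# Hypothetical debugging helper, not part of ColossalAI: print which GPUs the
# current process has allocated memory on via PyTorch's caching allocator.
import torch
import torch.distributed as dist

def report_touched_gpus(tag: str) -> None:
    touched = [
        i for i in range(torch.cuda.device_count())
        if torch.cuda.memory_reserved(i) > 0
    ]
    print(f"[{tag}] rank {dist.get_rank()} has allocated on GPUs: {touched}")

# e.g. call it right before and right after booster.save_optimizer(...)
```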
Could any ColossalAI developers help me with this? Thanks a lot.