
[BUG]: why do duplicate PIDs appear on rank 0?

Open airlsyn opened this issue 1 year ago • 3 comments

Is there an existing issue for this bug?

  • [X] I have searched the existing issues

🐛 Describe the bug

When using the GeminiPlugin to train a model, training runs normally at the start. However, once a sharded checkpoint is saved, duplicate PIDs appear on rank 0.

Start: [screenshot]

After saving a checkpoint: [screenshot]

Why does this happen, and how can I avoid it? Thanks a lot.
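For reference, a minimal sketch of my setup (hypothetical toy model and checkpoint path; exact launch and plugin arguments may differ between ColossalAI versions):

import colossalai
import torch
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin
from colossalai.nn.optimizer import HybridAdam

colossalai.launch_from_torch()
booster = Booster(plugin=GeminiPlugin())

model = torch.nn.Linear(1024, 1024)
optimizer = HybridAdam(model.parameters(), lr=1e-3)
model, optimizer, *_ = booster.boost(model, optimizer)

# ... training steps run normally here: nvidia-smi shows one PID per GPU ...

# After this call, PIDs from the other ranks appear under rank 0's GPU.
booster.save_optimizer(optimizer, "ckpt/optimizer", shard=True, size_per_shard=2048)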

Environment

Torch: 2.1.2
ColossalAI: 0.4.2
Python: 3.8
CUDA: 12.1.0

airlsyn · Nov 03 '24 01:11

While digging into this, I found that the PIDs from other ranks appear on rank 0 when the optimizer is saved:

torch.cuda.empty_cache()
# Saving the optimizer as a sharded checkpoint is what triggers the duplicate PIDs.
booster.save_optimizer(optimizer, path_optimizer, shard=True, size_per_shard=2048)

airlsyn · Nov 03 '24 07:11

I observed that, after this line executes: https://github.com/hpcaitech/ColossalAI/blob/2f583c154944b1f00d7a4fb1ce529db9d8595056/colossalai/zero/gemini/gemini_optimizer.py#L525, the PIDs from other ranks start appearing on rank 0.

Furthermore, after reaching this line: https://github.com/hpcaitech/ColossalAI/blob/2f583c154944b1f00d7a4fb1ce529db9d8595056/colossalai/zero/gemini/gemini_optimizer.py#L593

If device is replaced with torch.device(f"cuda:{torch.cuda.current_device()}"), each rank retains only one PID, just as at the start:

# Allocate the compacted states on this rank's own GPU rather than on the
# shared device, so no CUDA context is created on other ranks' GPUs.
compacted_states = torch.zeros(
    compacted_size,
    dtype=dtype,
    device=torch.device(f"cuda:{torch.cuda.current_device()}"),
    requires_grad=False,
)
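This matches how nvidia-smi attributes processes: any process that allocates memory on a GPU creates a CUDA context there and is listed under that GPU. Below is a standalone sketch (my own minimal example, not ColossalAI code; it assumes two GPUs and a torchrun launch) that reproduces the duplicate-PID pattern:

import time
import torch
import torch.distributed as dist

# Run with: torchrun --nproc_per_node=2 repro.py
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# Healthy case: allocating on this rank's own device keeps one PID per GPU.
ok = torch.zeros(1024, device=torch.device(f"cuda:{torch.cuda.current_device()}"))

# Problematic case: every rank allocates on cuda:0, so every rank creates a
# CUDA context on GPU 0 and nvidia-smi lists all ranks' PIDs under GPU 0.
bad = torch.zeros(1024, device=torch.device("cuda:0"))

dist.barrier()
time.sleep(60)  # window to inspect nvidia-smi before the processes exit
dist.destroy_process_group()

With the cuda:0 allocation removed, nvidia-smi shows exactly one PID per GPU again.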

And after reaching this line: https://github.com/hpcaitech/ColossalAI/blob/2f583c154944b1f00d7a4fb1ce529db9d8595056/colossalai/zero/gemini/gemini_optimizer.py#L532

the PIDs from other ranks still start appearing on each rank.
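To check which PIDs actually hold a compute context on each GPU, rather than eyeballing nvidia-smi, I used a small pynvml script (this assumes the nvidia-ml-py package, which exposes the same data nvidia-smi shows):

import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    # Each entry is a process that holds a CUDA compute context on GPU i.
    procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
    print(f"GPU {i}: PIDs {[p.pid for p in procs]}")
pynvml.nvmlShutdown()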

airlsyn · Nov 04 '24 04:11

Could anyone from the ColossalAI team take a look? Thanks a lot.

airlsyn · Nov 04 '24 04:11