[BUG]: Acc decreases when training with PP and TP

Open Gy-Lu opened this issue 3 years ago • 1 comments

🐛 Describe the bug

When I ran ViT on CIFAR-10, I found that with a hybrid of pipeline parallelism (PP) and tensor parallelism (TP), the test accuracy decreased as training progressed. However, training went fine with PP or TP alone.

  • non_interleaved PP and 1d TP
CONFIG = dict(NUM_MICRO_BATCHES=2, parallel=dict(pipeline=2, tensor=dict(size=2, mode='1d')))
image
  • interleaved PP (num_chunks=1) and 1d TP
NUM_CHUNKS = 1
CONFIG = dict(NUM_MICRO_BATCHES=2, parallel=dict(pipeline=2, tensor=dict(size=2, mode='1d')), model=dict(num_chunks=NUM_CHUNKS))
image
  • interleaved PP (num_chunks=2) and 2d TP
NUM_CHUNKS = 2
CONFIG = dict(NUM_MICRO_BATCHES=2, parallel=dict(pipeline=2, tensor=dict(size=2, mode='1d')), model=dict(num_chunks=NUM_CHUNKS))
image
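
For anyone trying to reproduce this, the configs above can be restated as plain dicts to see how many processes each run needs. This is only a sketch: `CONFIGS` and `min_world_size` are hypothetical names that are not part of ColossalAI, and the helper assumes `pipeline` is given as a plain integer and that the data-parallel size is 1.

```python
# Sketch only: restates the configs from this issue as plain dicts.
# Nothing here imports or calls ColossalAI.
CONFIGS = {
    "non_interleaved_pp_1d_tp": dict(
        NUM_MICRO_BATCHES=2,
        parallel=dict(pipeline=2, tensor=dict(size=2, mode='1d')),
    ),
    "interleaved_pp_1_chunk_1d_tp": dict(
        NUM_MICRO_BATCHES=2,
        parallel=dict(pipeline=2, tensor=dict(size=2, mode='1d')),
        model=dict(num_chunks=1),
    ),
    "interleaved_pp_2_chunks": dict(
        NUM_MICRO_BATCHES=2,
        parallel=dict(pipeline=2, tensor=dict(size=2, mode='1d')),
        model=dict(num_chunks=2),
    ),
}


def min_world_size(config):
    """Hypothetical helper: pipeline size times tensor-parallel size.

    Assumes `pipeline` is an int and data-parallel size is 1.
    """
    parallel = config.get("parallel", {})
    pipeline = parallel.get("pipeline", 1)
    tensor = parallel.get("tensor", {}).get("size", 1)
    return pipeline * tensor


if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        print(name, "needs at least", min_world_size(cfg), "processes")
```

Under those assumptions, all three hybrid runs need at least 4 processes, since each one combines a pipeline size of 2 with a tensor-parallel size of 2.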

Environment

No response

Gy-Lu avatar Apr 24 '22 07:04 Gy-Lu

Oh, I found that this model was defined with torch.nn. :x
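
One way to check a model for this is to list the submodules whose classes come from `torch.nn`. The sketch below is a hypothetical helper, not a ColossalAI API. It assumes the model exposes `named_modules()` the way `torch.nn.Module` does, and the default set of layer names to flag is a guess at which layers matter for tensor parallelism.

```python
def find_plain_torch_layers(model, layer_names=("Linear", "Conv2d", "Embedding")):
    """Return (submodule_name, class_name) pairs for layers defined in torch.nn.

    `model` is any object exposing named_modules(), as torch.nn.Module does.
    `layer_names` is an assumed list of layer types worth flagging.
    """
    found = []
    for name, module in model.named_modules():
        cls = type(module)
        # torch.nn.Linear, for example, lives in "torch.nn.modules.linear".
        if cls.__module__.startswith("torch.nn") and cls.__name__ in layer_names:
            found.append((name, cls.__name__))
    return found
```

Calling `find_plain_torch_layers(model)` on the ViT used here would return the names of any plain `torch.nn` layers of those types. Layers whose classes are defined in other packages are not flagged.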

Gy-Lu avatar Apr 25 '22 03:04 Gy-Lu

We have updated a lot. This issue was closed due to inactivity. Thanks.

binmakeswell avatar Apr 13 '23 04:04 binmakeswell