
Questions about log interpretation, seems paradoxical

Open shjwudp opened this issue 2 years ago • 6 comments

This line of the log confuses me: my batch size is 513 and the iteration time is 98.83 s, so the throughput should be 513 / 98.83 ≈ 5.19 samples/s, not 41.595. The logged iteration time and throughput are obviously contradictory. Could someone tell me how to interpret this? Thanks!

[Screenshot of the training log showing 98.83s/it and throughput=41.595]

My training configuration:

# ColossalAI Version: v0.1.3
HIDDEN_SIZE = 2048
BATCH_SIZE = 513
NUM_EPOCHS = 1
SEQ_LEN = 2048
NUM_MICRO_BATCHES = 513
TENSOR_SHAPE = (BATCH_SIZE // NUM_MICRO_BATCHES, SEQ_LEN, HIDDEN_SIZE)

parallel = dict(
    pipeline=3,
    tensor=dict(mode='1d', size=1)
)
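For reference, the arithmetic behind the reported contradiction can be sketched as follows (a hypothetical back-of-the-envelope check using the values from the log above, not ColossalAI's actual logging code):

```python
# Sanity check: throughput implied by the logged iteration time.
BATCH_SIZE = 513          # samples per iteration, from the config above
ITERATION_TIME_S = 98.83  # average s/it shown by the progress bar

samples_per_second = BATCH_SIZE / ITERATION_TIME_S
print(f"{samples_per_second:.2f} samples/s")  # 5.19 samples/s, not 41.595
```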

shjwudp avatar May 10 '22 04:05 shjwudp

Hi, I believe there is an arithmetic error somewhere. Let's investigate this problem 🔥

FrankLeeeee avatar May 10 '22 08:05 FrankLeeeee

Hi @shjwudp, we use tqdm to show the progress bar, where 98.83s/it is an average value computed by tqdm, while throughput=41.595 is the value for the latest step, reported by ColossalAI. We also provide an average result after each epoch. Is that number also abnormal?
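The discrepancy described above can be reproduced with a toy calculation. The per-step times below are assumed numbers chosen only to match the logged figures; tqdm's s/it averages over all steps seen so far, while the throughput line reflects only the most recent step:

```python
# Toy illustration of why tqdm's averaged "s/it" and a per-step
# throughput can disagree. Step times are hypothetical: a very slow
# first step (e.g. warm-up / compilation) dominates the average.
step_times_s = [270.0, 14.16, 12.33]
batch_size = 513

tqdm_avg_s_per_it = sum(step_times_s) / len(step_times_s)
latest_throughput = batch_size / step_times_s[-1]

print(f"{tqdm_avg_s_per_it:.2f} s/it")       # 98.83 s/it (the average)
print(f"{latest_throughput:.3f} samples/s")  # 41.606 samples/s (latest step)
```

So both numbers can be correct at the same time; they simply summarize different windows of the run.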

kurisusnowdeng avatar May 10 '22 08:05 kurisusnowdeng

@FrankLeeeee @kurisusnowdeng Thanks for your response, it resolved my confusion perfectly! By the way, I found that the GPT example has excellent scaling efficiency but poor computing performance. Under the same hyperparameter configuration and resource usage, DeepSpeed achieves 3x the throughput of ColossalAI. This blows my mind. Do you have plans to improve the performance of the GPT example? I think a lot of people would be interested in this.

shjwudp avatar May 10 '22 15:05 shjwudp

> We also provide an average result after each epoch. Is that number also abnormal?

@kurisusnowdeng I haven't run a full epoch, but I have a task that will run an epoch tomorrow, then I'll sync my findings with you :)

shjwudp avatar May 10 '22 15:05 shjwudp

@shjwudp We are looking forward to your results.

kurisusnowdeng avatar May 10 '22 16:05 kurisusnowdeng

@kurisusnowdeng The average result for the epoch is 32.005, which is closer to the throughput value than to the iteration time.

shjwudp avatar May 11 '22 02:05 shjwudp

We have made many updates since then. This issue was closed due to inactivity. Thanks.

binmakeswell avatar Apr 13 '23 04:04 binmakeswell