ColossalAI-Examples

Too large training loss

qyc-98 opened this issue on Jul 13, 2022 · 1 comment

🐛 Describe the bug

Hi

I'm training BERT with sequence parallelism in Colossal-AI according to this link. But my training loss is too large, and it seems to grow roughly linearly with the sequence parallel size.

When my setting is parallel = dict(pipeline=1, tensor=dict(size=8, mode='sequence')), the training loss at the beginning was … and after 2330 steps the training loss is 13.044.

When my setting is parallel = dict(pipeline=1, tensor=dict(size=2, mode='sequence')), the training loss after 2330 steps is 13.044.

When my setting is parallel = dict(pipeline=1, tensor=dict(size=1, mode='sequence')), the training loss after 2330 steps is 6.5549.
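For what it's worth, one way a logged loss can scale with the parallel size is if per-rank partial losses are combined with a sum all-reduce instead of a mean. The sketch below is a hypothetical illustration of that failure mode, not a confirmed diagnosis and not code taken from Colossal-AI:

```python
# Hypothetical illustration only: a logged loss that scales with the
# world size because per-rank losses are summed, not averaged.
# This is NOT taken from the Colossal-AI source.
import torch
import torch.distributed as dist

def reduced_loss_for_logging(local_loss: torch.Tensor) -> torch.Tensor:
    """All-reduce a per-rank loss before logging it."""
    loss = local_loss.clone()
    dist.all_reduce(loss, op=dist.ReduceOp.SUM)
    # Without the normalization below, the logged value is the SUM of
    # all ranks' losses and grows roughly linearly with world size:
    loss /= dist.get_world_size()
    return loss
```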

Environment

After running colossalai check -i, I got the following output: [screenshot not included]

My devices are 8 RTX 3090 GPUs, and the training batch size is 128 across all three sequence parallel settings.

My training config is shown in the screenshot below: [screenshot not included]
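Since the screenshot is not preserved here, below is a minimal sketch of what a Colossal-AI sequence-parallel config file of this kind typically looks like. Only the parallel dict, the global batch size of 128, and the 2330-step count come from this report; every other value is an illustrative placeholder:

```python
# config.py -- minimal sketch of a Colossal-AI sequence-parallel config.
# The `parallel` dict, GLOBAL_BATCH_SIZE, and TRAIN_ITERS are taken
# from this report; the remaining values are illustrative placeholders,
# not the reporter's actual settings.

# Split each sequence across 8 GPUs; no pipeline parallelism.
parallel = dict(
    pipeline=1,
    tensor=dict(size=8, mode='sequence'),
)

GLOBAL_BATCH_SIZE = 128  # total batch size across all ranks (from the report)
TRAIN_ITERS = 2330       # losses above are quoted after 2330 steps

# Placeholder hyperparameters for illustration only.
SEQ_LENGTH = 512
LEARNING_RATE = 1e-4
```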

Thanks!

qyc-98 (Jul 13 '22 08:07)

Hi @qyc-98, thank you for your feedback. We will try to reproduce your issue.

By the way, we are restructuring the documentation and examples; the new examples will be provided at the following link: https://github.com/hpcaitech/ColossalAI/tree/main/examples

binmakeswell (Nov 15 '22 05:11)