[BUG] ACL stream synchronize failed, error code 507015

Open janelu9 opened this issue 1 year ago • 2 comments

Model: llama3.1-8b, zero_stage=1, pipe_parallel_size=2, model_parallel_size=2, world_size=16 NPUs
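For reference, here is a minimal sketch of how these settings would divide the 16 NPUs. It assumes the usual 3D-parallel layout, where the world size is the product of the pipeline, model, and data parallel degrees. The helper name `data_parallel_size` is hypothetical and not part of DeepSpeed.

```python
def data_parallel_size(world_size: int,
                       pipe_parallel_size: int,
                       model_parallel_size: int) -> int:
    """Hypothetical helper: data-parallel degree left over after
    pipeline and model parallelism, assuming
    world_size = pipe * model * data."""
    ranks_per_replica = pipe_parallel_size * model_parallel_size
    if world_size % ranks_per_replica != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"pipe*model={ranks_per_replica}"
        )
    return world_size // ranks_per_replica


# Settings from this report: 16 NPUs, pipeline=2, model=2.
dp = data_parallel_size(world_size=16,
                        pipe_parallel_size=2,
                        model_parallel_size=2)
print(dp)  # 4
```

Under that assumption, each model replica spans 4 NPUs, and ZeRO stage 1 would partition optimizer states across the 4 data-parallel ranks.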

image

The above errors often occur unpredictably during training when I use model parallelism together with ZeRO stage 1.

janelu9 avatar Sep 19 '24 04:09 janelu9

@xuedinge233 & @hipudding

jomayeri avatar Oct 07 '24 17:10 jomayeri

Thanks for this bug report.

hipudding avatar Oct 08 '24 08:10 hipudding

@hipudding - anything to add here? Should we leave this open or close this?

loadams avatar Oct 31 '24 17:10 loadams

@hipudding - let us know if we need to re-open this; otherwise, closing for now since it's not related to DeepSpeed.

loadams avatar Nov 01 '24 17:11 loadams

@loadams Sorry for the late reply. This is a problem in torch_npu, and I don't think it's related to DeepSpeed. We will look into it. Please keep this issue closed. Thanks.

hipudding avatar Nov 04 '24 03:11 hipudding