[BUG] ACL stream synchronize failed, error code 507015
Model: llama3.1-8b, zero_stage=1, pipe_parallel_size=2, model_parallel_size=2, world_size=16 (NPUs)
The above error occurs often and unpredictably during training when I use model parallelism together with ZeRO stage 1.
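For anyone trying to reproduce this, the configuration above implies a data-parallel degree of 4, assuming the usual decomposition world_size = pipe_parallel_size × model_parallel_size × data_parallel_size. ZeRO stage 1 partitions optimizer state across the data-parallel ranks. A minimal sketch of that arithmetic follows; the helper name `data_parallel_size` is hypothetical and not part of DeepSpeed or torch_npu:

```python
def data_parallel_size(world_size: int,
                       pipe_parallel_size: int,
                       model_parallel_size: int) -> int:
    """Return the data-parallel degree.

    Assumes world_size = pipe_parallel_size * model_parallel_size * dp,
    which is an assumption about the launcher's layout, not something
    confirmed in this report.
    """
    ranks_per_replica = pipe_parallel_size * model_parallel_size
    if world_size % ranks_per_replica != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"pipe_parallel_size * model_parallel_size={ranks_per_replica}"
        )
    return world_size // ranks_per_replica


# Values taken from the report: 16 NPUs, pipe=2, model=2.
dp = data_parallel_size(world_size=16, pipe_parallel_size=2, model_parallel_size=2)
print(dp)  # 4
```

Under that assumption, each model replica spans 4 NPUs and the ZeRO stage 1 optimizer state is split across 4 data-parallel ranks.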
@xuedinge233 & @hipudding
Thanks for this bug report.
@hipudding - anything to add here? Should we leave this open or close this?
@hipudding - let us know if we need to re-open this; otherwise, closing for now since it's not related to DeepSpeed.
@loadams Sorry for the late reply. This is a problem in torch_npu, and I don't think it is related to DeepSpeed. We will look into it. Please keep this issue closed. Thanks.