
Step 2 exited with non-zero status 2

Open awelldone opened this issue 2 years ago • 1 comments

In step 2, how do I solve this problem? (screenshot attached) @codedecde

awelldone avatar Apr 22 '23 08:04 awelldone

Hi, can you provide the details from training.log? Its location is shown after "Log output:" in your screenshot.
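For anyone collecting that log, here is a minimal sketch of pulling the last lines of a log file so they can be pasted into the issue. The helper name `tail_log` and the path in the usage example are assumptions for illustration; use whatever path your own run prints.

```python
from collections import deque
from pathlib import Path


def tail_log(log_path, n=50):
    """Return the last n lines of a text log file as a list of strings."""
    path = Path(log_path)
    if not path.is_file():
        raise FileNotFoundError(f"log file not found: {path}")
    # deque with maxlen keeps only the final n lines while streaming the file
    with path.open("r", encoding="utf-8", errors="replace") as f:
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]


if __name__ == "__main__":
    # Hypothetical path: replace with the location reported by your run.
    example_path = "training.log"
    if Path(example_path).is_file():
        print("\n".join(tail_log(example_path, n=50)))
```

The error that caused the non-zero exit status is usually near the end of the log, so the last few dozen lines are often enough.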

yaozhewei avatar Apr 24 '23 03:04 yaozhewei

I ran the following command: python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_node, and the process stopped at the beginning of step 2. Can I just rerun the step 2 shell script without rerunning step 1 first?

blizzardwj avatar Apr 25 '23 04:04 blizzardwj

Hi @awelldone,

Thank you for bringing this issue to our attention. There could be a variety of reasons causing this problem. If you could provide us with the error log (training.log), it would greatly assist us in identifying the issue.

In the meantime, you can try running the shell script for step 2 without having to rerun step 1, as long as the model checkpoint from step 1 has already been saved. Both checkpoints from step 1 and step 2 are utilized in the training for step 3.
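As a rough sketch of the precondition described above, the following checks that a step 1 checkpoint directory exists and contains at least one file before step 2 is rerun. The function name `checkpoint_ready` and the example directory are assumptions for illustration, not part of the DeepSpeed-Chat scripts; point it at wherever your step 1 run actually wrote its output.

```python
from pathlib import Path


def checkpoint_ready(ckpt_dir):
    """Return True if ckpt_dir is a directory containing at least one file."""
    path = Path(ckpt_dir)
    # rglob("*") walks subdirectories too, in case files are nested
    return path.is_dir() and any(p.is_file() for p in path.rglob("*"))


if __name__ == "__main__":
    # Hypothetical location: replace with your own step 1 output directory.
    step1_dir = "output/actor-models/1.3b"
    if checkpoint_ready(step1_dir):
        print(f"Found step 1 output in {step1_dir}; step 2 can be rerun on its own.")
    else:
        print(f"No step 1 output found in {step1_dir}; rerun step 1 first.")
```

This only confirms that files are present. It does not verify that the checkpoint is complete or loadable.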

Kindly let us know if this helps to address your issue.

Best, Minjia

minjiaz avatar May 04 '23 16:05 minjiaz

Looks like the issue has been resolved.

minjiaz avatar May 12 '23 16:05 minjiaz