DeepSpeedExamples Step 2 exited with non-zero status 2

In step 2, how can I solve this problem? @codedecde

Apr 22 '23 08:04 awelldone

Hi, can you provide the details from training.log (see the "Log output:" line in your screenshot)?
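For anyone unsure how to pull the relevant part of the log, here is a minimal sketch that returns the last lines of a text file so they can be pasted into the issue. The helper name `tail_log` and the example path in the comment are hypothetical; use the path printed after "Log output:" in your own run.

```python
from collections import deque
from pathlib import Path


def tail_log(path, n=50):
    """Return the last n lines of a text file as a list of strings."""
    # errors="replace" keeps the helper from failing on stray non-UTF-8 bytes.
    with Path(path).open("r", encoding="utf-8", errors="replace") as f:
        # deque with maxlen keeps only the final n lines while streaming the file.
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]


# Example use (the path is an assumption, replace it with your own):
# print("\n".join(tail_log("path/to/training.log", n=50)))
```

The error that produced the non-zero exit status is usually near the end of the log, which is why only the tail is collected here.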

Apr 24 '23 03:04 yaozhewei

I ran the following command: python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_node, and the process stopped at the beginning of step 2. Can I just rerun the shell script for step 2 without rerunning step 1 first?

Apr 25 '23 04:04 blizzardwj

Hi @awelldone,

Thank you for bringing this issue to our attention. Several different causes could produce this error. If you could provide the error log (training.log), it would greatly help us identify the issue.

In the meantime, you can try running the shell script for step 2 without rerunning step 1, as long as the model checkpoint from step 1 has already been saved. The checkpoints from both step 1 and step 2 are used in the training for step 3.
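A quick way to confirm that a checkpoint was saved before skipping a step is sketched below. It only checks that a directory exists and contains at least one non-empty file; it does not validate the checkpoint contents. The helper name `checkpoint_saved` and the directory in the comment are hypothetical, since the thread does not state where the step 1 output was written.

```python
from pathlib import Path


def checkpoint_saved(ckpt_dir):
    """Return True if ckpt_dir exists and holds at least one non-empty file."""
    d = Path(ckpt_dir)
    if not d.is_dir():
        return False
    # Search subdirectories too, since checkpoints may be nested.
    return any(p.is_file() and p.stat().st_size > 0 for p in d.rglob("*"))


# Example use (the directory is an assumption, replace it with your step 1 output path):
# print(checkpoint_saved("path/to/step1/output"))
```

If this returns False for the step 1 output directory, rerun step 1 before step 2.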

Kindly let us know if this helps to address your issue.

Best, Minjia

May 04 '23 16:05 minjiaz

Looks like the issue has been resolved.

May 12 '23 16:05 minjiaz