Step 2 exited with non-zero status 2
In step 2, how can I solve this problem?
@codedecde
Hi, can you provide the details from training.log (see the "Log output:" line in your screenshot)?
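In case it helps with sharing the log, here is a minimal sketch for pulling the last few lines out of training.log, which is usually where the actual error shows up. The helper name tail_log and the example path are hypothetical; use whatever path your "Log output:" line points to.

```python
from collections import deque
from pathlib import Path


def tail_log(log_path, n=30):
    """Return the last n lines of a log file as a list of strings.

    Returns an empty list if the file does not exist.
    """
    path = Path(log_path)
    if not path.is_file():
        return []
    # deque with maxlen keeps only the final n lines while streaming the file
    with path.open("r", errors="replace") as fh:
        return [line.rstrip("\n") for line in deque(fh, maxlen=n)]


if __name__ == "__main__":
    # Hypothetical path: replace with the path printed after "Log output:"
    for line in tail_log("training.log"):
        print(line)
```

Pasting that output into the thread is normally enough to see which error caused the non-zero exit status.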
I ran the following command: python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_node. The process then stopped at the beginning of step 2. Can I just rerun the step 2 shell script without rerunning step 1 first?
Hi @awelldone,
Thank you for bringing this issue to our attention. There could be a variety of reasons causing this problem. If you could provide us with the error log (training.log), it would greatly assist us in identifying the issue.
In the meantime, you can try running the shell script for step 2 without rerunning step 1, as long as the model checkpoint from step 1 has already been saved. The checkpoints from both step 1 and step 2 are used in the training for step 3.
Kindly let us know if this helps to address your issue.
Best, Minjia
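As a rough way to confirm that the step 1 checkpoint was saved before rerunning only step 2, something like the sketch below could be used. It only checks that the output directory exists and contains at least one non-empty file; it does not validate the checkpoint itself. The function name checkpoint_saved and the example directory are assumptions for illustration, so substitute the output directory your step 1 run actually wrote to.

```python
from pathlib import Path


def checkpoint_saved(output_dir):
    """Return True if output_dir exists and holds at least one non-empty file."""
    path = Path(output_dir)
    if not path.is_dir():
        return False
    # Search recursively, since checkpoints may be written into subfolders
    return any(f.is_file() and f.stat().st_size > 0 for f in path.rglob("*"))


if __name__ == "__main__":
    # Hypothetical location: adjust to your step 1 output directory
    step1_dir = "output/actor-models/1.3b"
    if checkpoint_saved(step1_dir):
        print("Step 1 output found; rerunning only step 2 should be fine.")
    else:
        print("No step 1 output found; rerun step 1 first.")
```

If the check fails, step 1 most likely did not finish saving, and step 3 would later be missing its actor checkpoint anyway.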
Looks like the issue has been resolved.