rj42
Fix training crash due to missing checkpoint directory

We encountered a training crash with the error: "RuntimeError: Parent directory /workspace/ckpts/global_step_20 does not exist". It appears that `self.actor_rollout_wg.save_checkpoint`, which should create the...
**Problem**

We've encountered a race condition when creating checkpoint directories during DAPO training that causes the process to crash with the following error:

> RuntimeError: Parent directory /workspace/ckpts/global_step_20 does not...
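
For illustration, one common way to avoid this kind of race is to have every worker create the step directory idempotently before saving. The sketch below is not necessarily the change made in this PR; `ensure_checkpoint_dir` is a hypothetical helper, and the `global_step_<N>` layout is assumed from the error message above.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def ensure_checkpoint_dir(root: str, global_step: int) -> str:
    """Create <root>/global_step_<N> if it is missing and return its path.

    Hypothetical helper: the name and directory layout are assumptions.
    """
    path = os.path.join(root, f"global_step_{global_step}")
    # exist_ok=True makes the call idempotent: if another worker created
    # the directory first, no FileExistsError is raised here.
    os.makedirs(path, exist_ok=True)
    return path


if __name__ == "__main__":
    # Simulate several workers racing to create the same step directory.
    with tempfile.TemporaryDirectory() as root:
        with ThreadPoolExecutor(max_workers=8) as pool:
            paths = list(pool.map(lambda _: ensure_checkpoint_dir(root, 20), range(8)))
        print(os.path.isdir(paths[0]))
```

Calling the helper from every rank is safe because no caller depends on being the first to create the directory.
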
1) Replace `=` with `==` in `requirements.txt`
2) Enable eval mode and disable gradients during scoring in the `training step`
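
A minimal sketch of what the second item could look like in PyTorch, assuming the scoring model is a regular `torch.nn.Module`. The `scoring_mode` context manager and the toy model are hypothetical and are not taken from the actual change.

```python
from contextlib import contextmanager

import torch
from torch import nn


@contextmanager
def scoring_mode(model: nn.Module):
    """Temporarily put `model` in eval mode with gradients disabled.

    Hypothetical helper, shown only to illustrate the idea.
    """
    was_training = model.training
    model.eval()  # switches off dropout and freezes batch-norm statistics
    try:
        with torch.no_grad():  # no autograd graph is built while scoring
            yield model
    finally:
        # Restore the previous mode so the training step continues as before.
        model.train(was_training)


# Toy stand-in for the scoring model (assumption, for demonstration only).
model = nn.Sequential(nn.Linear(4, 2), nn.Dropout(p=0.5))
model.train()

with scoring_mode(model) as m:
    scores = m(torch.ones(1, 4))
```

Restoring the earlier mode in `finally` keeps the surrounding training step unaffected even if scoring raises an exception.
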