Improve documentation for resuming training runs
The guide `lerobot/examples/4_train_policy_with_script.md` tells users to point the CLI arg `config_path` to the directory containing the `train_config.json` file rather than to the JSON file itself, which doesn't work. Put explicitly,
```
python lerobot/scripts/train.py \
    --config_path=outputs/train/run_resumption/checkpoints/last/pretrained_model/ \
    --resume=true
```
is wrong and should be corrected to
```
python lerobot/scripts/train.py \
    --config_path=outputs/train/run_resumption/checkpoints/last/pretrained_model/train_config.json \
    --resume=true
```
Furthermore, if the training run to be resumed logs to wandb and the user did not initially provide a wandb run id, the user must pass the corresponding run id on resumption, because `train_config.json` does not save it. The documentation should state that `wandb.run_id` must be provided when resuming. Or, more conveniently, the id could be saved somewhere with the checkpoint.
Also, I tried to resume training with a newly collected dataset; it seems like this isn't possible yet? Or did I just do something wrong?
Would also love to see this fixed. I see two options:
- if this is desired, update the markdown file
- if not, add a check in the code that distinguishes between both cases for the `pretrained_path`
I would be willing to submit a PR for this quick fix if someone from the LeRobot team can clarify which option to take.
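For the second option, the check could be as simple as resolving either form to the JSON file. A minimal sketch (the function name `resolve_config_path` is hypothetical, not LeRobot's API):

```python
from pathlib import Path


def resolve_config_path(config_path: str) -> Path:
    """Accept either train_config.json itself or the directory containing it."""
    path = Path(config_path)
    if path.is_dir():
        candidate = path / "train_config.json"
        if not candidate.is_file():
            raise FileNotFoundError(f"No train_config.json found in {path}")
        return candidate
    if path.is_file():
        return path
    raise FileNotFoundError(f"Config path does not exist: {path}")
```

With such a check in place, both command forms shown above would work, and the markdown file would not need to change.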
Yes! The current documentation and implementation are very confusing. It would also be great to see clear documentation both for resuming a training run and for starting a new training run by initializing model weights from a pretrained checkpoint.