
Improve documentation for resuming training runs

Open leonmkim opened this issue 9 months ago

lerobot/examples/4_train_policy_with_script.md guides users to point the CLI arg --config_path to the directory containing the train_config.json file rather than to the JSON file itself, which doesn't work.

Put explicitly:

python lerobot/scripts/train.py \
    --config_path=outputs/train/run_resumption/checkpoints/last/pretrained_model/ \
    --resume=true

is wrong and should be corrected to:

python lerobot/scripts/train.py \
    --config_path=outputs/train/run_resumption/checkpoints/last/pretrained_model/train_config.json \
    --resume=true

Furthermore, if the training run being resumed uses wandb for logging and the user did not initially provide a wandb run id, the user must pass the corresponding run id explicitly, since train_config.json does not save it. The documentation should specify that wandb.run_id must be provided on resumption. Or, more conveniently, the id should be saved alongside the checkpoint.
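
Concretely, resumption with logging continuity would then look something like the following (the placeholder run id is hypothetical; the --wandb.run_id flag is assumed to map onto the wandb.run_id config field mentioned above):

python lerobot/scripts/train.py \
    --config_path=outputs/train/run_resumption/checkpoints/last/pretrained_model/train_config.json \
    --resume=true \
    --wandb.run_id=<your_wandb_run_id>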

leonmkim commented Mar 21 '25 19:03

Also, I tried to resume training with a newly collected dataset; it seems like this isn't possible yet? Or did I just do something wrong?

Esser50K commented Mar 21 '25 21:03

Would also love to see this fixed. I see two options:

  1. if pointing to the directory is the desired behavior, update the markdown file
  2. if not, add a check in the code that distinguishes between the two cases for the pretrained_path (see the sketch below)

I would be willing to submit a PR for this quick fix if someone from the LeRobot team can clarify which option to take.
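
For option 2, a minimal sketch of such a check (the function name, placement, and error message are hypothetical, not LeRobot's actual config-loading code):

from pathlib import Path

def resolve_config_path(config_path: str) -> Path:
    """Accept either train_config.json itself or the directory containing it."""
    path = Path(config_path)
    if path.is_dir():
        # The user pointed at the checkpoint directory; append the expected file name.
        path = path / "train_config.json"
    if not path.is_file():
        raise FileNotFoundError(f"No train_config.json found at {config_path}")
    return path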

tlpss commented Apr 15 '25 12:04

Yes! The current documentation and implementation are very confusing. It would also be great to see clear documentation both for resuming a training run and for starting a new training run that initializes model weights from a pretrained checkpoint.
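
For the latter case, something like the following might work, assuming the CLI exposes --policy.path for loading pretrained weights (the dataset and output_dir values are placeholders):

python lerobot/scripts/train.py \
    --policy.path=outputs/train/run_resumption/checkpoints/last/pretrained_model \
    --dataset.repo_id=<your_dataset> \
    --output_dir=outputs/train/finetune_from_checkpoint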

gdaddi commented Sep 17 '25 16:09