Provide more information to the user
- remove abbreviations and use "longer names" in logs
- enable progress bar for evaluation by default
- enable asynchronous envs by default
- add a README section on resuming training
Perhaps you can also print the config at the beginning of training/eval.
Currently you are using Hydra, and different config files may override each other. Sometimes a user may not remember the settings in another config file, or may not know that their config is overridden by another one.
Simply printing all configs at the beginning of a training/eval process solves this problem, like what MMEngine does.
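As a sketch of the idea (not the PR's actual code), logging the resolved config once before training starts could look like this; in practice one would first resolve the Hydra `DictConfig`, e.g. with `OmegaConf.to_container`:

```python
import logging
from pprint import pformat

logging.basicConfig(level=logging.INFO)

# Stand-in for the resolved Hydra config.
cfg = {
    "dataset_repo_id": "lerobot/pusht",
    "device": "cuda",
    "eval": {"batch_size": 50, "n_episodes": 50, "use_async_envs": True},
    "fps": 10,
    "seed": 100000,
}

# Log everything once, up front, so overrides hidden in other
# config files are visible to the user.
logging.info(pformat(cfg))
```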
@Cadene:
Re progress bar: you are right, I will not make it an option. I still suggest enabling both progress bars (i.e., the bar for episodes and the bar for steps within an episode). Users can easily locate evaluation problems if a step takes too long or there are too many episodes.
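A sketch of what the two nested bars could look like with tqdm (the numbers here are illustrative, not lerobot's actual eval loop):

```python
from tqdm import trange

n_episodes = 3
n_steps = 5
transitions = []

# Outer bar: one tick per episode. Inner bar: one tick per env step,
# so a stalled step (or an unexpectedly large episode count) stands out.
for episode in trange(n_episodes, desc="episodes"):
    for step in trange(n_steps, desc="steps", leave=False):
        transitions.append((episode, step))
```

`leave=False` clears the inner bar after each episode so the output stays compact.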
As a side note, it now logs this at the beginning of training, which is very easy to read:
INFO 2024-08-14 13:15:24 <stdin>:1 {'dataset_repo_id': 'lerobot/pusht',
'device': 'cuda',
'env': {'action_dim': 2,
'episode_length': 300,
'fps': '${fps}',
'gym': {'obs_type': 'pixels_agent_pos',
'render_mode': 'rgb_array',
'visualization_height': 384,
'visualization_width': 384},
'image_size': 96,
'name': 'pusht',
'state_dim': 2,
'task': 'PushT-v0'},
'eval': {'batch_size': 50, 'n_episodes': 50, 'use_async_envs': True},
'fps': 10,
'override_dataset_stats': {'action': {'max': [511.0, 511.0],
'min': [12.0, 25.0]},
'observation.image': {'mean': [[[0.5]],
[[0.5]],
[[0.5]]],
'std': [[[0.5]],
[[0.5]],
[[0.5]]]},
'observation.state': {'max': [496.14618, 510.9579],
'min': [13.456424,
32.938293]}},
'policy': {'beta_end': 0.02,
'beta_schedule': 'squaredcos_cap_v2',
'beta_start': 0.0001,
'clip_sample': True,
'clip_sample_range': 1.0,
'crop_is_random': True,
'crop_shape': [84, 84],
'diffusion_step_embed_dim': 128,
'do_mask_loss_for_padding': False,
'down_dims': [512, 1024, 2048],
'horizon': 16,
'input_normalization_modes': {'observation.image': 'mean_std',
'observation.state': 'min_max'},
'input_shapes': {'observation.image': [3, 96, 96],
'observation.state': ['${env.state_dim}']},
'kernel_size': 5,
'n_action_steps': 8,
'n_groups': 8,
'n_obs_steps': 2,
'name': 'diffusion',
'noise_scheduler_type': 'DDPM',
'num_inference_steps': None,
'num_train_timesteps': 100,
'output_normalization_modes': {'action': 'min_max'},
'output_shapes': {'action': ['${env.action_dim}']},
'prediction_type': 'epsilon',
'pretrained_backbone_weights': None,
'spatial_softmax_num_keypoints': 32,
'use_film_scale_modulation': True,
'use_group_norm': True,
'vision_backbone': 'resnet18'},
'resume': False,
'seed': 100000,
'training': {'adam_betas': [0.95, 0.999],
'adam_eps': 1e-08,
'adam_weight_decay': 1e-06,
'batch_size': 64,
'delta_timestamps': {'action': '[i / ${fps} for i in range(1 - '
'${policy.n_obs_steps}, 1 - '
'${policy.n_obs_steps} + '
'${policy.horizon})]',
'observation.image': '[i / ${fps} for i in '
'range(1 - '
'${policy.n_obs_steps}, '
'1)]',
'observation.state': '[i / ${fps} for i in '
'range(1 - '
'${policy.n_obs_steps}, '
'1)]'},
'do_online_rollout_async': False,
'drop_n_last_frames': 7,
'eval_freq': 100,
'grad_clip_norm': 10,
'image_transforms': {'brightness': {'min_max': [0.8, 1.2],
'weight': 1},
'contrast': {'min_max': [0.8, 1.2],
'weight': 1},
'enable': False,
'hue': {'min_max': [-0.05, 0.05],
'weight': 1},
'max_num_transforms': 3,
'random_order': False,
'saturation': {'min_max': [0.5, 1.5],
'weight': 1},
'sharpness': {'min_max': [0.8, 1.2],
'weight': 1}},
'log_freq': 200,
'lr': 0.0001,
'lr_scheduler': 'cosine',
'lr_warmup_steps': 500,
'num_workers': 4,
'offline_steps': 200000,
'online_buffer_capacity': None,
'online_buffer_seed_size': 0,
'online_env_seed': None,
'online_rollout_batch_size': 1,
'online_rollout_n_episodes': 1,
'online_sampling_ratio': 0.5,
'online_steps': 0,
'online_steps_between_rollouts': 1,
'save_checkpoint': True,
'save_freq': 100},
'use_amp': False,
'video_backend': 'pyav',
'wandb': {'disable_artifact': False,
'enable': False,
'notes': '',
'project': 'lerobot'}}
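For reference, the `delta_timestamps` strings in the dump above are Python expressions evaluated after Hydra resolves the `${...}` interpolations; with the values from this config they expand roughly as follows:

```python
# Values taken from the config above.
fps = 10
n_obs_steps = 2
horizon = 16

# Timestamps (in seconds) relative to the current frame: negative values
# are past observations, non-negative values extend into the future.
action_ts = [i / fps for i in range(1 - n_obs_steps, 1 - n_obs_steps + horizon)]
obs_ts = [i / fps for i in range(1 - n_obs_steps, 1)]
```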
Thanks for revising this @StarCycle . My status is now "approving". I will also wait on @Cadene to approve as he has become involved.
Btw, looks like style tests are not passing. Have you seen CONTRIBUTING.md for instructions on how to set up the pre-commit hook?
@StarCycle By any chance, could you provide code to try this PR?
I feel like at least the third of the sections we advise adding is missing from the PR description:
- What the PR adds:
- How it was tested
- How to checkout & try? (for the reviewer) <--- example code
See this PR description for instance: https://github.com/huggingface/lerobot/pull/281
Thanks!
@Cadene
You are right! Here is the explanation:
What this does
- Enable the progress bar for evaluation by default, except on Slurm
- Enable asynchronous envs by default, except for the Aloha environments
- Add a README section on resuming training
- Add a tutorial section explaining the abbreviations of the metrics in the logs
- Print the full configuration at the beginning of each training run
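One simple way to implement the "except on Slurm" part (a hedged sketch; the PR may do it differently) is to check for the environment variable Slurm sets on scheduled jobs:

```python
import os

def show_progress_bar() -> bool:
    # Slurm sets SLURM_JOB_ID for jobs it schedules; carriage-return
    # progress bars only clutter the captured log files there.
    return "SLURM_JOB_ID" not in os.environ
```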
How it was tested
Not much differs from the original code; just run `python lerobot/scripts/train.py policy=diffusion env=pusht`
How to checkout and try?
Just run `python lerobot/scripts/train.py policy=diffusion env=pusht`
Just out of curiosity, does LeRobot support multi-GPU training now? (you just mentioned Slurm ^^)
@StarCycle Yes we are working on a PR using accelerate: https://github.com/huggingface/lerobot/pull/317
Nice!
Hi @Cadene,
Are there other things that I need to complete to merge this PR?
(ノ"◑ڡ◑)ノ