lerobot Provide more information to the user

remove abbreviations and use "longer names" in logs
enable progress bar for evaluation by default
enable asynchronous envs by default
add readme intro for resuming training

Aug 12 '24 15:08 StarCycle

Perhaps you can also print the config at the beginning of training/eval.

Currently you are using Hydra, and different config files may override each other. A user may not remember a setting in another config file, or may not know that their config is overridden by another config file.

Simply printing the full config at the beginning of a training/eval run can solve this problem, as mmengine does.
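
A minimal sketch of the idea, assuming the merged config is available as a plain nested dict. The helper names `format_config` and `log_config` are hypothetical and not part of LeRobot. With Hydra, the `DictConfig` would first need converting, for example with `OmegaConf.to_container(cfg)`.

```python
import logging
from pprint import pformat


def format_config(cfg: dict) -> str:
    """Return a readable dump of a nested config dict (keys sorted by pformat)."""
    return pformat(cfg)


def log_config(cfg: dict) -> None:
    """Log the whole merged config once, at the start of a run."""
    logging.info(format_config(cfg))


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Hypothetical stand-in for the merged training config.
    example_cfg = {
        "device": "cuda",
        "eval": {"batch_size": 50, "n_episodes": 50},
        "fps": 10,
    }
    log_config(example_cfg)
```

Logging the merged config rather than the individual files shows the values that actually won after all overrides were applied.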

Aug 13 '24 04:08 StarCycle

@Cadene:

Re progress bar: you are right, I will not make it an option. I still suggest enabling both progress bars (i.e., the bar for episodes and the bar for steps within an episode). Users can easily locate evaluation problems if a step takes too long or there are too many episodes.

Aug 14 '24 10:08 StarCycle

As a side note, it now logs this at the beginning of training, which is very easy to read:

INFO 2024-08-14 13:15:24       <stdin>:1 {'dataset_repo_id': 'lerobot/pusht',
 'device': 'cuda',
 'env': {'action_dim': 2,
         'episode_length': 300,
         'fps': '${fps}',
         'gym': {'obs_type': 'pixels_agent_pos',
                 'render_mode': 'rgb_array',
                 'visualization_height': 384,
                 'visualization_width': 384},
         'image_size': 96,
         'name': 'pusht',
         'state_dim': 2,
         'task': 'PushT-v0'},
 'eval': {'batch_size': 50, 'n_episodes': 50, 'use_async_envs': True},
 'fps': 10,
 'override_dataset_stats': {'action': {'max': [511.0, 511.0],
                                       'min': [12.0, 25.0]},
                            'observation.image': {'mean': [[[0.5]],
                                                           [[0.5]],
                                                           [[0.5]]],
                                                  'std': [[[0.5]],
                                                          [[0.5]],
                                                          [[0.5]]]},
                            'observation.state': {'max': [496.14618, 510.9579],
                                                  'min': [13.456424,
                                                          32.938293]}},
 'policy': {'beta_end': 0.02,
            'beta_schedule': 'squaredcos_cap_v2',
            'beta_start': 0.0001,
            'clip_sample': True,
            'clip_sample_range': 1.0,
            'crop_is_random': True,
            'crop_shape': [84, 84],
            'diffusion_step_embed_dim': 128,
            'do_mask_loss_for_padding': False,
            'down_dims': [512, 1024, 2048],
            'horizon': 16,
            'input_normalization_modes': {'observation.image': 'mean_std',
                                          'observation.state': 'min_max'},
            'input_shapes': {'observation.image': [3, 96, 96],
                             'observation.state': ['${env.state_dim}']},
            'kernel_size': 5,
            'n_action_steps': 8,
            'n_groups': 8,
            'n_obs_steps': 2,
            'name': 'diffusion',
            'noise_scheduler_type': 'DDPM',
            'num_inference_steps': None,
            'num_train_timesteps': 100,
            'output_normalization_modes': {'action': 'min_max'},
            'output_shapes': {'action': ['${env.action_dim}']},
            'prediction_type': 'epsilon',
            'pretrained_backbone_weights': None,
            'spatial_softmax_num_keypoints': 32,
            'use_film_scale_modulation': True,
            'use_group_norm': True,
            'vision_backbone': 'resnet18'},
 'resume': False,
 'seed': 100000,
 'training': {'adam_betas': [0.95, 0.999],
              'adam_eps': 1e-08,
              'adam_weight_decay': 1e-06,
              'batch_size': 64,
              'delta_timestamps': {'action': '[i / ${fps} for i in range(1 - '
                                             '${policy.n_obs_steps}, 1 - '
                                             '${policy.n_obs_steps} + '
                                             '${policy.horizon})]',
                                   'observation.image': '[i / ${fps} for i in '
                                                        'range(1 - '
                                                        '${policy.n_obs_steps}, '
                                                        '1)]',
                                   'observation.state': '[i / ${fps} for i in '
                                                        'range(1 - '
                                                        '${policy.n_obs_steps}, '
                                                        '1)]'},
              'do_online_rollout_async': False,
              'drop_n_last_frames': 7,
              'eval_freq': 100,
              'grad_clip_norm': 10,
              'image_transforms': {'brightness': {'min_max': [0.8, 1.2],
                                                  'weight': 1},
                                   'contrast': {'min_max': [0.8, 1.2],
                                                'weight': 1},
                                   'enable': False,
                                   'hue': {'min_max': [-0.05, 0.05],
                                           'weight': 1},
                                   'max_num_transforms': 3,
                                   'random_order': False,
                                   'saturation': {'min_max': [0.5, 1.5],
                                                  'weight': 1},
                                   'sharpness': {'min_max': [0.8, 1.2],
                                                 'weight': 1}},
              'log_freq': 200,
              'lr': 0.0001,
              'lr_scheduler': 'cosine',
              'lr_warmup_steps': 500,
              'num_workers': 4,
              'offline_steps': 200000,
              'online_buffer_capacity': None,
              'online_buffer_seed_size': 0,
              'online_env_seed': None,
              'online_rollout_batch_size': 1,
              'online_rollout_n_episodes': 1,
              'online_sampling_ratio': 0.5,
              'online_steps': 0,
              'online_steps_between_rollouts': 1,
              'save_checkpoint': True,
              'save_freq': 100},
 'use_amp': False,
 'video_backend': 'pyav',
 'wandb': {'disable_artifact': False,
           'enable': False,
           'notes': '',
           'project': 'lerobot'}}

Aug 14 '24 13:08 StarCycle

Thanks for revising this, @StarCycle. My status is now "approving". I will also wait on @Cadene to approve as he has become involved.

Btw, looks like style tests are not passing. Have you seen CONTRIBUTING.md for instructions on how to set up the pre-commit hook?

Aug 14 '24 13:08 alexander-soare

@StarCycle By any chance, could you provide code to try this PR?

I feel like at least the third section is missing from the PR description among the sections we advise to add:

What the PR adds:
How it was tested
How to checkout & try? (for the reviewer) <--- example code

See this PR description for instance: https://github.com/huggingface/lerobot/pull/281

Thanks!

Aug 15 '24 18:08 Cadene

@Cadene

You are right! I explain it here:

What this does

Enable the progress bar for evaluation by default, except in slurm.
Enable asynchronous envs by default, except for aloha environments.
Add a readme intro for resuming training.
Add a tutorial intro explaining the abbreviations of the metrics in the log.
Print the full configuration at the beginning of a training process.
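
The two "by default, except" rules above could be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR. The function names and returned keys are hypothetical, Slurm is detected through the `SLURM_JOB_ID` environment variable that Slurm sets inside a job, and aloha environments are matched by a name prefix.

```python
import os
from typing import Mapping, Optional


def running_under_slurm(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Guess whether the process runs inside a Slurm job."""
    env = os.environ if environ is None else environ
    return "SLURM_JOB_ID" in env


def default_eval_settings(
    env_name: str, environ: Optional[Mapping[str, str]] = None
) -> dict:
    """Pick evaluation defaults (hypothetical keys, for illustration only)."""
    return {
        # Progress bars clutter batch-job log files, so turn them off under Slurm.
        "show_progress_bar": not running_under_slurm(environ),
        # Assumption: aloha environments are identified by their name prefix.
        "use_async_envs": not env_name.startswith("aloha"),
    }
```

For example, `default_eval_settings("pusht")` on a workstation enables both settings, and the progress-bar flag could then be passed on as something like `tqdm(..., disable=not show_progress_bar)`.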

How it was tested?

Not much differs from the original code; just run python lerobot/scripts/train.py policy=diffusion env=pusht

How to checkout and try?

Just run python lerobot/scripts/train.py policy=diffusion env=pusht

Aug 16 '24 03:08 StarCycle

Just out of curiosity, does LeRobot support multi-gpu training now? (you just mentioned slurm ^^)

Aug 17 '24 07:08 StarCycle

@StarCycle Yes we are working on a PR using accelerate: https://github.com/huggingface/lerobot/pull/317

Aug 17 '24 13:08 Cadene

@StarCycle Yes we are working on a PR using accelerate: #317

Nice!

Aug 17 '24 14:08 StarCycle

Hi @Cadene,

Are there other things that I need to complete to merge this PR?

(ﾉ"◑ڡ◑)ﾉ

Aug 21 '24 02:08 StarCycle