Provide more information to the user

Open StarCycle opened this issue 1 year ago • 10 comments

  1. remove abbreviations and use "longer names" in logs
  2. enable progress bar for evaluation by default
  3. enable asynchronous envs by default
  4. add readme intro for resuming training

StarCycle avatar Aug 12 '24 15:08 StarCycle

Perhaps you can also print the config at the beginning of training/eval.

Currently you are using Hydra, and different config files may override each other. A user may not remember a setting in another config file, or may not know that their config is overridden by another config file.

Simply printing all configs at the beginning of a training/eval process can solve this problem, like what they did for mmengine.
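To illustrate the idea, here is a minimal stdlib-only sketch. It uses plain dicts as a stand-in for the Hydra-composed config; `merge_configs`, `log_config`, and the example values are hypothetical and are not LeRobot's or Hydra's actual API. It shows one config silently overriding another, and how logging the final merged config once at startup makes the effective values visible.

```python
import logging
from pprint import pformat

logging.basicConfig(level=logging.INFO)


def merge_configs(base: dict, override: dict) -> dict:
    """Recursively merge `override` into a copy of `base`; `override` wins."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = value
    return merged


def log_config(cfg: dict) -> str:
    """Pretty-print the final config once, at the start of a run."""
    text = pformat(cfg)
    logging.info(text)
    return text


# Hypothetical config files: the second one overrides eval.batch_size.
default_cfg = {"device": "cuda", "eval": {"batch_size": 50, "n_episodes": 50}}
policy_cfg = {"eval": {"batch_size": 10}}

final_cfg = merge_configs(default_cfg, policy_cfg)
printed = log_config(final_cfg)
```

With a real Hydra config, the same effect would presumably come from converting the composed config to a plain container before pretty-printing it.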

StarCycle avatar Aug 13 '24 04:08 StarCycle

@Cadene:

Re progress bar: you are right, I will not make it an option. I still suggest enabling both progress bars (i.e., the bar for episodes and the bar for steps within an episode). Users can then easily spot evaluation problems, such as a step taking too long or there being too many episodes.

StarCycle avatar Aug 14 '24 10:08 StarCycle

As a side note, it now logs this at the beginning of training, which is very easy to read:

INFO 2024-08-14 13:15:24       <stdin>:1 {'dataset_repo_id': 'lerobot/pusht',
 'device': 'cuda',
 'env': {'action_dim': 2,
         'episode_length': 300,
         'fps': '${fps}',
         'gym': {'obs_type': 'pixels_agent_pos',
                 'render_mode': 'rgb_array',
                 'visualization_height': 384,
                 'visualization_width': 384},
         'image_size': 96,
         'name': 'pusht',
         'state_dim': 2,
         'task': 'PushT-v0'},
 'eval': {'batch_size': 50, 'n_episodes': 50, 'use_async_envs': True},
 'fps': 10,
 'override_dataset_stats': {'action': {'max': [511.0, 511.0],
                                       'min': [12.0, 25.0]},
                            'observation.image': {'mean': [[[0.5]],
                                                           [[0.5]],
                                                           [[0.5]]],
                                                  'std': [[[0.5]],
                                                          [[0.5]],
                                                          [[0.5]]]},
                            'observation.state': {'max': [496.14618, 510.9579],
                                                  'min': [13.456424,
                                                          32.938293]}},
 'policy': {'beta_end': 0.02,
            'beta_schedule': 'squaredcos_cap_v2',
            'beta_start': 0.0001,
            'clip_sample': True,
            'clip_sample_range': 1.0,
            'crop_is_random': True,
            'crop_shape': [84, 84],
            'diffusion_step_embed_dim': 128,
            'do_mask_loss_for_padding': False,
            'down_dims': [512, 1024, 2048],
            'horizon': 16,
            'input_normalization_modes': {'observation.image': 'mean_std',
                                          'observation.state': 'min_max'},
            'input_shapes': {'observation.image': [3, 96, 96],
                             'observation.state': ['${env.state_dim}']},
            'kernel_size': 5,
            'n_action_steps': 8,
            'n_groups': 8,
            'n_obs_steps': 2,
            'name': 'diffusion',
            'noise_scheduler_type': 'DDPM',
            'num_inference_steps': None,
            'num_train_timesteps': 100,
            'output_normalization_modes': {'action': 'min_max'},
            'output_shapes': {'action': ['${env.action_dim}']},
            'prediction_type': 'epsilon',
            'pretrained_backbone_weights': None,
            'spatial_softmax_num_keypoints': 32,
            'use_film_scale_modulation': True,
            'use_group_norm': True,
            'vision_backbone': 'resnet18'},
 'resume': False,
 'seed': 100000,
 'training': {'adam_betas': [0.95, 0.999],
              'adam_eps': 1e-08,
              'adam_weight_decay': 1e-06,
              'batch_size': 64,
              'delta_timestamps': {'action': '[i / ${fps} for i in range(1 - '
                                             '${policy.n_obs_steps}, 1 - '
                                             '${policy.n_obs_steps} + '
                                             '${policy.horizon})]',
                                   'observation.image': '[i / ${fps} for i in '
                                                        'range(1 - '
                                                        '${policy.n_obs_steps}, '
                                                        '1)]',
                                   'observation.state': '[i / ${fps} for i in '
                                                        'range(1 - '
                                                        '${policy.n_obs_steps}, '
                                                        '1)]'},
              'do_online_rollout_async': False,
              'drop_n_last_frames': 7,
              'eval_freq': 100,
              'grad_clip_norm': 10,
              'image_transforms': {'brightness': {'min_max': [0.8, 1.2],
                                                  'weight': 1},
                                   'contrast': {'min_max': [0.8, 1.2],
                                                'weight': 1},
                                   'enable': False,
                                   'hue': {'min_max': [-0.05, 0.05],
                                           'weight': 1},
                                   'max_num_transforms': 3,
                                   'random_order': False,
                                   'saturation': {'min_max': [0.5, 1.5],
                                                  'weight': 1},
                                   'sharpness': {'min_max': [0.8, 1.2],
                                                 'weight': 1}},
              'log_freq': 200,
              'lr': 0.0001,
              'lr_scheduler': 'cosine',
              'lr_warmup_steps': 500,
              'num_workers': 4,
              'offline_steps': 200000,
              'online_buffer_capacity': None,
              'online_buffer_seed_size': 0,
              'online_env_seed': None,
              'online_rollout_batch_size': 1,
              'online_rollout_n_episodes': 1,
              'online_sampling_ratio': 0.5,
              'online_steps': 0,
              'online_steps_between_rollouts': 1,
              'save_checkpoint': True,
              'save_freq': 100},
 'use_amp': False,
 'video_backend': 'pyav',
 'wandb': {'disable_artifact': False,
           'enable': False,
           'notes': '',
           'project': 'lerobot'}}

StarCycle avatar Aug 14 '24 13:08 StarCycle

Thanks for revising this @StarCycle . My status is now "approving". I will also wait on @Cadene to approve as he has become involved.

Btw, looks like style tests are not passing. Have you seen CONTRIBUTING.md for instructions on how to set up the pre-commit hook?

alexander-soare avatar Aug 14 '24 13:08 alexander-soare

@StarCycle By any chance, could you provide code to try this PR?

I feel like at least the third section is missing from the PR description among the sections we advise to add:

  1. What the PR adds:
  2. How it was tested
  3. How to checkout & try? (for the reviewer) <--- example code

See this PR description for instance: https://github.com/huggingface/lerobot/pull/281

Thanks!

Cadene avatar Aug 15 '24 18:08 Cadene

@Cadene

You are right! I explain it here:

What this does

  1. Enable the progress bar for evaluation by default, except in slurm.
  2. Enable asynchronous envs by default, except for aloha environments.
  3. Add a readme intro for resuming training.
  4. Add a tutorial intro explaining the abbreviations of the metrics in the log.
  5. Print the full configuration at the beginning of a training process.
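
The first two defaults could be sketched roughly as follows. This is only an illustration under assumptions: `inside_slurm` and `eval_defaults` are hypothetical names, and checking the `SLURM_JOB_ID` environment variable (which Slurm sets for running jobs) is one plausible way to detect slurm, not necessarily what the PR does.

```python
import os
from typing import Mapping, Optional


def inside_slurm(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Assumed check: Slurm exports SLURM_JOB_ID inside a job."""
    env = os.environ if environ is None else environ
    return "SLURM_JOB_ID" in env


def eval_defaults(env_name: str, environ: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical helper returning the defaults described in items 1 and 2."""
    return {
        # Progress bars clutter batch-job log files, so disable them in slurm.
        "show_progress_bar": not inside_slurm(environ),
        # Async envs are on by default, except for aloha environments.
        "use_async_envs": env_name != "aloha",
    }


local_pusht = eval_defaults("pusht", environ={})
slurm_aloha = eval_defaults("aloha", environ={"SLURM_JOB_ID": "12345"})
```

Passing `environ` explicitly keeps the helper easy to test; with no argument it falls back to the real process environment.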

How it was tested?

There is not much difference from the original code; just run python lerobot/scripts/train.py policy=diffusion env=pusht

How to checkout and try?

Just run python lerobot/scripts/train.py policy=diffusion env=pusht

StarCycle avatar Aug 16 '24 03:08 StarCycle

Just out of curiosity, does LeRobot support multi-GPU training now? (you just mentioned slurm ^^)

StarCycle avatar Aug 17 '24 07:08 StarCycle

@StarCycle Yes we are working on a PR using accelerate: https://github.com/huggingface/lerobot/pull/317

Cadene avatar Aug 17 '24 13:08 Cadene

@StarCycle Yes we are working on a PR using accelerate: #317

Nice!

StarCycle avatar Aug 17 '24 14:08 StarCycle

Hi @Cadene,

Are there other things that I need to complete to merge this PR?

(ノ"◑ڡ◑)ノ

StarCycle avatar Aug 21 '24 02:08 StarCycle