
Dataloader is blazingly fast for ACT training but VERY slow for SmolVLA and Diffusion Policy training

DominiquePaul opened this issue 6 months ago · 4 comments

System Info

- `lerobot` version: 0.1.0
- Platform: Linux-6.8.0-1032-gcp-x86_64-with-glibc2.31
- Python version: 3.10.18
- Huggingface_hub version: 0.33.1
- Dataset version: 3.6.0
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.7.0+cu128 (True)
- Cuda version: 12080
- Using GPU in script?: Yes, H100 GPU

Information

  • [x] One of the scripts in the examples/ folder of LeRobot
  • [x] My own task or dataset (give details below)

Reproduction

When I run the command:

python lerobot/lerobot/scripts/train.py \
    --policy.type=diffusion \
    --dataset.repo_id=dopaul/1500_chess_moves \
    --batch_size=64 \
    --steps 100000 \
    --seed=100000 \
    --log_freq 1 \
    --dataset.video_backend=pyav \
    --save_checkpoint true  \
    --save_freq 20_000 \
    --wandb.enable true \
    --wandb.entity 'dominique-paul' \
    --wandb.project chesso \
    --num_workers 12           

Then I get the following logs (pasting everything just in case it's useful):

INFO 2025-07-11 08:18:39 ils/utils.py:48 Cuda backend detected, using cuda.
WARNING 2025-07-11 08:18:39 /policies.py:67 Device 'None' is not available. Switching to 'cuda'.
INFO 2025-07-11 08:18:39 ts/train.py:111 {'batch_size': 64,
 'dataset': {'episodes': None,
             'image_transforms': {'enable': False,
                                  'max_num_transforms': 3,
                                  'random_order': False,
                                  'tfs': {'brightness': {'kwargs': {'brightness': [0.8,
                                                                                   1.2]},
                                                         'type': 'ColorJitter',
                                                         'weight': 1.0},
                                          'contrast': {'kwargs': {'contrast': [0.8,
                                                                               1.2]},
                                                       'type': 'ColorJitter',
                                                       'weight': 1.0},
                                          'hue': {'kwargs': {'hue': [-0.05,
                                                                     0.05]},
                                                  'type': 'ColorJitter',
                                                  'weight': 1.0},
                                          'saturation': {'kwargs': {'saturation': [0.5,
                                                                                   1.5]},
                                                         'type': 'ColorJitter',
                                                         'weight': 1.0},
                                          'sharpness': {'kwargs': {'sharpness': [0.5,
                                                                                 1.5]},
                                                        'type': 'SharpnessJitter',
                                                        'weight': 1.0}}},
             'repo_id': 'dopaul/1500_chess_moves',
             'revision': None,
             'root': None,
             'use_imagenet_stats': True,
             'video_backend': 'pyav'},
 'env': None,
 'eval': {'batch_size': 50, 'n_episodes': 50, 'use_async_envs': False},
 'eval_freq': 20000,
 'job_name': 'diffusion',
 'log_freq': 1,
 'num_workers': 8,
 'optimizer': {'betas': [0.95, 0.999],
               'eps': 1e-08,
               'grad_clip_norm': 10.0,
               'lr': 0.0001,
               'type': 'adam',
               'weight_decay': 1e-06},
 'output_dir': 'outputs/train/2025-07-11/08-18-39_diffusion',
 'policy': {'beta_end': 0.02,
            'beta_schedule': 'squaredcos_cap_v2',
            'beta_start': 0.0001,
            'clip_sample': True,
            'clip_sample_range': 1.0,
            'crop_is_random': True,
            'crop_shape': [84, 84],
            'device': 'cuda',
            'diffusion_step_embed_dim': 128,
            'do_mask_loss_for_padding': False,
            'down_dims': [512, 1024, 2048],
            'drop_n_last_frames': 7,
            'horizon': 16,
            'input_features': {},
            'kernel_size': 5,
            'n_action_steps': 8,
            'n_groups': 8,
            'n_obs_steps': 2,
            'noise_scheduler_type': 'DDPM',
            'normalization_mapping': {'ACTION': <NormalizationMode.MIN_MAX: 'MIN_MAX'>,
                                      'STATE': <NormalizationMode.MIN_MAX: 'MIN_MAX'>,
                                      'VISUAL': <NormalizationMode.MEAN_STD: 'MEAN_STD'>},
            'num_inference_steps': None,
            'num_train_timesteps': 100,
            'optimizer_betas': [0.95, 0.999],
            'optimizer_eps': 1e-08,
            'optimizer_lr': 0.0001,
            'optimizer_weight_decay': 1e-06,
            'output_features': {},
            'prediction_type': 'epsilon',
            'pretrained_backbone_weights': None,
            'scheduler_name': 'cosine',
            'scheduler_warmup_steps': 500,
            'spatial_softmax_num_keypoints': 32,
            'type': 'diffusion',
            'use_amp': False,
            'use_film_scale_modulation': True,
            'use_group_norm': True,
            'use_separate_rgb_encoder_per_camera': False,
            'vision_backbone': 'resnet18'},
 'resume': False,
 'save_checkpoint': True,
 'save_freq': 20000,
 'scheduler': {'name': 'cosine', 'num_warmup_steps': 500, 'type': 'diffuser'},
 'seed': 100000,
 'steps': 100000,
 'use_policy_training_preset': True,
 'wandb': {'disable_artifact': False,
           'enable': True,
           'entity': 'dominique-paul',
           'mode': None,
           'notes': None,
           'project': 'chesso',
           'run_id': None}}
Logs will be synced with wandb.
INFO 2025-07-11 08:18:41 db_utils.py:103 Track this run --> https://wandb.ai/dominique-paul/chesso/runs/6m4srvsn
INFO 2025-07-11 08:18:41 ts/train.py:127 Creating dataset
Resolving data files: 100%|██████████| 1504/1504 [00:00<00:00, 380633.15it/s]
INFO 2025-07-11 08:18:50 ts/train.py:138 Creating policy
INFO 2025-07-11 08:18:52 ts/train.py:144 Creating optimizer and scheduler
INFO 2025-07-11 08:18:52 ts/train.py:156 Output dir: outputs/train/2025-07-11/08-18-39_diffusion
INFO 2025-07-11 08:18:52 ts/train.py:159 cfg.steps=100000 (100K)
INFO 2025-07-11 08:18:52 ts/train.py:160 dataset.num_frames=558659 (559K)
INFO 2025-07-11 08:18:52 ts/train.py:161 dataset.num_episodes=1504
INFO 2025-07-11 08:18:52 ts/train.py:162 num_learnable_params=266622758 (267M)
INFO 2025-07-11 08:18:52 ts/train.py:163 num_total_params=266622806 (267M)
INFO 2025-07-11 08:18:52 ts/train.py:202 Start offline training on a fixed dataset
/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torchvision/io/_video_deprecation_warning.py:5: UserWarning: The video decoding and encoding capabilities of torchvision are deprecated from version 0.22 and will be removed in version 0.24. We recommend that you migrate to TorchCodec, where we'll consolidate the future decoding/encoding capabilities of PyTorch: https://github.com/pytorch/torchcodec
  warnings.warn(
/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torchvision/io/_video_deprecation_warning.py:5: UserWarning: The video decoding and encoding capabilities of torchvision are deprecated from version 0.22 and will be removed in version 0.24. We recommend that you migrate to TorchCodec, where we'll consolidate the future decoding/encoding capabilities of PyTorch: https://github.com/pytorch/torchcodec
  warnings.warn(
/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torchvision/io/_video_deprecation_warning.py:5: UserWarning: The video decoding and encoding capabilities of torchvision are deprecated from version 0.22 and will be removed in version 0.24. We recommend that you migrate to TorchCodec, where we'll consolidate the future decoding/encoding capabilities of PyTorch: https://github.com/pytorch/torchcodec
  warnings.warn(
/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torchvision/io/_video_deprecation_warning.py:5: UserWarning: The video decoding and encoding capabilities of torchvision are deprecated from version 0.22 and will be removed in version 0.24. We recommend that you migrate to TorchCodec, where we'll consolidate the future decoding/encoding capabilities of PyTorch: https://github.com/pytorch/torchcodec
  warnings.warn(
/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torchvision/io/_video_deprecation_warning.py:5: UserWarning: The video decoding and encoding capabilities of torchvision are deprecated from version 0.22 and will be removed in version 0.24. We recommend that you migrate to TorchCodec, where we'll consolidate the future decoding/encoding capabilities of PyTorch: https://github.com/pytorch/torchcodec
  warnings.warn(
/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torchvision/io/_video_deprecation_warning.py:5: UserWarning: The video decoding and encoding capabilities of torchvision are deprecated from version 0.22 and will be removed in version 0.24. We recommend that you migrate to TorchCodec, where we'll consolidate the future decoding/encoding capabilities of PyTorch: https://github.com/pytorch/torchcodec
  warnings.warn(
/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torchvision/io/_video_deprecation_warning.py:5: UserWarning: The video decoding and encoding capabilities of torchvision are deprecated from version 0.22 and will be removed in version 0.24. We recommend that you migrate to TorchCodec, where we'll consolidate the future decoding/encoding capabilities of PyTorch: https://github.com/pytorch/torchcodec
  warnings.warn(
/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torchvision/io/_video_deprecation_warning.py:5: UserWarning: The video decoding and encoding capabilities of torchvision are deprecated from version 0.22 and will be removed in version 0.24. We recommend that you migrate to TorchCodec, where we'll consolidate the future decoding/encoding capabilities of PyTorch: https://github.com/pytorch/torchcodec
  warnings.warn(
INFO 2025-07-11 08:19:57 ts/train.py:232 step:1 smpl:64 ep:0 epch:0.00 loss:1.175 grdn:9.978 lr:2.0e-07 updt_s:1.693 data_s:62.045
INFO 2025-07-11 08:19:58 ts/train.py:232 step:2 smpl:128 ep:0 epch:0.00 loss:1.154 grdn:10.317 lr:4.0e-07 updt_s:0.538 data_s:0.001
INFO 2025-07-11 08:19:58 ts/train.py:232 step:3 smpl:192 ep:1 epch:0.00 loss:1.208 grdn:9.678 lr:6.0e-07 updt_s:0.302 data_s:0.001
INFO 2025-07-11 08:19:58 ts/train.py:232 step:4 smpl:256 ep:1 epch:0.00 loss:1.162 grdn:8.904 lr:8.0e-07 updt_s:0.290 data_s:0.001
INFO 2025-07-11 08:19:58 ts/train.py:232 step:5 smpl:320 ep:1 epch:0.00 loss:1.142 grdn:8.933 lr:1.0e-06 updt_s:0.285 data_s:0.001
INFO 2025-07-11 08:19:59 ts/train.py:232 step:6 smpl:384 ep:1 epch:0.00 loss:1.121 grdn:8.353 lr:1.2e-06 updt_s:0.305 data_s:0.001
INFO 2025-07-11 08:19:59 ts/train.py:232 step:7 smpl:448 ep:1 epch:0.00 loss:1.080 grdn:6.481 lr:1.4e-06 updt_s:0.303 data_s:0.001
INFO 2025-07-11 08:19:59 ts/train.py:232 step:8 smpl:512 ep:1 epch:0.00 loss:1.070 grdn:5.875 lr:1.6e-06 updt_s:0.301 data_s:0.001
INFO 2025-07-11 08:20:13 ts/train.py:232 step:9 smpl:576 ep:2 epch:0.00 loss:1.051 grdn:4.482 lr:1.8e-06 updt_s:0.302 data_s:13.330
INFO 2025-07-11 08:20:13 ts/train.py:232 step:10 smpl:640 ep:2 epch:0.00 loss:1.073 grdn:4.051 lr:2.0e-06 updt_s:0.296 data_s:0.001
INFO 2025-07-11 08:20:14 ts/train.py:232 step:11 smpl:704 ep:2 epch:0.00 loss:1.061 grdn:3.277 lr:2.2e-06 updt_s:0.286 data_s:0.000
INFO 2025-07-11 08:20:14 ts/train.py:232 step:12 smpl:768 ep:2 epch:0.00 loss:1.060 grdn:4.332 lr:2.4e-06 updt_s:0.300 data_s:0.001
INFO 2025-07-11 08:20:14 ts/train.py:232 step:13 smpl:832 ep:2 epch:0.00 loss:1.032 grdn:4.821 lr:2.6e-06 updt_s:0.301 data_s:0.001
INFO 2025-07-11 08:20:15 ts/train.py:232 step:14 smpl:896 ep:2 epch:0.00 loss:1.058 grdn:5.416 lr:2.8e-06 updt_s:0.287 data_s:0.001
INFO 2025-07-11 08:20:15 ts/train.py:232 step:15 smpl:960 ep:3 epch:0.00 loss:1.046 grdn:5.616 lr:3.0e-06 updt_s:0.286 data_s:0.001
INFO 2025-07-11 08:20:15 ts/train.py:232 step:16 smpl:1K ep:3 epch:0.00 loss:1.037 grdn:5.682 lr:3.2e-06 updt_s:0.301 data_s:0.001
INFO 2025-07-11 08:20:16 ts/train.py:232 step:17 smpl:1K ep:3 epch:0.00 loss:1.090 grdn:6.226 lr:3.4e-06 updt_s:0.295 data_s:0.595
INFO 2025-07-11 08:20:20 ts/train.py:232 step:18 smpl:1K ep:3 epch:0.00 loss:1.045 grdn:5.922 lr:3.6e-06 updt_s:0.296 data_s:3.718
INFO 2025-07-11 08:20:22 ts/train.py:232 step:19 smpl:1K ep:3 epch:0.00 loss:1.046 grdn:5.628 lr:3.8e-06 updt_s:0.309 data_s:2.049
INFO 2025-07-11 08:20:25 ts/train.py:232 step:20 smpl:1K ep:3 epch:0.00 loss:1.083 grdn:5.709 lr:4.0e-06 updt_s:0.300 data_s:2.009
INFO 2025-07-11 08:20:27 ts/train.py:232 step:21 smpl:1K ep:4 epch:0.00 loss:1.054 grdn:5.155 lr:4.2e-06 updt_s:0.297 data_s:2.076
INFO 2025-07-11 08:20:32 ts/train.py:232 step:22 smpl:1K ep:4 epch:0.00 loss:1.021 grdn:4.681 lr:4.4e-06 updt_s:0.299 data_s:4.280
INFO 2025-07-11 08:20:32 ts/train.py:232 step:23 smpl:1K ep:4 epch:0.00 loss:1.041 grdn:4.789 lr:4.6e-06 updt_s:0.289 data_s:0.001
INFO 2025-07-11 08:20:34 ts/train.py:232 step:24 smpl:2K ep:4 epch:0.00 loss:1.004 grdn:4.049 lr:4.8e-06 updt_s:0.290 data_s:1.659
INFO 2025-07-11 08:20:38 ts/train.py:232 step:25 smpl:2K ep:4 epch:0.00 loss:1.008 grdn:3.758 lr:5.0e-06 updt_s:0.301 data_s:3.332
INFO 2025-07-11 08:20:42 ts/train.py:232 step:26 smpl:2K ep:4 epch:0.00 loss:1.010 grdn:2.539 lr:5.2e-06 updt_s:0.298 data_s:3.986
INFO 2025-07-11 08:20:44 ts/train.py:232 step:27 smpl:2K ep:5 epch:0.00 loss:1.011 grdn:2.537 lr:5.4e-06 updt_s:0.297 data_s:2.061
INFO 2025-07-11 08:20:49 ts/train.py:232 step:28 smpl:2K ep:5 epch:0.00 loss:0.983 grdn:2.579 lr:5.6e-06 updt_s:0.293 data_s:4.441
INFO 2025-07-11 08:20:49 ts/train.py:232 step:29 smpl:2K ep:5 epch:0.00 loss:0.993 grdn:3.506 lr:5.8e-06 updt_s:0.309 data_s:0.001
INFO 2025-07-11 08:20:56 ts/train.py:232 step:30 smpl:2K ep:5 epch:0.00 loss:0.988 grdn:3.431 lr:6.0e-06 updt_s:0.292 data_s:6.278
INFO 2025-07-11 08:20:56 ts/train.py:232 step:31 smpl:2K ep:5 epch:0.00 loss:1.016 grdn:3.568 lr:6.2e-06 updt_s:0.302 data_s:0.000
INFO 2025-07-11 08:20:56 ts/train.py:232 step:32 smpl:2K ep:6 epch:0.00 loss:1.001 grdn:3.718 lr:6.4e-06 updt_s:0.311 data_s:0.000
INFO 2025-07-11 08:20:59 ts/train.py:232 step:33 smpl:2K ep:6 epch:0.00 loss:0.994 grdn:3.936 lr:6.6e-06 updt_s:0.307 data_s:2.518
INFO 2025-07-11 08:21:04 ts/train.py:232 step:34 smpl:2K ep:6 epch:0.00 loss:0.955 grdn:3.822 lr:6.8e-06 updt_s:0.307 data_s:4.174
INFO 2025-07-11 08:21:08 ts/train.py:232 step:35 smpl:2K ep:6 epch:0.00 loss:0.983 grdn:3.714 lr:7.0e-06 updt_s:0.297 data_s:4.387
INFO 2025-07-11 08:21:11 ts/train.py:232 step:36 smpl:2K ep:6 epch:0.00 loss:0.995 grdn:3.221 lr:7.2e-06 updt_s:0.307 data_s:2.052

The update step takes 0.3 seconds (I'm running this on an H100 GPU), but the data fetching step takes up to 6 seconds in some cases! Kudos to the person who added these print statements btw! Training for 100 steps takes 3 minutes 40 seconds. Hence, training for 100_000 steps would take 2.5 days. This is very long!

When I reduce the batch size from 64 to 8, training on the equivalent number of samples takes 4 minutes and 20 seconds.
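Back-of-the-envelope from those numbers: 100 steps at batch size 64 is 6,400 samples in ~220 s, i.e. roughly 29 samples/s, while batch size 8 processes the same 6,400 samples in ~260 s, i.e. roughly 25 samples/s. Throughput barely moves with batch size, which is what you would expect if the dataloader, not the GPU, is the limiting factor.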

When I log less frequently (log_freq=200), data_s becomes much more constant, suggesting the spiky numbers above are partly an artifact of logging every step; but even in the log_freq=200 run, data_s is still a multiple of updt_s:

INFO 2025-07-11 09:25:41 ts/train.py:232 step:200 smpl:2K ep:4 epch:0.00 loss:0.659 grdn:6.386 lr:2.0e-05 updt_s:0.093 data_s:0.255
INFO 2025-07-11 09:26:40 ts/train.py:232 step:400 smpl:3K ep:9 epch:0.01 loss:0.148 grdn:3.543 lr:6.0e-05 updt_s:0.074 data_s:0.219
INFO 2025-07-11 09:27:40 ts/train.py:232 step:600 smpl:5K ep:13 epch:0.01 loss:0.098 grdn:2.082 lr:9.5e-05 updt_s:0.074 data_s:0.223
INFO 2025-07-11 09:28:40 ts/train.py:232 step:800 smpl:6K ep:17 epch:0.01 loss:0.076 grdn:1.578 lr:1.0e-04 updt_s:0.074 data_s:0.226
INFO 2025-07-11 09:29:42 ts/train.py:232 step:1K smpl:8K ep:22 epch:0.01 loss:0.066 grdn:1.321 lr:1.0e-04 updt_s:0.074 data_s:0.230
INFO 2025-07-11 09:30:43 ts/train.py:232 step:1K smpl:10K ep:26 epch:0.02 loss:0.066 grdn:1.241 lr:1.0e-04 updt_s:0.074 data_s:0.232
INFO 2025-07-11 09:31:45 ts/train.py:232 step:1K smpl:11K ep:30 epch:0.02 loss:0.054 grdn:1.031 lr:1.0e-04 updt_s:0.074 data_s:0.236

Training is therefore largely dataloader-constrained, which becomes a real bottleneck for more serious training runs.

Also, when I increase num_workers to 16 or more, one of the workers sometimes crashes and takes training down with it, even though the machine has 26 CPUs.
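To confirm the bottleneck sits in the data pipeline itself rather than in the training loop, one option is to iterate the dataset with a plain PyTorch DataLoader and time the batches in isolation. The snippet below is only a rough sketch, not a verified script: it assumes the LeRobotDataset import path and the video_backend argument of this lerobot version, and it does not pass the diffusion policy's delta_timestamps (the camera keys are dataset-specific), so it likely underestimates the multi-frame video decoding the policy actually triggers.

# Rough benchmark of the dataloader alone (no policy, no GPU work).
# Sketch only: LeRobotDataset signature assumed from this lerobot version.
import time

import torch
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("dopaul/1500_chess_moves", video_backend="pyav")

loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=64,           # match the training command
    shuffle=True,            # random access, like training
    num_workers=12,
    pin_memory=True,
    persistent_workers=True,
    prefetch_factor=2,
)

prev = time.perf_counter()
for i, batch in enumerate(loader):
    now = time.perf_counter()
    print(f"batch {i:3d}: {now - prev:.3f}s")  # analogous to data_s per step
    prev = now
    if i == 100:
        break

If the per-batch times here show the same multi-second spikes as data_s in the training logs, the training loop can be ruled out and the problem is purely in dataset item loading / video decoding.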

Finetuning SmolVLA is also very slow

When finetuning a SmolVLA model with this command:

python lerobot/scripts/train.py \
  --output_dir=outputs/train/smolvla_1500 \
  --dataset.repo_id=dopaul/1500_chess_moves \
  --policy.type=smolvla \
  --policy.device=cuda \
  --batch_size=64 \
  --steps=600000 \
  --log_freq=200 \
  --save_checkpoint=true \
  --save_freq=20000 \
  --wandb.enable=true \
  --wandb.entity=dominique-paul \
  --wandb.project=chesso \
  --num_workers=16 \
  --dataset.video_backend=pyav \
  --scheduler.type=cosine_decay_with_warmup \
  --scheduler.num_warmup_steps=1000 \
  --scheduler.num_decay_steps=90000 \
  --scheduler.peak_lr=0.0001 \
  --scheduler.decay_lr=2.5e-06

I get an average updt_s of 0.62 seconds and an average data_s of 0.8 seconds (numbers taken from the wandb log).
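At roughly 0.62 s + 0.8 s ≈ 1.4 s per step, the configured 600,000 steps comes out to about 850,000 s of wall-clock time, i.e. close to 10 days, with more than half of every step spent waiting on data.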

INFO 2025-07-05 18:07:36 ts/train.py:232 step:97K smpl:6M ep:17K epch:33.78 loss:0.021 grdn:0.156 lr:2.5e-06 updt_s:0.580 data_s:0.699
WARNING 2025-07-05 18:07:36 db_utils.py:141 WandB logging of key "losses_after_forward" was ignored as its type "<class 'torch.Tensor'>" is not handled by this wrapper.
WARNING 2025-07-05 18:07:36 db_utils.py:141 WandB logging of key "losses_after_rm_padding" was ignored as its type "<class 'torch.Tensor'>" is not handled by this wrapper.
INFO 2025-07-05 18:11:52 ts/train.py:232 step:98K smpl:6M ep:17K epch:33.85 loss:0.022 grdn:0.162 lr:2.5e-06 updt_s:0.580 data_s:0.698
WARNING 2025-07-05 18:11:52 db_utils.py:141 WandB logging of key "losses_after_forward" was ignored as its type "<class 'torch.Tensor'>" is not handled by this wrapper.
WARNING 2025-07-05 18:11:52 db_utils.py:141 WandB logging of key "losses_after_rm_padding" was ignored as its type "<class 'torch.Tensor'>" is not handled by this wrapper.
INFO 2025-07-05 18:16:09 ts/train.py:232 step:98K smpl:6M ep:17K epch:33.92 loss:0.021 grdn:0.148 lr:2.5e-06 updt_s:0.577 data_s:0.709
WARNING 2025-07-05 18:16:09 db_utils.py:141 WandB logging of key "losses_after_forward" was ignored as its type "<class 'torch.Tensor'>" is not handled by this wrapper.
WARNING 2025-07-05 18:16:09 db_utils.py:141 WandB logging of key "losses_after_rm_padding" was ignored as its type "<class 'torch.Tensor'>" is not handled by this wrapper.
INFO 2025-07-05 18:20:25 ts/train.py:232 step:98K smpl:6M ep:17K epch:33.99 loss:0.021 grdn:0.153 lr:2.5e-06 updt_s:0.579 data_s:0.698

But everything is fine when training an ACT model

I looked at my ACT training run, and there the data loading happens blazingly fast!

This is the training command:

python lerobot/scripts/train.py \
  --policy.type=act \
  --dataset.repo_id=dopaul/1500_chess_moves \
  --batch_size=8 \
  --steps=200000 \
  --eval_freq=20000 \
  --log_freq=200 \
  --dataset.video_backend=pyav \
  --save_checkpoint=true \
  --save_freq=20000 \
  --wandb.enable=true \
  --wandb.entity=dominique-paul \
  --wandb.project=chesso \
  --output_dir=outputs/train/1500_chess_moves_act

which produces these logs:


Logs will be synced with wandb.
INFO 2025-07-10 19:13:43 db_utils.py:103 Track this run --> https://wandb.ai/dominique-paul/chesso/runs/m1pfyojr
INFO 2025-07-10 19:13:43 ts/train.py:127 Creating dataset
Downloading data: 100%|██████████| 1504/1504 [00:00<00:00, 43271.69files/s]
Generating train split: 558659 examples [00:02, 190439.19 examples/s]
INFO 2025-07-10 19:13:59 ts/train.py:138 Creating policy
INFO 2025-07-10 19:14:00 ts/train.py:144 Creating optimizer and scheduler
INFO 2025-07-10 19:14:00 ts/train.py:156 Output dir: outputs/train/1500_chess_moves_act
INFO 2025-07-10 19:14:00 ts/train.py:159 cfg.steps=200000 (200K)
INFO 2025-07-10 19:14:00 ts/train.py:160 dataset.num_frames=558659 (559K)
INFO 2025-07-10 19:14:00 ts/train.py:161 dataset.num_episodes=1504
INFO 2025-07-10 19:14:00 ts/train.py:162 num_learnable_params=51597190 (52M)
INFO 2025-07-10 19:14:00 ts/train.py:163 num_total_params=51597238 (52M)
INFO 2025-07-10 19:14:00 ts/train.py:202 Start offline training on a fixed dataset
INFO 2025-07-10 19:15:08 ts/train.py:232 step:200 smpl:2K ep:4 epch:0.00 loss:6.925 grdn:152.789 lr:1.0e-05 updt_s:0.326 data_s:0.012
INFO 2025-07-10 19:16:10 ts/train.py:232 step:400 smpl:3K ep:9 epch:0.01 loss:3.170 grdn:84.457 lr:1.0e-05 updt_s:0.308 data_s:0.000
INFO 2025-07-10 19:17:12 ts/train.py:232 step:600 smpl:5K ep:13 epch:0.01 loss:2.743 grdn:74.084 lr:1.0e-05 updt_s:0.309 data_s:0.000
INFO 2025-07-10 19:18:14 ts/train.py:232 step:800 smpl:6K ep:17 epch:0.01 loss:2.440 grdn:69.514 lr:1.0e-05 updt_s:0.307 data_s:0.000
INFO 2025-07-10 19:19:16 ts/train.py:232 step:1K smpl:8K ep:22 epch:0.01 loss:2.150 grdn:64.544 lr:1.0e-05 updt_s:0.307 data_s:0.000
INFO 2025-07-10 19:20:18 ts/train.py:232 step:1K smpl:10K ep:26 epch:0.02 loss:1.958 grdn:62.236 lr:1.0e-05 updt_s:0.309 data_s:0.000
INFO 2025-07-10 19:21:19 ts/train.py:232 step:1K smpl:11K ep:30 epch:0.02 loss:1.785 grdn:58.473 lr:1.0e-05 updt_s:0.308 data_s:0.000
INFO 2025-07-10 19:22:21 ts/train.py:232 step:2K smpl:13K ep:34 epch:0.02 loss:1.644 grdn:56.324 lr:1.0e-05 updt_s:0.309 data_s:0.000
INFO 2025-07-10 19:23:23 ts/train.py:232 step:2K smpl:14K ep:39 epch:0.03 loss:1.493 grdn:53.052 lr:1.0e-05 updt_s:0.307 data_s:0.000
INFO 2025-07-10 19:24:25 ts/train.py:232 step:2K smpl:16K ep:43 epch:0.03 loss:1.378 grdn:51.073 lr:1.0e-05 updt_s:0.308 data_s:0.000
INFO 2025-07-10 19:25:27 ts/train.py:232 step:2K smpl:18K ep:47 epch:0.03 loss:1.266 grdn:49.097 lr:1.0e-05 updt_s:0.309 data_s:0.000
INFO 2025-07-10 19:26:29 ts/train.py:232 step:2K smpl:19K ep:52 epch:0.03 loss:1.168 grdn:46.597 lr:1.0e-05 updt_s:0.307 data_s:0.000
INFO 2025-07-10 19:27:30 ts/train.py:232 step:3K smpl:21K ep:56 epch:0.04 loss:1.071 grdn:44.567 lr:1.0e-05 updt_s:0.307 data_s:0.000
INFO 2025-07-10 19:28:32 ts/train.py:232 step:3K smpl:22K ep:60 epch:0.04 loss:0.982 grdn:42.166 lr:1.0e-05 updt_s:0.308 data_s:0.000
INFO 2025-07-10 19:29:34 ts/train.py:232 step:3K smpl:24K ep:65 epch:0.04 loss:0.902 grdn:39.804 lr:1.0e-05 updt_s:0.307 data_s:0.000

Expected behavior

Ideally, the data fetching part would be faster than the network update, so that training is compute-bound rather than dataloader-bound. If somebody knows what the reason might be, I'd be happy to take a look at a fix!
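As a starting point for anyone digging in, it may help to time raw __getitem__ calls, without any DataLoader or worker processes, to see how long producing a single sample (including video decoding) actually takes. This is an unverified sketch that assumes the LeRobotDataset API of this lerobot version; passing the delta_timestamps the diffusion policy uses (as in the dataset-loading example under examples/) would reproduce its multi-frame decoding, but those keys are dataset-specific, so they are omitted here.

# Time single-item access to isolate per-sample cost (decoding + transforms)
# from the DataLoader machinery. Sketch only; API assumed, not verified.
import random
import time

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("dopaul/1500_chess_moves", video_backend="pyav")

random.seed(0)
indices = [random.randrange(len(dataset)) for _ in range(50)]

times = []
for idx in indices:
    start = time.perf_counter()
    _ = dataset[idx]  # decodes the video frame(s) for this index
    times.append(time.perf_counter() - start)

print(f"mean: {sum(times) / len(times):.3f}s  max: {max(times):.3f}s")

Multiplying the mean per-item time by the batch size gives the total decoding work per step; if that product is much larger than num_workers × updt_s, prefetching cannot hide it and the GPU will stall.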

DominiquePaul avatar Jul 11 '25 08:07 DominiquePaul

Hello, may I ask if there is any solution or workaround available now for the slow data loading issue? I’m experiencing the same issue.

HuBocheng avatar Sep 13 '25 05:09 HuBocheng

I am also experiencing the same issue. 😵

During Diffusion Policy training, progress pauses every num_workers steps, and GPU utilization drops to 0% during the pause. iotop shows each worker thread reading at about 20 MB/s; once the reads complete, the disk read rate drops sharply and training resumes. This cycle repeats, making training very slow.

In contrast, this issue does not occur when training ACT: from what I can see, disk reads and training proceed almost continuously, and training is much faster.

Larryi avatar Oct 05 '25 05:10 Larryi

Hi, I am facing the same issue! Training a diffusion policy takes 36-48 hours for 80-100k iterations on a single A100. I see the same trend in data_s as @DominiquePaul; any idea how to fix this?

prachigarg23 avatar Nov 07 '25 00:11 prachigarg23

waiting

millioniron avatar Nov 22 '25 13:11 millioniron

The same

hairuoliu1 avatar Dec 01 '25 04:12 hairuoliu1

The same for training pi05 with 500 demos (300k frames). GPU utilization is fluctuating.


exaFLOPs26 avatar Dec 10 '25 18:12 exaFLOPs26

I have the same problem.

sotanakamura avatar Dec 15 '25 10:12 sotanakamura