
Distributed v2.1 -> v3.0 conversion

fracapuano opened this issue 3 months ago

We recently introduced a new dataset format, LeRobotDataset-v3. The format is built for scale and supports a new feature we're quite excited about: streaming, which lets users process data on the fly without storing it on disk (storing large-scale datasets, on the order of terabytes, locally is prohibitive).
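For context, streaming access looks roughly like the sketch below; the class name StreamingLeRobotDataset and its import path are assumptions here, so check them against the current lerobot API before relying on them.

# Rough sketch of streaming access; the exact class name and import path may
# differ, so verify against the installed lerobot version.
from lerobot.datasets.streaming_dataset import StreamingLeRobotDataset

dataset = StreamingLeRobotDataset("lerobot/svla_so101_pickplace")

# Frames are fetched on the fly from the Hub instead of being downloaded to disk.
for i, frame in enumerate(dataset):
    print(i, list(frame.keys()))
    if i == 2:
        break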

We have also released a porting script, which we have used to port many datasets from the old 2.1 format to the more modern 3.0. However, the conversion script is not built for large-scale datasets and performs the conversion sequentially.

We need to modify it to run in a distributed way: spawn multiple workers that each aggregate a sub-portion of the data, then pool all the aggregated shards at the end. A good starting point would be to take any dataset currently on the Hub in v3.0, such as lerobot/svla_so101_pickplace, access it in v2.1 (just pass revision="v2.1" when instantiating it with LeRobotDataset), and experiment with a distributed conversion script at small scale. The result could then be tested (possibly frame by frame) against the ground-truth v3.0 dataset, which makes validation easier.
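To make the small-scale experiment concrete, here is a minimal validation sketch along those lines. It relies on the revision="v2.1" argument mentioned above; it also assumes that indexing a LeRobotDataset returns a dict of tensors and that a small tolerance is acceptable, since video-backed frames may not be bit-exact across formats. Adjust the import path to your installed lerobot version, and use separate local roots if the two revisions collide in the cache.

# Minimal sketch: spot-check the v2.1 dataset against the ground-truth v3.0 version.
import torch
from lerobot.datasets.lerobot_dataset import LeRobotDataset  # adjust to your lerobot version

repo_id = "lerobot/svla_so101_pickplace"
ds_v21 = LeRobotDataset(repo_id, revision="v2.1")
ds_v30 = LeRobotDataset(repo_id)  # default revision is already v3.0

assert len(ds_v21) == len(ds_v30), "frame counts differ between revisions"

step = max(1, len(ds_v30) // 10)  # spot-check ~10 evenly spaced frames
for idx in range(0, len(ds_v30), step):
    frame_v21, frame_v30 = ds_v21[idx], ds_v30[idx]
    for key, value in frame_v30.items():
        if isinstance(value, torch.Tensor) and key in frame_v21:
            assert torch.allclose(frame_v21[key].float(), value.float(), atol=1e-3), (
                f"mismatch at frame {idx}, key {key}"
            )
print("spot-check passed")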

This would be very impactful because we currently host many large-scale datasets that would otherwise be computationally prohibitive to port! Feel free to ping @fracapuano here or on x.com/_fracapuano for any help with this :))

fracapuano · Sep 22 '25

I opened a PR adding distributed + parallel conversion for v2.1→v3.0: https://github.com/huggingface/lerobot/pull/2036

What’s in it

  • New manifest-based orchestration (--orchestrate) that splits work into batches and lets multiple workers convert in parallel while a writer assembles the final v3 layout (see the sketch after this list).
  • Preserves the exact v3 structure & size policies (data file-XXX.parquet, per-camera file-XXX.mp4, updated meta/episodes, info.json with codebase_version: "v3.0").
  • Safe local benchmarking via --no-push (no Hub mutations). Optional --work-dir to keep manifests/shards outside the cache.
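To illustrate the batching idea, here is a purely hypothetical manifest sketch; it is not the PR's actual code, and the function name and manifest fields are placeholders. Episodes are split into fixed-size batches recorded in a JSON manifest, each worker claims a pending batch, and the writer merges the per-batch shards into the final v3 layout once everything is done.

# Hypothetical manifest sketch, only to illustrate the batching idea;
# the real PR may structure this differently.
import json
from pathlib import Path

def write_manifest(num_episodes: int, episodes_per_batch: int, work_dir: Path) -> Path:
    """Split episode indices into batches and record them in a JSON manifest."""
    work_dir.mkdir(parents=True, exist_ok=True)
    batches = []
    for batch_id, start in enumerate(range(0, num_episodes, episodes_per_batch)):
        end = min(start + episodes_per_batch, num_episodes)
        batches.append({"batch_id": batch_id, "episodes": list(range(start, end)), "status": "pending"})
    manifest_path = work_dir / "manifest.json"
    manifest_path.write_text(json.dumps({"batches": batches}, indent=2))
    return manifest_path

# Workers claim a "pending" batch, convert its episodes, and mark it "done";
# the writer then concatenates the per-batch parquet/mp4 shards into the
# file-XXX.parquet / file-XXX.mp4 layout described above.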

How to try

# Sequential (baseline)
python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 \
  --repo-id lerobot/svla_so101_pickplace --no-push

# Distributed-style (manifest)
python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 \
  --repo-id lerobot/svla_so101_pickplace \
  --orchestrate --episodes-per-batch 10 --num-workers 2 --no-push

Early results (MacBook Pro M1, small dataset):

  • Parallel (in-process, --max-workers 2) is ~1.7× faster than sequential.
  • Orchestrated mode is slower on tiny datasets (overhead dominates), but it is designed to scale across large/TB-scale datasets and clusters, in line with the issue's original intent.

Would love feedback on flags/defaults and any large datasets you’d like me to stress-test.

eDeveloperOZ · Sep 25 '25

After I use the script to convert, an error appears; it looks like two values do not match.

[screenshot of the error omitted]

Temmp1e · Oct 14 '25

Hi there, and thanks! @Temmp1e Can you pull the latest changes and test again? Also, if the dataset is public, can you share it so I can run it too? Thanks!

eDeveloperOZ · Oct 14 '25

How do I downgrade from 3.0 to 2.1? This is for phosphobot with Hugging Face.

SheppCrafd · Nov 21 '25

> How do I downgrade from 3.0 to 2.1? This is for phosphobot with Hugging Face.

You can create a new script with the assistance of Claude; it is a pretty straightforward task for an AI.

Temmp1e · Nov 21 '25