Distributed v2.1 -> v3.0 conversion
We recently introduced a new dataset format, LeRobotDataset-v3. The format is built for scale and supports a new feature we're quite excited about: streaming, which lets users process data on the fly without storing it on disk (storing it locally is prohibitive for large-scale datasets, on the order of terabytes).
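For context, here is a minimal sketch of what streaming access looks like; the `StreamingLeRobotDataset` import path is an assumption and may differ between lerobot releases:

```python
# Sketch: iterate over a hub dataset without downloading it to disk first.
# The import path below is assumed and may vary across lerobot versions.
from lerobot.datasets.streaming_dataset import StreamingLeRobotDataset

ds = StreamingLeRobotDataset("lerobot/svla_so101_pickplace")
for frame in ds:  # frames are fetched and decoded on the fly, nothing cached on disk
    state = frame["observation.state"]
    break
```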
We have also released a porting script, which we have used to port many datasets from the old v2.1 format to the new v3.0 one. However, the conversion script is not built for large-scale datasets and performs the conversion sequentially.
We need to modify it so that it runs in a distributed way: spawn multiple workers, each converting a sub-portion of the data, followed by a final pooling step that merges all the aggregated shards (see the sketch below).
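A rough sketch of that structure, assuming hypothetical helpers (`convert_shard` and `merge_shards` do not exist in lerobot; they stand in for "convert a sub-portion" and "final pooling"):

```python
# Hypothetical sketch of the distributed layout described above.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def convert_shard(repo_id: str, episodes: list[int], out_dir: Path) -> Path:
    """Worker: convert one batch of v2.1 episodes into an aggregated v3.0 shard."""
    ...  # read per-episode parquet/video files, write aggregated files for this batch
    return out_dir


def merge_shards(shards: list[Path]) -> None:
    """Final pooling: stitch the shards into a single v3.0 dataset and rebuild meta/."""
    ...


def distributed_convert(repo_id: str, episodes: list[int],
                        num_workers: int = 4, batch_size: int = 100) -> None:
    batches = [episodes[i:i + batch_size] for i in range(0, len(episodes), batch_size)]
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        shards = list(pool.map(
            convert_shard,
            [repo_id] * len(batches),
            batches,
            [Path(f"shards/{i:05d}") for i in range(len(batches))],
        ))
    merge_shards(shards)
```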
A good starting point would be to take any dataset currently on the Hub in v3.0, such as lerobot/svla_so101_pickplace, access it in v2.1 (just pass the revision="v2.1" argument when instantiating it with LeRobotDataset), and start experimenting with a distributed conversion script at a small scale. The result can then be tested (possibly asserting frame by frame) against the ground-truth v3.0 dataset, which makes testing easier.
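Concretely, the small-scale check could look like the snippet below. Only the `revision="v2.1"` argument comes from this issue; the import path and the compared keys are assumptions, and the same loop works against a locally converted copy:

```python
# Compare the two revisions frame by frame as a sanity check:
# the v2.1 and v3.0 copies of the dataset should contain identical frames.
import torch
from lerobot.datasets.lerobot_dataset import LeRobotDataset  # path may vary

v21 = LeRobotDataset("lerobot/svla_so101_pickplace", revision="v2.1")
v30 = LeRobotDataset("lerobot/svla_so101_pickplace")  # latest revision, v3.0

assert len(v21) == len(v30), "frame counts should match after conversion"
for i in range(len(v30)):
    old, new = v21[i], v30[i]
    for key in ("action", "observation.state"):  # keys illustrative
        torch.testing.assert_close(old[key], new[key])
```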
This would be very impactful because we currently host many large-scale datasets that would otherwise be computationally prohibitive to port! Feel free to ping @fracapuano here or on x.com/_fracapuano for any help on this :))
I opened a PR adding distributed + parallel conversion for v2.1→v3.0: https://github.com/huggingface/lerobot/pull/2036
What’s in it
- New manifest-based orchestration (`--orchestrate`) that splits work into batches and lets multiple workers convert in parallel while a writer assembles the final v3 layout.
- Preserves the exact v3 structure & size policies (data `file-XXX.parquet`, per-camera `file-XXX.mp4`, updated `meta/episodes`, `info.json` with `codebase_version: "v3.0"`).
- Safe local benchmarking via `--no-push` (no Hub mutations). Optional `--work-dir` to keep manifests/shards outside the cache.
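To make the orchestration concrete: the manifest is essentially a shared work list that workers claim batches from. An illustrative shape (not the PR's actual schema) could be:

```python
# Illustrative only; the PR defines its own manifest schema.
manifest = {
    "repo_id": "lerobot/svla_so101_pickplace",
    "episodes_per_batch": 10,
    "batches": [
        {"id": 0, "episodes": list(range(0, 10)), "status": "done"},
        {"id": 1, "episodes": list(range(10, 20)), "status": "pending"},
    ],
}
```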
How to try
```bash
# Sequential (baseline)
python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 \
  --repo-id lerobot/svla_so101_pickplace --no-push

# Distributed-style (manifest)
python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 \
  --repo-id lerobot/svla_so101_pickplace \
  --orchestrate --episodes-per-batch 10 --num-workers 2 --no-push
```
Early results (MBP M1, small dataset):
- Parallel (in-process, `--max-workers 2`) is ~1.7× faster than sequential.
- Orchestrated mode is slower on tiny datasets (overhead dominates), but it is designed to scale to large/TB-scale datasets and clusters, in line with the issue's original intent.
Would love feedback on flags/defaults and any large datasets you’d like me to stress-test.
After I use the script to convert, an error appears. It seems that two values do not match.
Hi there, and thanks @Temmp1e! Can you try pulling now and testing again? Also, if the dataset is public, can you share it so I can run it too? Thanks!
How do I downgrade from v3.0 to v2.1 for phosphobot with Hugging Face?
You can create a new script with the assistance of Claude; it is pretty straightforward for an AI (see the partial sketch below).
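For anyone attempting this, here is a partial sketch of the data-file half of such a downgrade. It splits the aggregated v3.0 parquet files back into v2.1's per-episode files; videos and the `meta/` files would need equivalent handling, and the chunk size of 1000 is v2.1's default (`chunks_size` in `info.json`):

```python
# Partial sketch: split aggregated v3.0 parquet files back into the
# per-episode parquet files that v2.1 expects. Videos and meta/ files
# are not handled here; this only covers data/.
from pathlib import Path

import pandas as pd


def downgrade_data(v3_root: Path, v21_root: Path) -> None:
    frames = pd.concat(
        pd.read_parquet(p) for p in sorted(v3_root.glob("data/*/file-*.parquet"))
    )
    for ep_idx, ep_frames in frames.groupby("episode_index"):
        chunk = ep_idx // 1000  # v2.1 default: 1000 episodes per chunk
        out = v21_root / f"data/chunk-{chunk:03d}/episode_{ep_idx:06d}.parquet"
        out.parent.mkdir(parents=True, exist_ok=True)
        ep_frames.to_parquet(out)
```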