User/tgossin/2025 05 25 merge dataset
PR: Add merge_dataset.py utility script
This PR introduces merge_dataset.py, a standalone CLI tool to merge two LeRobot-format datasets into a single, coherent dataset while guaranteeing that all episode indices remain unique and that accompanying meta-files stay consistent.
✨ Key features
- Index-safe merge – renumbers episode indices from the second dataset so there are no collisions.
- Full meta consolidation – concatenates episodes.jsonl, episodes_stats.jsonl, tasks.jsonl, and recomputes info.json counters (total_episodes, total_frames, etc.).
- Typed config & CLI – uses draccus for a typed, self-documenting config (MergeConfig) and an intuitive command-line interface.
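The index-safe renumbering and counter recomputation described above can be illustrated with a minimal sketch. This is not the code from merge_dataset.py: the function name merge_episode_meta is hypothetical, and it assumes each episodes.jsonl record is a dict carrying an episode_index and a length (frame count) field.

```python
from typing import Any


def merge_episode_meta(
    datasets: list[list[dict[str, Any]]],
) -> tuple[list[dict[str, Any]], dict[str, int]]:
    """Renumber episodes so indices are unique, then recompute totals.

    Hypothetical helper: assumes each record has "episode_index" and
    "length" keys, which may differ from the real meta-file schema.
    """
    merged: list[dict[str, Any]] = []
    for episodes in datasets:
        # Episodes of each later dataset start right after those already merged.
        offset = len(merged)
        ordered = sorted(episodes, key=lambda ep: ep["episode_index"])
        for local_idx, ep in enumerate(ordered):
            new_ep = dict(ep)  # shallow copy so the input records stay untouched
            new_ep["episode_index"] = offset + local_idx
            merged.append(new_ep)

    # Counters are recomputed from the merged records rather than summed
    # from each dataset's info.json, so they cannot drift out of sync.
    info = {
        "total_episodes": len(merged),
        "total_frames": sum(ep["length"] for ep in merged),
    }
    return merged, info


run_a = [{"episode_index": 0, "length": 10}, {"episode_index": 1, "length": 20}]
run_b = [{"episode_index": 0, "length": 5}]
merged, info = merge_episode_meta([run_a, run_b])
print([ep["episode_index"] for ep in merged], info)
```

Because the offset is taken from the number of episodes already merged, the same loop handles any number of input datasets, not just two.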
🛠️ Typical usage
python merge_dataset.py \
--dataset1=/data/robot/run_A \
--dataset2=/data/robot/run_B \
--output_dir=/data/robot/merged
The script copies Parquet files and videos, merges all meta-files, and prints a concise success summary.
🚀 Motivation
Training or benchmarking often requires combining multiple recording runs, but naïvely concatenating folders breaks episode indexing and corrupts task statistics. This utility automates a safe, reproducible merge so researchers can focus on experimentation instead of data wrangling.
✅ Checklist
- [x] Unit-tested on synthetic datasets (100 % unique indices, intact meta totals).
- [x] Follows project style (ruff, pre-commit).
- [x] No external dependencies beyond draccus / stdlib.
- [x] Added docstring & in-file example for quick reference.
This will be extremely useful but only two datasets? This seems unnecessarily limiting. Rather, it should take an arbitrary number of datasets to merge.
Okay, I'll change it to allow an unlimited number of datasets.
@RoBartic There you go, I made it possible to merge multiple files and added a feature to delete specific episodes.
python dataset_tool_cli.py merge \
--datasets "/path/to/datasetA /path/to/datasetB" \
--output_dir /path/to/merged_dataset
python dataset_tool_cli.py delete \
--dataset_dir /path/to/dataset_to_modify \
--episode_id 32 \
--verbose
So fast! Awesome work! I confirmed with a simple test (3 one-episode datasets) that it appears to be working. Thank you!
A pleasure to contribute to this project!
Hi @landEpita, a PR adding these merging utils to any4lerobot would be warmly welcome.
Hi @landEpita, merging datasets works great. For deleting, it would be nice if I could delete multiple episodes at once.
Does the merge command of dataset_tool_cli.py support multiple chunks?
It would be nice if it supported multiple chunks...