User/tgossin/2025 05 25 merge dataset

Open landEpita opened this issue 10 months ago • 9 comments

PR: Add merge_dataset.py utility script

This PR introduces merge_dataset.py, a standalone CLI tool to merge two LeRobot-format datasets into a single, coherent dataset while guaranteeing that all episode indices remain unique and that accompanying meta-files stay consistent.


✨ Key features

  • Index-safe merge – renumbers episode indices from the second dataset so there are no collisions.
  • Full meta consolidation – concatenates episodes.jsonl, episodes_stats.jsonl, tasks.jsonl, and recomputes info.json counters (total_episodes, total_frames, etc.).
  • Typed config & CLI – uses draccus for a typed, self-documenting config (MergeConfig) and an intuitive command-line interface.
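As a rough illustration of the index-safe merge and the recomputed counters, here is a minimal sketch. It is not the PR's actual code: the function names are hypothetical, and it assumes each record in meta/episodes.jsonl carries an "episode_index" and a "length" key. Remapping task indices in tasks.jsonl and merging episodes_stats.jsonl are left out.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def merge_episode_meta(dataset_dirs, output_dir):
    """Concatenate meta/episodes.jsonl files, renumbering episodes densely.

    Hypothetical helper: assumes every record has "episode_index" and
    "length" keys. Returns the counters that info.json would need.
    """
    merged = []
    for root in dataset_dirs:
        records = read_jsonl(Path(root) / "meta" / "episodes.jsonl")
        # Keep each dataset's internal order, then assign the next free index.
        for record in sorted(records, key=lambda r: r["episode_index"]):
            merged.append({**record, "episode_index": len(merged)})

    meta_dir = Path(output_dir) / "meta"
    meta_dir.mkdir(parents=True, exist_ok=True)
    with open(meta_dir / "episodes.jsonl", "w", encoding="utf-8") as f:
        for record in merged:
            f.write(json.dumps(record) + "\n")

    return {
        "total_episodes": len(merged),
        "total_frames": sum(r["length"] for r in merged),
    }
```

Assigning `len(merged)` as the new index works for any number of input datasets and keeps the numbering gap-free even if a source dataset had missing indices.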

🛠️ Typical usage

python merge_dataset.py \
    --dataset1=/data/robot/run_A \
    --dataset2=/data/robot/run_B \
    --output_dir=/data/robot/merged

The script copies Parquet files and videos, merges all meta-files, and prints a concise success summary.
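When episodes are renumbered, each copied file also needs a destination path that matches its new index. The sketch below shows one way to compute source and destination paths. The path template and chunk size are assumptions modeled on a chunked layout (episodes grouped into fixed-size chunk folders); in practice both should be read from the dataset's info.json rather than hard-coded, and the function names are hypothetical.

```python
from pathlib import Path

# Assumed defaults for illustration; read the real values from info.json.
CHUNKS_SIZE = 1000
DATA_PATH = "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet"


def episode_data_path(root, episode_index, chunks_size=CHUNKS_SIZE):
    """Return the Parquet path for an episode under the assumed layout."""
    episode_chunk = episode_index // chunks_size  # which chunk folder it falls in
    relative = DATA_PATH.format(
        episode_chunk=episode_chunk, episode_index=episode_index
    )
    return Path(root) / relative


def plan_copy(src_root, dst_root, old_index, new_index, chunks_size=CHUNKS_SIZE):
    """Pair the source file of an episode with its renumbered destination."""
    return (
        episode_data_path(src_root, old_index, chunks_size),
        episode_data_path(dst_root, new_index, chunks_size),
    )
```

Because the chunk is derived from the new index, a merged dataset that grows past one chunk lands in the next chunk folder automatically. If the Parquet rows themselves store an episode index, those values would also need rewriting, which this path-only sketch does not do.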


🚀 Motivation

Training or benchmarking often requires combining multiple recording runs, but naïvely concatenating folders breaks episode indexing and corrupts task statistics. This utility automates a safe, reproducible merge so researchers can focus on experimentation instead of data wrangling.


✅ Checklist

  • [x] Unit-tested on synthetic datasets (100% unique indices, intact meta totals).
  • [x] Follows project style (ruff, pre-commit).
  • [x] No external dependencies beyond draccus / stdlib.
  • [x] Added docstring & in-file example for quick reference.

landEpita avatar May 26 '25 13:05 landEpita

This will be extremely useful but only two datasets? This seems unnecessarily limiting. Rather, it should take an arbitrary number of datasets to merge.

RoBartic avatar May 29 '25 16:05 RoBartic

Okay, I'll change it to allow an unlimited number of datasets.

landEpita avatar May 29 '25 16:05 landEpita

@RoBartic There you go: I made it possible to merge multiple datasets and added a feature to delete specific episodes.

python dataset_tool_cli.py merge \
    --datasets "/path/to/datasetA /path/to/datasetB" \
    --output_dir /path/to/merged_dataset

python dataset_tool_cli.py delete \
    --dataset_dir /path/to/dataset_to_modify \
    --episode_id 32 \
    --verbose

landEpita avatar May 29 '25 17:05 landEpita

So fast! Awesome work! I confirmed with a simple test (3 one-episode datasets) that it appears to be working. Thank you!

RoBartic avatar May 29 '25 18:05 RoBartic

A pleasure to contribute to this project!

landEpita avatar May 29 '25 19:05 landEpita

Hi @landEpita, a PR adding these merging utilities to any4lerobot would be warmly welcome.

Tavish9 avatar May 30 '25 06:05 Tavish9

Hi @landEpita, merging datasets works great. For deleting, it would be nice if I could delete multiple episodes at once.

laurius avatar Jun 01 '25 19:06 laurius

Does the merge command in dataset_tool_cli.py support multiple chunks?

kokun66 avatar Jun 03 '25 20:06 kokun66

It would be nice if it supported multiple chunks...

kokun66 avatar Jun 03 '25 20:06 kokun66