feat: add `remove_episodes` utility
What this does (✨ Feature)
This commit introduces a `remove_episodes` function that removes specific episodes from a dataset and automatically updates all affected data, video, and metadata files.
The removal is safe: if a failure occurs at any point during the process, the original dataset is preserved. Even when no error occurs, the original dataset is backed up locally (timestamped) in case you need to revert.
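For illustration, usage from Python might look roughly like this (a minimal sketch; the import path and exact signature are assumptions, not the final API):

```python
# Minimal usage sketch; the import path and exact signature are assumptions, not the final API.
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
from lerobot.common.datasets.utils import remove_episodes  # hypothetical import location

# Load a dataset from the local cache (downloading it from the hub if needed).
dataset = LeRobotDataset("lerobot/aloha_mobile_cabinet")

# Remove episodes 3 and 7. The original dataset is backed up locally (timestamped)
# before any data, video, or metadata files are touched.
updated_dataset = remove_episodes(dataset, episode_indices=[3, 7])
```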
How it was tested
- Added `test_remove_episodes` in `tests/test_datasets.py`.
- Tested manually on the `lerobot/aloha_mobile_cabinet` dataset; manual inspection of the files and dataset seemed correct.
How to checkout & try? (for the reviewer)
```
pytest -sx tests/test_datasets.py::test_remove_episodes
```
@bsprenger Awesome stuff here!
The logic here makes sense, and seems to capture all of the aspects of the dataset to be modified.
The test also works for me, and I tried this on a local dataset of mine and got the expected results.
Two comments:
- I think this would strongly benefit from being a CLI tool that can be called similarly to the dataset visualization tools in `lerobot/scripts`. This is what I was working on on my branch.
- The backup directory that's generated appears to not be ephemeral, but I believe this cleanup should happen automatically (or the backup should get dumped into a hidden folder like `.tmp`).
  - The alternative is that there could be an additional flag on the CLI tool to explicitly keep the backup, with the default being not to keep it.
Let me know your thoughts!
Hey @brysonjones thanks for the review!
I have made it a CLI tool and moved it to scripts.
For the backup, I added an optional argument `backup`, which can be:
- `True`: back up to a default location
- a string or path: back up to that location
- `False` (default): do not back up
Through the CLI, you could then pass `--backup` to back up to the default location, or `--backup /some/path` for somewhere specific. What do you think? The default location could definitely be up for debate (I just put it as a timestamped folder beside the original for now).
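For example, the invocation might look roughly like this (flag names taken from the description above and from usage later in this thread; exact syntax may differ):

```bash
# Remove episode 45, backing up to the default location (a timestamped folder beside the original).
python lerobot/scripts/remove_episodes.py --repo-id lerobot/aloha_mobile_cabinet --episodes 45 --backup

# Same removal, but back up to an explicit path instead.
python lerobot/scripts/remove_episodes.py --repo-id lerobot/aloha_mobile_cabinet --episodes 45 --backup /some/path
```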
Let me know what you think!
great work! worked very well for me
This worked well for me! @bsprenger
Should provide a good base to expand some other capabilities going forward
@Cadene Any thoughts from you on this?
Thanks for doing this @bsprenger ! Worked great for me as well! I tried it and noticed a few things:
- I removed an episode using your script
- Checked locally that it was indeed removed. But the remote was not updated (by design? I don't see `push_to_hub` in the code)
- Resumed recording of 2 more episodes and confirmed the files online (`info.json` looks correct, dataset card not updated?). ~~When I tried to resume recording locally, I got a metadata compatibility check failure. Is it due to the updated dataset not getting pushed to remote?~~ (Turned out that I forgot to update `robot_devices/robots/configs.py`!)
@brysonjones @bsprenger Really cool work. I have one comment ;)
Did you measure the time it takes?
What I had in mind is to not modify the `hf_dataset`.
It's ok if the episode indices of a dataset are not consecutive numbers; same for the task indices.
As a result, deleting an episode will scale to any dataset size.
What do you think?
@Cadene Ah, now I understand the motivation for why the dataset v2 split each episode up into individual files!
Indeed it's way faster to not reindex, and simply filter out the removed episode files. And it should scale well for large datasets. I modified my commit accordingly.
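To make the idea concrete, here's a rough sketch of the "filter, don't reindex" approach (file layout based on the v2.x dataset format; this is not the code from the commit):

```python
# Conceptual sketch of the "filter, don't reindex" idea; not the code from this commit.
from pathlib import Path

DEFAULT_CHUNK_SIZE = 1000  # assumption: the v2.x format groups 1000 episodes per chunk


def drop_episode_files(dataset_root: Path, episode_index: int, camera_keys: list[str]) -> None:
    """Delete one episode's parquet and video files, leaving all other indices untouched."""
    chunk = episode_index // DEFAULT_CHUNK_SIZE
    parquet = dataset_root / f"data/chunk-{chunk:03d}/episode_{episode_index:06d}.parquet"
    parquet.unlink(missing_ok=True)
    for key in camera_keys:
        video = dataset_root / f"videos/chunk-{chunk:03d}/{key}/episode_{episode_index:06d}.mp4"
        video.unlink(missing_ok=True)
```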
@liuhuanjim013 Thanks for testing! Indeed, I did not automatically update the remote. I thought it would be better to intentionally push the dataset afterwards, when satisfied that the removal is what you wanted. I think there exists a script to do that? Remi do you have thoughts on this?
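In the meantime, if anyone wants to push the cleaned local dataset manually, one option (outside this PR) is `huggingface_hub`'s `upload_folder`, roughly like this (repo id and cache path are examples):

```python
# One possible way to push the cleaned local dataset to the hub afterwards; not part of this PR.
from pathlib import Path
from huggingface_hub import HfApi

repo_id = "lerobot/aloha_mobile_cabinet"  # example repo id
local_root = Path.home() / ".cache/huggingface/lerobot" / repo_id

HfApi().upload_folder(
    repo_id=repo_id,
    folder_path=local_root,
    repo_type="dataset",
    commit_message="Remove episodes locally, then sync to the hub",
)
```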
@Cadene Yeah, I think this makes sense!
@bsprenger @liuhuanjim013 Perhaps this could be another flag on the script to just push to the hub, similar to how the dataset collection script has that flag?
+1 to the flag idea!
@brysonjones @liuhuanjim013 I added some flags for pushing to the hub. Let me know what you think!
I also added a few lines to update the dataset card.
@bsprenger I tried it and checked the files online (I tried to remove episodes 15-17 from a dataset of 50 episodes):
- `episodes.jsonl`, `episodes_stats.jsonl`, and `info.json` all seem to be updated
- the dataset card also seems to be updated with the correct video numbers
- the removed videos are still in the videos folders
Then I tried to visualize the dataset locally:
- The total number of videos are indeed reduced (from 50 to 47)
- But I could still see the links to the removed episodes: when clicking on them, I got "Internal Server Error" (I removed episodes 15-17, the links are still there)
- The links to the last 3 episodes are gone (episodes 47-49)

Could you take a look and see if you have the same problem?
@liuhuanjim013 Thanks for catching these! I reproduced them and fixed them.
**Hub issue:** I initially overlooked that pushing doesn't automatically remove files missing from the local dataset. This is now fixed.
Instead of a full sync function, I explicitly remove the target episodes' data/video files from the hub. I tend to prefer this very explicit solution to avoid too much hidden behaviour which could result from a full sync. However, this means that if your local and remote folders are out of sync right now, you may need to delete the remote and reupload the local once. After that, the removal script should keep them aligned.
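For reference, a minimal sketch of what explicitly removing an episode's files from the hub could look like with `huggingface_hub` (paths follow the v2.x layout; the actual script may differ):

```python
# Sketch only: explicitly delete one episode's data and video files from the hub repo.
from huggingface_hub import HfApi

api = HfApi()
repo_id = "lerobot/aloha_mobile_cabinet"  # example repo id
episode_index, chunk = 15, 0  # e.g. episode 15 sits in chunk-000 with the default chunk size
camera_keys = ["observation.images.cam_high"]  # example camera key

paths = [f"data/chunk-{chunk:03d}/episode_{episode_index:06d}.parquet"]
paths += [f"videos/chunk-{chunk:03d}/{key}/episode_{episode_index:06d}.mp4" for key in camera_keys]
for path in paths:
    api.delete_file(path_in_repo=path, repo_id=repo_id, repo_type="dataset")
```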
**Visualization scripts:** The issue stemmed from the scripts' assumption that episode indices are sequential, which is now fixed. I've updated all instances I could find, but if you spot any more, let me know!
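The gist of the fix, sketched (not the literal diff): derive navigation from the episode indices that actually exist instead of assuming `episode_index + 1` is present:

```python
# Sketch of navigating non-consecutive episode indices, e.g. [0, ..., 14, 18, 19] after removals.
def next_episode_index(existing: list[int], current: int) -> int | None:
    """Return the next episode index that actually exists after `current`, or None at the end."""
    later = sorted(i for i in existing if i > current)
    return later[0] if later else None


assert next_episode_index([0, 1, 14, 18, 19], 14) == 18  # skips the removed 15-17
```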
@liuhuanjim013 did you get a chance to try this? 🙂
@bsprenger tested on a different repo and it works! so it should be good for anyone to try the latest version! for the one that had trouble, i'll need to look further to see how to recover it
Hey @Cadene, any thoughts on this? 🙂 any more feedback to get this merged?
i tested on the old (broken) dataset and did the following to fix it:
- removed (renamed) dataset on hugging face
- used latest remove episode script to remove a new episode (e.g. 19) and push to hub
one thing i noticed using the visualization script: when using the down arrow key to see the next episode, i would get a 500 Internal Server Error when navigating from the episode just before a removed one (e.g. 14): it tries to load the removed episode (e.g. 15) and gets an error. the main page does not show the removed episodes (e.g. 15, 16, 17, 19).
added a fix (https://github.com/huggingface/lerobot/pull/943)
I tried to delete an episode (306 from `Chojins/chess_game_001_blue_stereo`) and it appeared to work, but now visualising the dataset shows the following error on episode 306 (presumably the old episode 307):
Internal Server Error The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.
I tried recording another episode to see if that would fix it, but got the following error when trying to push to the hub:
```
Traceback (most recent call last):
  File "/home/jacob/lerobot/lerobot/scripts/control_robot.py", line 393, in
```
Can I repair my dataset somehow?
Hi @Chojins, I assume you are using the online visualization tool. Indeed, it will show an error, because the online tool would need the changes from this PR to be merged in order to work.
If you use this PR's branch and run the visualizer locally, it should work:
```
python lerobot/scripts/visualize_dataset_html.py --repo-id=chojins/chess_game_001_blue_stereo
```
When I run this, it appears that the episode was correctly removed!
However, thanks to you I did realize that there was another issue with the remove_episodes script. It was pushing to the hub but not retagging it. This means that if you delete the local cache it would not redownload correctly, and also other people would not be able to correctly download it.
I updated the script yesterday to fix this going forward, but you will need to repair your dataset on the hub. Here's how you can do it.
```python
from huggingface_hub import HfApi

hub_api = HfApi()

# we have to delete the old tag, which points to an older commit of the dataset
hub_api.delete_tag('chojins/chess_game_001_blue_stereo', tag='v2.1', repo_type='dataset')

# then we create a new tag which points to the most recent commit (i.e. the one that deleted episodes)
hub_api.create_tag('chojins/chess_game_001_blue_stereo', tag='v2.1', revision="main", repo_type="dataset")
```
Hi @bsprenger , thanks very much for the great work. I have a new question from trying your script: if I use the remove script to delete some episodes from the head or middle of a dataset, can I still resume collecting new episodes based on this reduced dataset (using `python control_robot.py --control.type=record --control.resume=true`)? I observed that deleting episodes reduces `meta.info.total_episodes` and `meta.info.total_frames`, and these two values determine the global indices of newly added episodes when resuming recording. This may lead to index overlaps with previous episodes (in the parquet file names, `meta.episodes`, and the `episode_index` and `index` keys).
For example, if I have a dataset with 10 episodes and delete episode_index 1, 2, 3, then the dataset's `meta.episodes` becomes [0, 4, 5, 6, 7, 8, 9] and `meta.info.total_episodes` = 7. When a new episode is added to this reduced dataset, its episode_index = `meta.info.total_episodes` = 7, which overlaps with the existing episode_index 7 in the dataset. The overlap shows up in the parquet file names, `meta.episodes`, and the `episode_index` and `index` keys, and leads to visualization or training failures. Is there any way to resolve this overlap? I often encounter situations where I need to clean the dataset and then supplement it with additional new data.
Absolutely would love this feature. If I need to remove any episodes when training in the next few days, I will test this out and put any feedback here.
EDIT:
Works for me, only a few comments on the command
Hi @robinhyg , indeed I see the issue! I think this is a problem with the data collection, which assumes sequential indices. I think that could be resolved in a separate PR since it is less an issue with this script, and more a limitation of the dataset format itself.
Thanks for the review @nicholas-maselli, I'm glad you and others in the community found this script useful!
@aliberts @imstevenpmwork Would the HF team be able to review this? There's been great community testing and a lot of interest in using it! 😊 thanks!
hi @bsprenger , thanks for your amazing work! 🙌 When I ran the command:
```
python lerobot/scripts/remove_episodes.py \
    --repo-id tree1258/0625test \
    --root ~/.cache/huggingface/lerobot/tree1258/0625test \
    --episodes 45 \
    --backup true \
    --push-to-hub 0
```
It perfectly removed the 45th episode out of my 50-episode dataset — which was exactly the problematic one. Afterwards, I pushed the remaining 49 episodes to a new Hugging Face repository.
However, when I tried to start training using:
```
python lerobot/scripts/train.py \
    --dataset.repo_id=tree1258/0625test \
    --policy.type=act \
    --output_dir=outputs/train/0625_act_test \
    --job_name=0625_act_test \
    --policy.device=cuda \
    --wandb.enable=false
```
I encountered the following error:
```
IndexError: index 49 is out of bounds for dimension 0 with size 49
```
I’ve looked into it but haven’t been able to find a solution. Do you have any idea what might be causing this? Thanks so much for your help!
Hello! Is this stale or still in dev? Looking to delete some episodes!
This feature is greatly needed, thanks for creating it. hope the team can approve it
@Avory1258
Not sure if this is still relevant, since it's a few weeks old. But I looked at your dataset, and there is an issue with the metadata: the episode was removed, but the metadata files (`episodes_stats.jsonl` and `episodes.jsonl`) weren't correctly updated (the index jumps from 44 to 46, skipping 45, which causes the error). All indices should be sequential, so when deleting the 45th, the subsequent indices should have been decremented by 1.
Not sure if this was an issue with the feature's code or with how the script was run. If it's still relevant, try running the script again or manually edit the metadata files.
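If anyone wants to check their own dataset for the same problem, a quick diagnostic could look like this (assuming the standard `meta/episodes.jsonl` layout and the cache path used above):

```python
# Quick check for gaps in a local dataset's episode indices, following the diagnosis above.
import json
from pathlib import Path

meta_path = Path("~/.cache/huggingface/lerobot/tree1258/0625test/meta/episodes.jsonl").expanduser()
indices = sorted(
    json.loads(line)["episode_index"] for line in meta_path.read_text().splitlines() if line.strip()
)
gaps = [(prev, nxt) for prev, nxt in zip(indices, indices[1:]) if nxt != prev + 1]
print("episode indices:", indices)
print("gaps (previous, next):", gaps if gaps else "none, indices are sequential")
```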
@bsprenger @Cadene What is blocking this merge at the moment? Indeed, a very useful script.