
Large scale training

Open richardrl opened this issue 5 months ago • 2 comments

Hi, continuing this thread: https://github.com/huggingface/lerobot/issues/436. I am wondering if you have any benchmarks or numbers on the storage required and the throughput achievable for large-scale training?

I know SmolVLA trained on a large amount of data.

I understand you all use Parquet to essentially store the full images. I think this can be quite fast to read, but it likely takes up a lot of disk space.

I am curious how your space and speed usage compares to webdataset and to torchcodec (decoding MP4s on the fly), as those seem to be the other scalable methods for storing this kind of data.
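For context, here is a rough back-of-envelope comparison of the two storage strategies. The frame counts, resolution, and compression ratios below are illustrative assumptions, not LeRobot measurements:

```python
# Back-of-envelope storage estimate: raw RGB frames (as stored in columnar
# formats like Parquet, possibly with per-image lossless compression) vs.
# H.264/MP4 video decoded on the fly. All ratios here are rough assumptions.

def raw_rgb_bytes(num_frames: int, height: int, width: int, channels: int = 3) -> int:
    """Uncompressed RGB frame storage in bytes."""
    return num_frames * height * width * channels

def compressed_bytes(num_frames: int, height: int, width: int, ratio: int) -> int:
    """Storage after compression; the ratio is an assumption
    (e.g. ~2x for lossless PNG, ~50x or more for H.264 video)."""
    return raw_rgb_bytes(num_frames, height, width) // ratio

# Hypothetical dataset: 1M frames of 480x640 RGB.
frames, h, w = 1_000_000, 480, 640
raw_gb = raw_rgb_bytes(frames, h, w) / 1e9        # ~922 GB uncompressed
png_gb = compressed_bytes(frames, h, w, 2) / 1e9  # ~461 GB, assumed 2x ratio
mp4_gb = compressed_bytes(frames, h, w, 50) / 1e9 # ~18 GB, assumed 50x ratio
print(f"raw: {raw_gb:.0f} GB, per-image compressed: {png_gb:.0f} GB, mp4: {mp4_gb:.0f} GB")
```

The tradeoff this sketches: video encoding can save an order of magnitude or more of disk, but pushes CPU cost into the training loop (decoding), whereas storing frames directly trades disk for faster random access.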

Or, if there are concrete numbers for the largest dataset trained with LeRobot: what was the throughput, etc.?

richardrl avatar Jun 13 '25 06:06 richardrl

Hi, this is what we want to address with version 3 of our datasets (development is still ongoing).

aliberts avatar Jun 15 '25 13:06 aliberts

Thank you. Any informal comments on the priorities and strengths of LeRobot's current implementation for RGB storage? @aliberts

richardrl avatar Jun 18 '25 23:06 richardrl