Large scale training
Hi, continuing this thread: https://github.com/huggingface/lerobot/issues/436. I am wondering if you have any benchmarks or numbers on the storage required and the throughput achieved for large-scale training?
I know SmolVLA was trained on a large amount of data.
I understand you essentially use Parquet to store the full images. I think this can be quite fast to read, but it probably takes up a lot of disk space.
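For concreteness, here is a rough sketch of what reading a single encoded frame back out of Parquet could look like with pyarrow. The file name and the `observation.image` column are assumptions for illustration, not LeRobot's actual schema:

```python
# Hypothetical example: file name and column name are illustrative only,
# not LeRobot's actual on-disk layout.
import io

import pyarrow.parquet as pq
from PIL import Image

# Read just the image column from one episode's Parquet file.
table = pq.read_table("episode_000.parquet", columns=["observation.image"])
raw = table.column("observation.image")[0].as_py()  # encoded bytes for frame 0
image = Image.open(io.BytesIO(raw))                 # decode to a PIL image
```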
I am curious how your disk usage and read speed compare to webdataset and torchcodec (decoding MP4s on the fly), as those seem to be the other scalable approaches to storing this kind of data.
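By "decoding MP4s on the fly" I mean something like the following minimal torchcodec sketch; the file name and frame indices are placeholders, not part of any LeRobot API:

```python
# Minimal sketch of on-the-fly video decoding with torchcodec.
# "episode_000.mp4" and the frame indices are placeholders.
from torchcodec.decoders import VideoDecoder

decoder = VideoDecoder("episode_000.mp4")           # seekable decoder over the file
batch = decoder.get_frames_at(indices=[0, 30, 60])  # random access by frame index
print(batch.data.shape)                             # uint8 batch, shape (3, C, H, W)
```

The trade-off is roughly: MP4 keeps the data small on disk at the cost of decode work per sample, while raw frames in Parquet trade disk space for cheaper reads.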
Or, if there are concrete numbers for the largest dataset trained with LeRobot: what was the throughput, etc.?
Hi, this is what we want to address with version 3 of our datasets (development is still ongoing).
Thank you. Any informal comments on the priorities and strengths of LeRobot's current implementation for RGB storage? @aliberts