lerobot
Add torchcodec cpu
What this does
This PR replaces torchvision CPU decoding with torchcodec CPU decoding.
It also adds a decode_video_frames function that wraps multiple backends, so callers no longer invoke decode_video_frames_BACKENDNAME functions separately. This keeps the calling code simpler and lets us add more decoders later on.
The decoder is selected by the dataset.video_backend key, which defaults to torchcodec.
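For reviewers who want a picture of the dispatch described above, here is a minimal sketch. It is not the code in this PR: the two `_decode_*` functions are stubs that return labels instead of frames, the `_BACKENDS` registry and `DEFAULT_BACKEND` names are hypothetical, and the simplified signature is an assumption.

```python
from __future__ import annotations

# Hypothetical stand-ins for the real decoders. Each stub returns
# (backend_name, timestamp) pairs instead of decoded frame tensors.
def _decode_torchcodec(video_path: str, timestamps: list[float]) -> list[tuple[str, float]]:
    return [("torchcodec", ts) for ts in timestamps]


def _decode_pyav(video_path: str, timestamps: list[float]) -> list[tuple[str, float]]:
    return [("pyav", ts) for ts in timestamps]


# Hypothetical registry: adding a decoder later means adding one entry here.
_BACKENDS = {
    "torchcodec": _decode_torchcodec,
    "pyav": _decode_pyav,
}

DEFAULT_BACKEND = "torchcodec"


def decode_video_frames(
    video_path: str,
    timestamps: list[float],
    backend: str | None = None,
) -> list[tuple[str, float]]:
    """Dispatch to the requested backend, falling back to the default."""
    name = backend or DEFAULT_BACKEND
    try:
        decode = _BACKENDS[name]
    except KeyError:
        raise ValueError(f"Unsupported video backend: {name!r}") from None
    return decode(video_path, timestamps)
```

In this sketch, the value of dataset.video_backend would be passed as `backend`; leaving it unset selects torchcodec.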
How it was tested
Tested and benchmarked the decoders on different datasets and policies.
How to checkout & try? (for the reviewer)
Run the training script with a dataset that contains videos to decode. Example:
python lerobot/scripts/train.py \
--output_dir=outputs/train/act_aloha_insertion \
--policy.type=act \
--dataset.repo_id=lerobot/aloha_sim_insertion_human \
--env.type=aloha \
--env.task=AlohaInsertion-v0
Benchmarks
Ran one benchmark on the lerobot/aloha_sim_insertion_human_image dataset.
Comparison: PyAV vs TorchCodec (CPU)
| Metric | PyAV | TorchCodec-CPU |
|---|---|---|
| Video to Images Load Time Ratio | 1.87 | 1.25 |
| Avg MSE | 5.14e-05 | 4.88e-05 |
| Avg PSNR | 43.17 | 43.37 |
| Avg SSIM | 0.995 | 0.995 |
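For context on the quality rows above, MSE and PSNR can be computed roughly as in the sketch below. It uses the standard definitions and assumes pixel values scaled to [0, 1] and frames flattened to plain sequences. The benchmark script's own implementation may differ, and the function names here are illustrative only.

```python
import math


def mse(decoded, reference):
    """Mean squared error between two equally sized flat pixel sequences."""
    if len(decoded) != len(reference) or len(decoded) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((d - r) ** 2 for d, r in zip(decoded, reference)) / len(decoded)


def psnr(decoded, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB; max_value is the assumed pixel range."""
    err = mse(decoded, reference)
    if err == 0:
        return math.inf  # identical frames
    return 10 * math.log10(max_value**2 / err)
```

Lower MSE and higher PSNR both mean the decoded frame is closer to the reference image. SSIM is omitted here because it needs windowed statistics and is usually taken from an image-processing library.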
What's left
~~Remove/suppress libdav1d logs (they're noisy) -> there's no env variable to disable those for now but they'll be deactivated in the next version of torchcodec.~~
PR is in a good state ✅
Torchcodec consistently outperforms PyAV across all tested datasets and video codecs (encoders): it achieves lower MSE (better accuracy), higher PSNR (better quality), and higher SSIM (better perceptual similarity). This trend holds across libsvtav1, libx264, and libx265, and it makes torchcodec the better choice for both efficiency and quality. To reproduce the full results, check this link
Great! I guess cc @imstevenpmwork
Hello @jadechoghari, thanks for your contribution! This LGTM 😄