feat: add the audio modality 🎤
Add the audio modality to recordings and datasets

This PR introduces the audio modality in LeRobot. It allows users to record audio along with video using laptop-embedded or USB microphones, and to store the gathered data in a `LeRobotDataset`. It includes:
- A new `Microphone` class, inspired by the `Camera` class, which leverages the `sounddevice` and `soundfile` Python packages to manage microphones and record multi-channel audio data. This class supports recording sound in two ways: (1) as a continuous signal written directly to a file over an extended period of time (e.g. an episode), or (2) as smaller "audio chunks", which provide a snapshot of the past few seconds (e.g. for policy inputs). Both recording methods run in a separate thread, which should not impact performance.
- An integration of the `Microphone` class in the `Robot` class to easily define microphones and link them to cameras. By doing so, audio is saved along with the recorded video in a single MP4 file.
- An integration of the audio modality in `LeRobotDataset`, as a dedicated feature stored in a separate file (MP4 for video+audio, M4A for audio only). Like video files, audio files are encoded using `ffmpeg`. For now, only `torchaudio` decoding is supported (`torchcodec` decoding is a work in progress). Once decoded, audio chunks of fixed length may be queried based on a user-given timestamp.
- A new set of dedicated tests.
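The timestamp-based chunk query described above can be sketched roughly as follows. This is a minimal standalone illustration over a NumPy buffer, not the actual `Microphone` implementation; the function name and signature are hypothetical:

```python
import numpy as np

def query_audio_chunk(audio: np.ndarray, sample_rate: int,
                      timestamp_s: float, chunk_duration_s: float) -> np.ndarray:
    """Return a fixed-length chunk of samples ending at `timestamp_s`.

    `audio` has shape (num_samples, num_channels), as produced by a
    multi-channel recording.
    """
    end = int(round(timestamp_s * sample_rate))
    start = max(0, end - int(round(chunk_duration_s * sample_rate)))
    return audio[start:end]

# 2 s of silent stereo audio at 16 kHz, standing in for decoded data
rate = 16_000
signal = np.zeros((2 * rate, 2), dtype=np.float32)

# Query the 0.5 s of audio preceding t = 1.5 s
chunk = query_audio_chunk(signal, rate, timestamp_s=1.5, chunk_duration_s=0.5)
print(chunk.shape)  # (8000, 2)
```

The clamp at `start = max(0, ...)` simply truncates queries that reach before the start of the recording; the real implementation may handle that edge case differently.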
## How it was tested
The feature was tested on macOS Sequoia with a SO100 setup with two cameras (laptop and phone) and two microphones (laptop and headset). One of the microphones was linked to a camera, and the other was treated as a standalone sensor. See the SO100 entry in `config.py` for more details.
## How to check out & try?

Assuming a SO100 setup:
- Install the audio dependencies:

```shell
pip install -e ".[audio]"
```
- Query the microphone indices and update the robot `config.py` file:

```shell
python -m sounddevice
```

Microphones are the devices with at least one input channel.
You may get a microphone's sample rate as follows:

```shell
python -c "import sounddevice as sd; print(sd.query_devices(<microphone_index>)['default_samplerate'])"
```
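To identify which devices are microphones, you can filter on `max_input_channels`, one of the fields in the dicts returned by `sounddevice.query_devices()`. A minimal sketch, with a hard-coded device list standing in for the real query so it runs without audio hardware:

```python
# In practice these dicts would come from sounddevice.query_devices();
# hard-coded here so the example runs without audio hardware.
devices = [
    {"name": "MacBook Pro Microphone", "max_input_channels": 1,
     "default_samplerate": 48000.0},
    {"name": "MacBook Pro Speakers", "max_input_channels": 0,
     "default_samplerate": 48000.0},
    {"name": "USB Headset", "max_input_channels": 2,
     "default_samplerate": 44100.0},
]

# Microphones are the devices exposing at least one input channel.
microphones = [(i, d["name"]) for i, d in enumerate(devices)
               if d["max_input_channels"] > 0]
print(microphones)  # [(0, 'MacBook Pro Microphone'), (2, 'USB Headset')]
```

The index of each entry is what you would put in `config.py` as the microphone index.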
- Run a dummy dataset recording:

```shell
python lerobot/scripts/control_robot.py \
  --robot.type=so100 \
  --control.type=record \
  --control.fps=30 \
  --control.single_task="Hear some sound." \
  --control.repo_id=so100/sound_test \
  --control.tags='["so100","sound"]' \
  --control.warmup_time_s=2 \
  --control.episode_time_s=5 \
  --control.reset_time_s=1 \
  --control.num_episodes=1 \
  --control.push_to_hub=false
```
You should be able to check the recorded data in your local dataset folder.
⚠️ Running tests highlighted that torchaudio (used for audio decoding, as torchcodec does not officially support it yet) is not compatible with ffmpeg > 7, which is now the recommended version.
There is a working fix on macOS which relies on installing ffmpeg with:

```shell
conda install ffmpeg=6.1.2 -c conda-forge
```
- No conflict with the latest versions of `av` or `torchvision` (which bundle their own ffmpeg version);
- Compatible with `torchaudio` (ffmpeg < 7) and `torchcodec` (installed with conda);
- Already lists `libsvtav1` among its encoders.
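If you want to verify the pinned version programmatically, you can parse the first line of `ffmpeg -version`. A minimal sketch with the output line hard-coded (in practice you would capture it via `subprocess.check_output(["ffmpeg", "-version"], text=True)`):

```python
import re

# Example first line of `ffmpeg -version` output, hard-coded for illustration.
version_line = "ffmpeg version 6.1.2 Copyright (c) 2000-2023 the FFmpeg developers"

match = re.search(r"ffmpeg version (\d+)\.(\d+)", version_line)
major = int(match.group(1))
print(major)  # 6
assert major < 7, "torchaudio decoding requires ffmpeg < 7"
```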