feat: add the audio modality 🎤
Add the audio modality to recordings and datasets

This PR introduces the audio modality in LeRobot. It allows users to record audio along with video using laptop-embedded or USB microphones, and to store the gathered data in a `LeRobotDataset`. It includes:
- A new `Microphone` class, inspired by the `Camera` class, which leverages the `sounddevice` and `soundfile` Python packages to manage microphones and record multi-channel audio data. This class supports recording sound in two ways: (1) as a continuous signal written directly to a file over an extended period of time (e.g. an episode), or (2) as smaller "audio chunks", which provide a snapshot of the past few seconds (e.g. for policy inputs). Both recording methods run in a separate thread, which should not impact performance.
- An integration of the `Microphone` class in the `Robot` class to easily define microphones and link them to cameras. By doing so, audio is saved along with the recorded video in a single MP4 file.
- An integration of the audio modality in `LeRobotDataset`, as a dedicated feature stored in a separate file (MP4 for video+audio, M4A for audio only). Like video files, audio files are encoded using `ffmpeg`. For now, only `torchaudio` decoding is supported (`torchcodec` decoding is a work in progress). Once decoded, audio chunks of fixed length may be queried based on a user-given timestamp.
- A new set of dedicated tests.
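The timestamp-based chunk query described above can be sketched roughly as follows. This is a minimal standalone illustration over a NumPy buffer, not the actual `Microphone` implementation; the function name and signature are hypothetical:

```python
import numpy as np

def query_audio_chunk(audio: np.ndarray, sample_rate: int,
                      timestamp_s: float, chunk_duration_s: float) -> np.ndarray:
    """Return a fixed-length chunk of samples ending at `timestamp_s`.

    `audio` has shape (num_samples, num_channels), as produced by a
    multi-channel recording.
    """
    end = int(round(timestamp_s * sample_rate))
    start = max(0, end - int(round(chunk_duration_s * sample_rate)))
    return audio[start:end]

# 2 s of silent stereo audio at 16 kHz, standing in for decoded data
rate = 16_000
signal = np.zeros((2 * rate, 2), dtype=np.float32)

# Query the 0.5 s of audio preceding t = 1.5 s
chunk = query_audio_chunk(signal, rate, timestamp_s=1.5, chunk_duration_s=0.5)
print(chunk.shape)  # (8000, 2)
```

The clamp at `start = max(0, ...)` simply truncates queries that reach before the start of the recording; the real implementation may handle that edge case differently.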
## How it was tested
The feature was tested on macOS Sequoia with a SO100 setup with two cameras (laptop and phone) and two microphones (laptop and headset). One of the microphones was linked to a camera, and the other was treated as a standalone sensor. See the SO100 entry in `config.py` for more details.
## How to check out & try?

Assuming a SO100 setup:
- Install the audio dependencies:

```shell
pip install -e ".[audio]"
```
- Query the microphone indices and update the robot `config.py` file:

```shell
python -m sounddevice
```

Microphones are the devices with at least one input channel.
You may get a microphone's sample rate as follows:

```shell
python -c "import sounddevice as sd; print(sd.query_devices(<microphone_index>)['default_samplerate'])"
```
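To identify which devices are microphones, you can filter on `max_input_channels`, one of the fields in the dicts returned by `sounddevice.query_devices()`. A minimal sketch, with a hard-coded device list standing in for the real query so it runs without audio hardware:

```python
# In practice these dicts would come from sounddevice.query_devices();
# hard-coded here so the example runs without audio hardware.
devices = [
    {"name": "MacBook Pro Microphone", "max_input_channels": 1,
     "default_samplerate": 48000.0},
    {"name": "MacBook Pro Speakers", "max_input_channels": 0,
     "default_samplerate": 48000.0},
    {"name": "USB Headset", "max_input_channels": 2,
     "default_samplerate": 44100.0},
]

# Microphones are the devices exposing at least one input channel.
microphones = [(i, d["name"]) for i, d in enumerate(devices)
               if d["max_input_channels"] > 0]
print(microphones)  # [(0, 'MacBook Pro Microphone'), (2, 'USB Headset')]
```

The index of each entry is what you would put in `config.py` as the microphone index.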
- Run a dummy dataset recording:

```shell
python lerobot/scripts/control_robot.py \
  --robot.type=so100 \
  --control.type=record \
  --control.fps=30 \
  --control.single_task="Hear some sound." \
  --control.repo_id=so100/sound_test \
  --control.tags='["so100","sound"]' \
  --control.warmup_time_s=2 \
  --control.episode_time_s=5 \
  --control.reset_time_s=1 \
  --control.num_episodes=1 \
  --control.push_to_hub=false
```
You should be able to check the recorded data in your local dataset folder.
⚠️ Running tests highlighted that torchaudio (used for audio decoding, as torchcodec does not officially support it yet) is not compatible with ffmpeg > 7, which is now the recommended version.
There is a working fix on macOS which relies on installing ffmpeg with:

```shell
conda install ffmpeg=6.1.2 -c conda-forge
```
- No conflict with the latest versions of `av` or `torchvision` (which bundle their own ffmpeg version);
- Compatible with `torchaudio` (ffmpeg < 7) and `torchcodec` (installed with conda);
- Already lists `libsvtav1` among its encoders.
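If you want to verify the pinned version programmatically, you can parse the first line of `ffmpeg -version`. A minimal sketch with the output line hard-coded (in practice you would capture it via `subprocess.check_output(["ffmpeg", "-version"], text=True)`):

```python
import re

# Example first line of `ffmpeg -version` output, hard-coded for illustration.
version_line = "ffmpeg version 6.1.2 Copyright (c) 2000-2023 the FFmpeg developers"

match = re.search(r"ffmpeg version (\d+)\.(\d+)", version_line)
major = int(match.group(1))
print(major)  # 6
assert major < 7, "torchaudio decoding requires ffmpeg < 7"
```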