Audio dataset is not decoding on 4.1.1
Describe the bug
The audio column stays as non-decoded objects even when rows are accessed.
dataset = load_dataset("MrDragonFox/Elise", split = "train")
dataset[0] # see that it doesn't show 'array' etc...
Works fine with datasets==3.6.0
Followed the docs in
- https://huggingface.co/docs/datasets/en/audio_load
Steps to reproduce the bug
dataset = load_dataset("MrDragonFox/Elise", split = "train")
dataset[0] # see that it doesn't show 'array' etc...
Expected behavior
It should decode when accessing the element.
Environment info
datasets 4.1.1, Ubuntu 22.04
Related
- https://github.com/huggingface/datasets/issues/7707
Previously (datasets<=3.6.0), audio columns were decoded automatically when accessing a row. Now, for performance reasons, audio decoding is lazy by default: you just see the file path unless you explicitly cast the column to Audio.
Here’s the fix (following the current datasets audio docs):
from datasets import load_dataset, Audio
dataset = load_dataset("MrDragonFox/Elise", split="train")
# Explicitly decode the audio column
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
print(dataset[0]["audio"])
# {'path': '...', 'array': array([...], dtype=float32), 'sampling_rate': 16000}
@haitam03-yo's comment is right that the data is no longer decoded by default, but here is how it works in practice now:
From datasets v4, audio data are read as AudioDecoder objects from torchcodec. This doesn't decode the data by default, but you can call audio.get_all_samples() to decode the audio.
See the documentation on how to process audio data here: https://huggingface.co/docs/datasets/audio_process
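For example, here is a minimal sketch of decoding one row under datasets v4. The helper name decode_audio is made up for illustration, and it assumes the object returned by get_all_samples() exposes .data (a CPU tensor, assumed to be shaped channels x samples) and .sample_rate, as torchcodec's decoded samples do:

```python
import numpy as np


def decode_audio(decoder):
    """Decode every sample from an AudioDecoder-like object.

    Hypothetical helper, not part of datasets or torchcodec.
    Assumes decoder.get_all_samples() returns an object with
    .data (a CPU tensor or array-like) and .sample_rate.
    """
    samples = decoder.get_all_samples()
    # np.asarray converts a CPU tensor (or any array-like) to a NumPy array.
    waveform = np.asarray(samples.data)
    return waveform, samples.sample_rate


def demo():
    # Imported here so the helper above can be used without datasets installed.
    from datasets import load_dataset

    dataset = load_dataset("MrDragonFox/Elise", split="train")
    audio = dataset[0]["audio"]  # an AudioDecoder in datasets v4, not a dict
    waveform, sampling_rate = decode_audio(audio)
    print(waveform.shape, sampling_rate)
```

Calling demo() downloads the dataset and prints the decoded array's shape and sampling rate. Decoding only happens inside decode_audio, so rows you never decode stay cheap to access.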
To resolve this, you need to explicitly cast the audio column to the Audio feature. This will decode the audio data and make it accessible as an array. Here is the corrected code snippet:
from datasets import load_dataset, Audio
# Load your dataset
dataset = load_dataset("MrDragonFox/Elise", split="train")
# Explicitly cast the 'audio' column to the Audio feature
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
# Now you can access the decoded audio array
print(dataset[0]["audio"])
By adding the cast_column step, you are telling the datasets library to decode the audio data with the specified sampling rate, and you will then be able to access the audio array as you were used to in previous versions.