
[Audio] Path of Common Voice cannot be used for audio loading anymore

Open patrickvonplaten opened this issue 3 years ago • 13 comments

Describe the bug

Steps to reproduce the bug

from datasets import load_dataset
from torchaudio import load

ds = load_dataset("common_voice", "ab", split="train")

# both of the following commands fail at the moment
load(ds[0]["audio"]["path"])
load(ds[0]["path"])

Expected results

The path should be the complete absolute path to the downloaded audio file, not some relative path.

Actual results

~/hugging_face/venv_3.9/lib/python3.9/site-packages/torchaudio/backend/sox_io_backend.py in load(filepath, frame_offset, num_frames, normalize, channels_first, format)
    150                 filepath, frame_offset, num_frames, normalize, channels_first, format)
    151         filepath = os.fspath(filepath)
--> 152     return torch.ops.torchaudio.sox_io_load_audio_file(
    153         filepath, frame_offset, num_frames, normalize, channels_first, format)
    154

RuntimeError: Error loading audio file: failed to open file cv-corpus-6.1-2020-12-11/ab/clips/common_voice_ab_19904194.mp3

Environment info

  • datasets version: 1.18.3.dev0
  • Platform: Linux-5.4.0-96-generic-x86_64-with-glibc2.27
  • Python version: 3.9.1
  • PyArrow version: 3.0.0

patrickvonplaten avatar Feb 01 '22 18:02 patrickvonplaten

Having talked to @lhoestq, I see that this feature is no longer supported.

I really don't think this was a good idea. It is a major breaking change, and one for which we don't even have a working solution at the moment. That is bad for PyTorch, since we don't want to force people to have datasets decode audio files automatically, but really bad for TensorFlow and Flax, where we currently cannot even use datasets to load .mp3 files - e.g. common_voice doesn't work anymore in a TF training script. Note that this worked perfectly fine before the change (I think it was done here, no?)

IMO, it's really important to think about a solution here, and I strongly favor making a distinction between loading a dataset in streaming mode and in non-streaming mode, so that in non-streaming mode the actual downloaded file is exposed. It's really crucial for people to be able to analyse the original files IMO when the dataset is not downloaded in streaming mode.

There are the following reasons why it is paramount to have access to the original audio file in my opinion (in non-streaming mode):

  • There is a wide variety of different libraries to load audio data, with varying support on different platforms. For me it was quite clear that there is simply no single good library to load audio files for all platforms - so we have to leave the option to the user to decide which loading library to use.
  • We had support for audio datasets long before streaming audio was possible. For quite a few versions we advertised everywhere loading the audio from the path name (and there are many places where we still do, even though it's not possible anymore). To give some examples:
    • Official example of TF Wav2Vec2: https://github.com/huggingface/transformers/blob/f427e750490b486944cc9be3c99834ad5cf78b57/src/transformers/models/wav2vec2/modeling_tf_wav2vec2.py#L1423 Wav2Vec2 is as important for speech as BERT is for NLP, so this matters a lot. The official example currently doesn't work, and we don't even have a workaround for MP3 files at the moment. The same goes for Flax.
    • The most downloaded non-NLP checkpoint: https://huggingface.co/facebook/wav2vec2-base-960h#usage has a usage example which doesn't work anymore with the current datasets implementation. I'll update it now, but we have >1000 wav2vec2 checkpoints on the Hub and we can't update all the model cards. => This is a big breaking change with no current solution. For transformers, breaking changes are one of the biggest complaints.
  • Similarly, we also shouldn't assume that there is only one resampling method for audio. I think it's good to have one offered automatically by datasets, but we have to leave the user the freedom to choose their own resampling as well. Resampling can use very different filtering windows and other parameters, which are currently somewhat hardcoded in datasets and which users might very well want to change.

=> IMO, it's a very big priority to again have the correct absolute path in non-streaming mode. The other solution of providing a path-like object derived from the bytes stocked in the .array file is not nearly as user-friendly, but better than nothing.

patrickvonplaten avatar Feb 01 '22 20:02 patrickvonplaten

Agree that we need to have access to the original sound files. A few days ago I was looking for these original files because I suspected there is a bug in the audio resampling (confirmed in https://github.com/huggingface/datasets/issues/3662) and I wanted to do my own resampling to work around the bug, which is now not possible anymore due to the unavailability of the original files.

cahya-wirawan avatar Feb 01 '22 21:02 cahya-wirawan

@patrickvonplaten

The other solution of providing a path-like object derived from the bytes stocked in the .array file is not nearly as user-friendly, but better than nothing

Just to clarify, here you describe the approach that uses the Audio.decode attribute to access the underlying bytes?

The official example currently doesn't work and we don't even have a workaround for it for MP3 files at the moment

I'd assume this is because we use sox_io as a backend for decoding. However, soon we should be able to use soundfile, which supports path-like objects, for MP3 (https://github.com/huggingface/datasets/pull/3667#issuecomment-1030090627).

Your concern is reasonable, but there are situations where we can only serve bytes (see https://github.com/huggingface/datasets/pull/3685 for instance). IMO it makes sense to fix the affected datasets for now, but I don't think we should care too much whether we rely on local paths or bytes after soundfile adds support for MP3 as long as our examples work (shouldn't be too hard to update the map_to_array functions) and we properly document how to access the underlying path/bytes for custom decoding (via ds.cast_column("audio", Audio(decode=False))).

mariosasko avatar Feb 07 '22 18:02 mariosasko

Related to this discussion: in https://github.com/huggingface/datasets/pull/3664#issuecomment-1031866858 I propose how we could change iter_archive to work for streaming and also return local paths (as it used to!). I'd love your opinions on this.

lhoestq avatar Feb 07 '22 20:02 lhoestq

@patrickvonplaten

The other solution of providing a path-like object derived from the bytes stocked in the .array file is not nearly as user-friendly, but better than nothing

Just to clarify, here you describe the approach that uses the Audio.decode attribute to access the underlying bytes?

Yes!

The official example currently doesn't work and we don't even have a workaround for it for MP3 files at the moment

I'd assume this is because we use sox_io as a backend for decoding. However, soon we should be able to use soundfile, which supports path-like objects, for MP3 (#3667 (comment)). Your concern is reasonable, but there are situations where we can only serve bytes (see #3685 for instance). IMO it makes sense to fix the affected datasets for now, but I don't think we should care too much whether we rely on local paths or bytes after soundfile adds support for MP3 as long as our examples work (shouldn't be too hard to update the map_to_array functions) and we properly document how to access the underlying path/bytes for custom decoding (via ds.cast_column("audio", Audio(decode=False))).

Yes, this might be, but I highly doubt that soundfile would then be the go-to library for audio. @anton-l and I have tried out a bunch of different audio loading libraries (soundfile, librosa, torchaudio, pure ffmpeg, audioread, ...). One thing that was pretty clear to me is that there is just no "de-facto standard" library - they all have pros and cons. None of the libraries really supports "batch"-ed audio loading. Some depend on PyTorch. torchaudio is 100x faster (really!) than librosa's fallback on MP3. torchaudio often has problems with multi-processing, ... Also, we should keep in mind that resampling is similarly not as simple as reading a text file. It's a pretty complex signal processing transform, and people very well might want to use special filters, etc. At the moment we just hard-code torchaudio's or librosa's default filter when doing resampling.

=> All this to say that we should definitely care about whether we rely on local paths or bytes IMO. We don't want to lose all the users who are forced to use datasets' decoding or resampling, or who have to build a very unintuitive way of loading bytes into a numpy array. It's much more intuitive to be able to inspect a local file. I feel pretty strongly about this and am happy to also jump on a call. Keeping libraries flexible and lean, as well as exposing internals, is very important IMO (this philosophy has worked quite well so far with Transformers).

patrickvonplaten avatar Feb 08 '22 15:02 patrickvonplaten

Thanks a lot for the very detailed explanation. Now everything makes much more sense.

mariosasko avatar Feb 08 '22 16:02 mariosasko

From https://github.com/huggingface/datasets/pull/3736, the Common Voice dataset now gives access to the local audio files as before.

lhoestq avatar Feb 22 '22 09:02 lhoestq

I understand the argument that it is bad to have a breaking change. How to deal with the introduction of breaking changes is a topic of its own, and I'm not sure how you want to handle that (or is the policy that this is never allowed, and there must be a load_dataset_v2 or so if you really want to introduce a breaking change?).

Regardless of whether it is a breaking change, however, I don't see the other arguments.

but really bad for Tensorflow and Flax where we currently cannot even use datasets to load .mp3 files

I don't exactly understand this. Why not?

Why does the HF dataset on-the-fly decoding mechanism not work? And why is it specific to PyTorch or TensorFlow anyway? Isn't this independent of the framework?

But even if you just provide the raw bytes to TF, you could just use something like tfio.audio.decode_mp3 or tf.audio.decode_ogg or tfio.audio.decode_flac?

There are the following reasons why it is paramount to have access to the original audio file in my opinion ...

I don't really understand the arguments (apart from the fact that it maybe breaks existing code). You still have the original audio files anyway; they are just embedded in the dataset? I don't really know of any library which cannot also load the audio from memory (i.e. from the dataset).
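
To illustrate the point, even the stdlib wave module reads audio from an in-memory buffer just as happily as from a file path (a minimal sketch with synthetic data, not tied to any particular dataset):

```python
import io
import wave

# Build a one-second mono 16 kHz WAV entirely in memory.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x01" * 16000)  # 32000 bytes = 16000 16-bit frames

# Decode straight from the bytes - no file on disk involved.
buf.seek(0)
with wave.open(buf, "rb") as r:
    n_frames = r.getnframes()
    sample_rate = r.getframerate()

print(n_frames, sample_rate)
```

Most third-party decoders (soundfile, librosa, ...) accept file-like objects in the same way.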

Btw, on librosa being slow for decoding audio files: I saw that as well, so we have this comment in RETURNN:

Don't use librosa.load which internally uses audioread which would use Gstreamer as a backend which has multiple issues: https://github.com/beetbox/audioread/issues/62 https://github.com/beetbox/audioread/issues/63 Instead, use PySoundFile (soundfile), which is also faster. See here for discussions: https://github.com/beetbox/audioread/issues/64 https://github.com/librosa/librosa/issues/681

Resampling is also a separate aspect, which is also less straightforward and with different compromises between speed and quality. So there the different tradeoffs and different implementations can make a difference.

However, I don't see how this is related to the question whether there should be the raw bytes inside the dataset or as separate local files.

albertz avatar Apr 21 '22 11:04 albertz

Thanks for your comments here @albertz - cool to get your input!

Answering a bit here between the lines:

I understand the argument that it is bad to have a breaking change. How to deal with the introduction of breaking changes is a topic of its own and not sure how you want to deal with that (or is the policy this is never allowed, and there must be a load_dataset_v2 or so if you really want to introduce a breaking change?).

Regardless of whether it is a breaking change, however, I don't see the other arguments.

but really bad for Tensorflow and Flax where we currently cannot even use datasets to load .mp3 files

I don't exactly understand this. Why not?

Why does the HF dataset on-the-fly decoding mechanism not work? Why is it anyway specific to PyTorch or TensorFlow? Isn't this independent?

The problem with decoding on the fly is that we currently rely on torchaudio for this, which relies on torch, which is not necessarily something people would like to install when using TensorFlow or Flax. Therefore we cannot just rely on people using the on-the-fly decoding method. We just didn't find a library that is ML-framework independent and fast enough for all formats. torchaudio is currently, in our opinion, by far the best here.

So for TF and Flax it's important that users can load audio files or bytes the way they want to - this might become less important if we find (or make) a good library with few dependencies that is fast for all kinds of platforms / use cases.

Now the question is whether it's better to store audio data as a path to a file or as raw bytes, I guess.
My main argument for storing the audio data as a path to a file is pretty much all about user experience - I don't really expect our users to understand the inner workings of datasets:

    1. It's not straightforward to know which function to use to decode it - not all load_audio(...) or read_audio(...) functions work on raw bytes. E.g. looking at https://pytorch.org/audio/stable/torchaudio.html?highlight=load#torchaudio.load one would not see directly how to load raw bytes. There are also some functions of other libraries which only work on files, which would require the user to save the bytes as a file first before being able to load it.
    2. It's difficult to see which format the bytes are coming from (mp3, ogg, ...) - guess this could be remedied by adding the format to each sample though.
    3. It is a bit scary IMO for users to see raw bytes. Overall, I think it's better to leave the data in its raw form, as this way it's much easier for people to play around with the audio files, and there is less need to read docs because people don't have to worry about what happened to the audio files (are the bytes already resampled?)
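
The format-detection concern above is indeed easy to remedy: the container format can be sniffed from the first few bytes using standard magic numbers. A minimal sketch (the helper name is made up for illustration):

```python
def sniff_audio_format(data: bytes) -> str:
    """Guess the container format of an audio byte string from its magic number."""
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    if data[:4] == b"OggS":
        return "ogg"
    if data[:4] == b"fLaC":
        return "flac"
    # MP3 files start with an ID3 tag or directly with a frame sync (0xFFEx).
    if data[:3] == b"ID3" or (len(data) > 1 and data[0] == 0xFF and data[1] & 0xE0 == 0xE0):
        return "mp3"
    return "unknown"

print(sniff_audio_format(b"ID3\x04\x00" + b"\x00" * 16))  # mp3
```

This is roughly what general-purpose sniffers like `filetype` or `python-magic` do under the hood.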

But the argument that the audio should be loadable directly from memory is a good one - I hadn't thought about this too much. I guess it's still very much possible for the user to do this:

def save_as_bytes(batch):
    # read_in_bytes_from_file is a placeholder helper for reading the file's bytes
    batch["bytes"] = read_in_bytes_from_file(batch["file"])
    os.remove(batch["file"])
    return batch

ds = ds.map(save_as_bytes)

ds.save_to_disk(...)

Guess the question is more about what the default case should be?

patrickvonplaten avatar Apr 21 '22 11:04 patrickvonplaten

The problem with decoding on the fly is that we currently rely on torchaudio for this, which relies on torch, which is not necessarily something people would like to install when using TensorFlow or Flax. Therefore we cannot just rely on people using the on-the-fly decoding method. We just didn't find a library that is ML-framework independent and fast enough for all formats. torchaudio is currently, in our opinion, by far the best here.

But how is this relevant to this issue here? I thought this issue is about having the (correct) path in the dataset versus having raw bytes in the dataset.

How did TF users use it at all then? Or do they just not use on-the-fly decoding? I did not even notice this problem (maybe because I had torchaudio installed). But what do they use instead?

But as I outlined before, they could just use tfio.audio.decode_flac and co, where it would be more natural if you already provide the raw bytes.

Looking at https://pytorch.org/audio/stable/torchaudio.html?highlight=load#torchaudio.load one would not see directly how to load raw bytes

I was not really familiar with torchaudio. It seems that they really don't provide an easy/direct API to operate on raw bytes, which is very strange and unfortunate, because as far as I can see, all the underlying backend libraries (e.g. soundfile) easily allow that. So I would say that this is the fault of torchaudio then. But regardless, if you use torchaudio with the soundfile backend anyway, why not just use soundfile directly? It's very simple to use and cross-platform.

But ok, now we are just discussing how to handle the on-the-fly decoding. I still think this is a separate issue and having raw bytes in the dataset instead of local files should just be fine as well.

It is a bit scary IMO to see raw bytes for users.

I think nobody who writes code is scared by seeing the raw bytes content of a binary file. :)

I guess it's still very much possible for the user to do this:

def save_as_bytes(batch):
    batch["bytes"] = read_in_bytes_from_file(batch["file"])
    os.remove(batch["file"])
    return batch

ds = ds.map(save_as_bytes)

ds.save_to_disk(...)

In https://github.com/huggingface/datasets/pull/4184#issuecomment-1105191639, you said/proposed that this map is not needed anymore and save_to_disk could do it automatically (maybe via some option)?

Guess the question is more about what the default case should be?

Yea this is up to you. I'm happy as long as we can get it the way we want easily and this is a well supported use case. :)

albertz avatar Apr 21 '22 14:04 albertz

In https://github.com/huggingface/datasets/pull/4184#issuecomment-1105191639, you said/proposed that this map is not needed anymore and save_to_disk could do it automatically (maybe via some option)?

Yes! Should be super easy now see discussion here: https://github.com/rwth-i6/i6_core/issues/257#issuecomment-1105494468

Thanks for the super useful input :-)

patrickvonplaten avatar Apr 21 '22 19:04 patrickvonplaten

Despite the comments that this has been fixed, I am finding the exact same problem is occurring again (with datasets version 2.3.2)

DCNemesis avatar Jul 13 '22 12:07 DCNemesis

Despite the comments that this has been fixed, I am finding the exact same problem is occurring again (with datasets version 2.3.2)

It appears downgrading to torchaudio 0.11.0 fixed this problem.

DCNemesis avatar Jul 13 '22 18:07 DCNemesis

@DCNemesis, sorry, which problem exactly is occurring again? Also cc @lhoestq @polinaeterna here

patrickvonplaten avatar Aug 23 '22 15:08 patrickvonplaten

@patrickvonplaten @lhoestq @polinaeterna I was unable to load audio from Common Voice using 🤗 with the current version of torchaudio, but downgrading to torchaudio 0.11.0 fixed it. This is probably more of a torch problem than a Hugging Face problem.

DCNemesis avatar Aug 23 '22 16:08 DCNemesis

@DCNemesis that's interesting, could you please share the error message if you still can access it?

polinaeterna avatar Aug 24 '22 13:08 polinaeterna

@polinaeterna I believe it is the same exact error as above. It occurs on other .mp3 sources as well, but the problem is with torchaudio > 0.11.0. I've created a short colab notebook that reproduces the error, and the fix here: https://colab.research.google.com/drive/18wsuwdHwBPN3JkcnhEtk8MUYqF9swuWZ?usp=sharing

DCNemesis avatar Aug 24 '22 14:08 DCNemesis

Hi @DCNemesis,

Your issue was slightly different from the original one in this issue page. Yours seems related to a change in the backend used by torchaudio (ffmpeg instead of sox). Refer to the issue page here:

  • #4776

Normally, it should be circumvented with the patch made by @polinaeterna in:

  • #4923

albertvillanova avatar Sep 21 '22 14:09 albertvillanova

I think the original issue reported here was already fixed by:

  • #3736

Otherwise, feel free to reopen.

albertvillanova avatar Sep 21 '22 14:09 albertvillanova