dataset-viewer opus decoding error

see https://huggingface.co/datasets/stable-speech/mls_eng_10k/discussions/1#65ef6e9d440a5fc3d94a40ad

To fix this, maybe we should pin the libsndfile library (used by soundfile) to >=1.0.31 (the first version that supported opus), as we do in the datasets library.

Mar 13 '24 12:03 polinaeterna

cc @severo

(I didn't even know that we rely on ffmpeg. If I remember correctly, it wasn't working well with opus a year ago.)

Mar 13 '24 12:03 polinaeterna

previous issue: https://github.com/huggingface/datasets-server/issues/194

Mar 13 '24 14:03 severo

See also https://github.com/huggingface/datasets-server/pull/608/commits/dadb8f207d826ce4d37f0e51ebaaeaa474ab0379, where we replaced compilation of libsndfile1 with installation from the repositories, since the distributed version at that point supported opus.

Mar 13 '24 15:03 severo

Currently, the workers have:

I have no name!@prod-datasets-server-worker-medium-cbf4c8846-4vm5r:/src/services/worker$ apt-cache show libsndfile1
Package: libsndfile1
Status: install ok installed
Priority: optional
Section: libs
Installed-Size: 574
Maintainer: Debian Multimedia Maintainers <debian-multimedia@lists.debian.org>
Architecture: amd64
Multi-Arch: same
Source: libsndfile
Version: 1.2.0-1
Depends: libc6 (>= 2.33), libflac12 (>= 1.3.0), libmp3lame0 (>= 3.100), libmpg123-0 (>= 1.28.0), libogg0 (>= 1.3.0), libopus0 (>= 1.1), libvorbis0a (>= 1.2.3), libvorbisenc2 (>= 1.1.2)
Description: Library for reading/writing audio files
 libsndfile is a library of C routines for reading and writing files containing
 sampled audio data.
 .
 Various versions of WAV (integer, floating point, GSM, and compressed formats);
 Microsoft PCM, A-law and u-law formats; AIFF, AIFC and RIFX; various AU/SND
 formats (Sun/NeXT, Dec AU, G721 and G723 ADPCM); RAW header-less PCM files;
 Amiga IFF/8SVX/16SV PCM files; Ensoniq PARIS  (.PAF); Apple's Core Audio Format
 (CAF) and others.
Description-md5: 67b723b50c9aa944fba48e79d51e9d5c
Homepage: http://www.mega-nerd.com/libsndfile/

and in their python interpreter:

>>> import importlib.util
>>> from packaging import version
>>> IS_OPUS_SUPPORTED = importlib.util.find_spec("soundfile") is not None and version.parse(
...     importlib.import_module("soundfile").__libsndfile_version__
... ) >= version.parse("1.0.31")
>>> IS_OPUS_SUPPORTED
True

Mar 13 '24 15:03 severo

So: it does not seem to be an issue with the versions of soundfile / libsndfile1.

I unassign myself, as I have no particular clues about it. @huggingface/datasets-server feel free to pick it.

Mar 14 '24 10:03 severo

Any updates on this?

Apr 05 '24 08:04 ylacombe

Up! It would be very helpful to have this!

Apr 15 '24 13:04 ylacombe

Hello, it seems there was an issue with the new 0.12 OPUS format that had to be fixed in libsndfile.

See issue: https://github.com/libsndfile/libsndfile/issues/888
See fixing PR: https://github.com/libsndfile/libsndfile/pull/926

EDIT: The fix was released in libsndfile-1.2.1 (~~libsndfile-1.2.2~~).

Therefore, I think the only solution on our side would be to force the use of libsndfile-1.2.2 (we currently use 1.2.0).

The problem is that this version is not yet officially distributed by the Linux package managers.
- @polinaeterna do you remember whether we can ask them to update their distributed version, as we did the last time we had a similar issue? If I remember correctly, you contacted them.
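
To make the version reasoning above concrete, here is a minimal sketch that checks whether an installed package version already includes the fix. It assumes the fix first shipped in 1.2.1, as stated in the EDIT above; the helper names `parse_upstream_version` and `has_opus_fix` are hypothetical and not part of dataset-viewer.

```python
import re
from typing import Tuple

# Assumption: the libsndfile fix first shipped in 1.2.1 (per the EDIT above).
FIRST_FIXED = (1, 2, 1)


def parse_upstream_version(package_version: str) -> Tuple[int, ...]:
    """Extract the upstream part of a package version such as '1.2.0-1'."""
    # Drop an optional Debian epoch ("1:...") and the packaging revision ("-1").
    upstream = package_version.split(":", 1)[-1].split("-", 1)[0]
    if re.fullmatch(r"\d+(\.\d+)*", upstream) is None:
        raise ValueError(f"unrecognized version: {package_version!r}")
    return tuple(int(part) for part in upstream.split("."))


def has_opus_fix(package_version: str) -> bool:
    """Return True if the version is at least the first fixed release."""
    return parse_upstream_version(package_version) >= FIRST_FIXED


print(has_opus_fix("1.2.0-1"))  # version reported by apt-cache on the workers
print(has_opus_fix("1.2.2"))    # version we would like to force
```

Comparing integer tuples avoids the pitfalls of comparing version strings lexically (for example "1.10" against "1.9").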

Apr 15 '24 13:04 albertvillanova

A painful alternative is to build it from source, as we were doing before (it's in the git history)

Apr 15 '24 13:04 severo

Yes, if we have no other option... :disappointed:

Apr 15 '24 13:04 albertvillanova

Maybe we are lucky? :crossed_fingers: https://launchpad.net/ubuntu/+source/libsndfile/1.2.2-1ubuntu5

Apr 15 '24 14:04 albertvillanova

Could we check? @severo I don't know how...

Apr 15 '24 14:04 albertvillanova

Maybe there is nothing to do, because we don't pin the libsndfile1 version: https://github.dev/huggingface/dataset-viewer/blob/c355f4bbbd82a62abbb5a7d927e403659c18ebbd/services/worker/Dockerfile#L19-L20

Maybe add the problematic opus file to the services/worker tests, and ensure we can process it

or in the e2e, where we're sure the installed libsndfile1 package will be the same in production

Apr 15 '24 14:04 severo

Unfortunately, the problem persists.

Apr 16 '24 06:04 albertvillanova

Here is a snippet from one of my older Dockerfiles, where I had to build libsndfile from source, in case it helps resolve this issue:

FROM nvidia/cuda:11.6.2-cudnn8-devel-ubuntu20.04

# requirements for libsndfile build
RUN apt update && DEBIAN_FRONTEND=noninteractive \
    apt install -qqy pkg-config libtool autoconf autogen automake build-essential libasound2-dev \
    libflac-dev libogg-dev libvorbis-dev libopus-dev libmp3lame-dev \
    libmpg123-dev python

# build libsndfile
RUN git clone https://github.com/libsndfile/libsndfile.git
WORKDIR /libsndfile/
RUN git checkout e5ee50fbda1b9049a45fc65d06c34825feb4f237
RUN autoreconf -vif
RUN ./configure --enable-werror
RUN autoreconf -vif
RUN make
RUN make install

# I do not recall why this was necessary
RUN mkdir /usr/local/lib/python3.8/dist-packages/_soundfile_data/
RUN cp /usr/local/lib/libsndfile.* /usr/local/lib/python3.8/dist-packages/_soundfile_data/

May 01 '24 17:05 brthor

Thanks. It's also what we were doing before:

https://github.com/huggingface/dataset-viewer/blob/a22b5fd967ff3cc0c0d52615dfd73455a73b966d/services/worker/Dockerfile#L16-L33

Should we restore this @polinaeterna @albertvillanova?

May 02 '24 11:05 severo

Hey there, any news on this? let me know if I can help!

May 09 '24 12:05 ylacombe

Hi, I can restore the building of libsndfile from source as we did before.

At least until it is officially distributed by the Linux package managers... See my comment above: https://github.com/huggingface/dataset-viewer/issues/2584#issuecomment-2056921414

May 10 '24 09:05 albertvillanova

After a more detailed investigation, and while trying to implement a test, I discovered that the error is not caused by libsndfile1 but by pydub when it calls ffmpeg. See the insightful comment by @severo: https://huggingface.co/datasets/parler-tts/mls_eng_10k/discussions/1#65ef6e9d440a5fc3d94a40ad

The reason is:

pydub.exceptions.CouldntDecodeError: Decoding failed. ffmpeg returned error code: 
...
Unknown input format: 'opus'

I am opening another PR with a test that raises that error and with a proposed fix. Let's see if that fixes the issue.
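
For illustration only: the message suggests that ffmpeg was asked to read an input format named `opus`. One plausible explanation, which is an assumption and not confirmed in this thread, is that the format name was derived from the file extension, whereas Opus audio is usually stored in an Ogg container that ffmpeg reads with its `ogg` demuxer. The sketch below shows that kind of mapping; the table and the helper `ffmpeg_input_format` are hypothetical and not necessarily what the PR does.

```python
from pathlib import Path
from typing import Optional

# Hypothetical table of extensions that are not valid ffmpeg *input* format
# names. Assumption: ".opus" files are Ogg containers, read with "-f ogg".
EXTENSION_TO_INPUT_FORMAT = {"opus": "ogg"}


def ffmpeg_input_format(path: str) -> Optional[str]:
    """Guess the ffmpeg input format name from a file extension."""
    suffix = Path(path).suffix.lower().lstrip(".")
    if not suffix:
        # No extension: let ffmpeg probe the file content instead.
        return None
    return EXTENSION_TO_INPUT_FORMAT.get(suffix, suffix)


print(ffmpeg_input_format("sample.opus"))
print(ffmpeg_input_format("sample.mp3"))
```

The returned name could then be passed as the `format` argument of pydub's `AudioSegment.from_file`, rather than the raw extension.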

May 15 '24 15:05 albertvillanova

To get the list of affected datasets in the database:

db.cachedResponsesBlue.aggregate([
  {
    $match: {
      error_code: "RowsPostProcessingError",
      "details.cause_exception": "CouldntDecodeError",
      "details.cause_message": { $regex: "Unknown input format: 'opus'" },
    },
  },
  {
    $group: {
      _id: "$dataset",
    },
  },
]);

I refreshed all of them
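
For readers less familiar with aggregation pipelines, here is a rough pure-Python equivalent of the `$match` + `$group` stages above, run on in-memory documents. The field names come from the query; the sample documents and dataset names are made up for illustration.

```python
import re
from typing import Any, Dict, Iterable, List

OPUS_ERROR = re.compile(r"Unknown input format: 'opus'")


def affected_datasets(documents: Iterable[Dict[str, Any]]) -> List[str]:
    """Return the distinct datasets whose cached response matches the filter."""
    datasets = set()
    for doc in documents:
        details = doc.get("details") or {}
        if (
            doc.get("error_code") == "RowsPostProcessingError"
            and details.get("cause_exception") == "CouldntDecodeError"
            and OPUS_ERROR.search(details.get("cause_message") or "")
        ):
            # Equivalent of $group with _id: "$dataset" (one entry per dataset).
            datasets.add(doc["dataset"])
    return sorted(datasets)


# Hypothetical cache entries, shaped like the fields used in the query.
sample = [
    {"dataset": "org/dataset-a", "error_code": "RowsPostProcessingError",
     "details": {"cause_exception": "CouldntDecodeError",
                 "cause_message": "Decoding failed. Unknown input format: 'opus'"}},
    {"dataset": "org/dataset-a", "error_code": "RowsPostProcessingError",
     "details": {"cause_exception": "CouldntDecodeError",
                 "cause_message": "Decoding failed. Unknown input format: 'opus'"}},
    {"dataset": "org/dataset-b", "error_code": "RowsPostProcessingError",
     "details": {"cause_exception": "CouldntDecodeError",
                 "cause_message": "Invalid data found when processing input"}},
    {"dataset": "org/dataset-c", "error_code": None, "details": None},
]

print(affected_datasets(sample))
```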

May 16 '24 10:05 severo

Finally, I refreshed all the entries matching:

{
      error_code: "RowsPostProcessingError",
      "details.cause_exception": "CouldntDecodeError",
}

which is a bit more than 500 datasets. Nearly all the errors have been fixed (though it's still in progress). A few "Invalid data found when processing input" errors remain.

May 16 '24 11:05 severo