
opus decoding error

Open polinaeterna opened this issue 2 years ago • 16 comments

see https://huggingface.co/datasets/stable-speech/mls_eng_10k/discussions/1#65ef6e9d440a5fc3d94a40ad

To fix this, maybe we should require libsndfile >=1.0.31 (the first version that supported opus) for the soundfile library, as we do in the datasets library.

polinaeterna avatar Mar 13 '24 12:03 polinaeterna

cc @severo

(I didn't even know that we rely on ffmpeg. If I remember correctly, it wasn't working well with opus a year ago.)

polinaeterna avatar Mar 13 '24 12:03 polinaeterna

previous issue: https://github.com/huggingface/datasets-server/issues/194

severo avatar Mar 13 '24 14:03 severo

See also https://github.com/huggingface/datasets-server/pull/608/commits/dadb8f207d826ce4d37f0e51ebaaeaa474ab0379, where we replaced compilation of libsndfile1 with installation from the repositories, since the distributed version at that point supported opus.

severo avatar Mar 13 '24 15:03 severo

Currently, the workers have:

I have no name!@prod-datasets-server-worker-medium-cbf4c8846-4vm5r:/src/services/worker$ apt-cache show libsndfile1
Package: libsndfile1
Status: install ok installed
Priority: optional
Section: libs
Installed-Size: 574
Maintainer: Debian Multimedia Maintainers <[email protected]>
Architecture: amd64
Multi-Arch: same
Source: libsndfile
Version: 1.2.0-1
Depends: libc6 (>= 2.33), libflac12 (>= 1.3.0), libmp3lame0 (>= 3.100), libmpg123-0 (>= 1.28.0), libogg0 (>= 1.3.0), libopus0 (>= 1.1), libvorbis0a (>= 1.2.3), libvorbisenc2 (>= 1.1.2)
Description: Library for reading/writing audio files
 libsndfile is a library of C routines for reading and writing files containing
 sampled audio data.
 .
 Various versions of WAV (integer, floating point, GSM, and compressed formats);
 Microsoft PCM, A-law and u-law formats; AIFF, AIFC and RIFX; various AU/SND
 formats (Sun/NeXT, Dec AU, G721 and G723 ADPCM); RAW header-less PCM files;
 Amiga IFF/8SVX/16SV PCM files; Ensoniq PARIS  (.PAF); Apple's Core Audio Format
 (CAF) and others.
Description-md5: 67b723b50c9aa944fba48e79d51e9d5c
Homepage: http://www.mega-nerd.com/libsndfile/

and in their python interpreter:

>>> IS_OPUS_SUPPORTED = importlib.util.find_spec("soundfile") is not None and version.parse(
...     importlib.import_module("soundfile").__libsndfile_version__
... ) >= version.parse("1.0.31")
>>> IS_OPUS_SUPPORTED
True
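For reference, the same check can be written as a self-contained function. This is a minimal sketch, not the code the workers run: the helper names `parse_version`, `is_opus_supported` and `installed_libsndfile_version` are hypothetical, and the only soundfile attribute it relies on is `__libsndfile_version__`, the one queried above. It compares plain tuples instead of using `packaging.version`, so it needs nothing beyond the standard library.

```python
import importlib
import importlib.util
import re


def parse_version(text):
    """Turn a dotted version such as "1.2.0" into a tuple of ints: (1, 2, 0)."""
    # Keep only the leading numeric components; a suffix such as "-1" is ignored.
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"not a dotted version: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def is_opus_supported(libsndfile_version, minimum="1.0.31"):
    """Return True when libsndfile_version is at least `minimum`."""
    return parse_version(libsndfile_version) >= parse_version(minimum)


def installed_libsndfile_version():
    """Return the libsndfile version reported by soundfile, or None if soundfile is missing."""
    if importlib.util.find_spec("soundfile") is None:
        return None
    return importlib.import_module("soundfile").__libsndfile_version__


if __name__ == "__main__":
    found = installed_libsndfile_version()
    if found is None:
        print("soundfile is not installed")
    else:
        print(found, "supports opus:", is_opus_supported(found))
```

The `minimum` argument makes it easy to repeat the check against a different threshold if a later libsndfile release turns out to be the relevant one.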

severo avatar Mar 13 '24 15:03 severo

So: it does not seem to be an issue with the versions of soundfile / libsndfile1.

I'm unassigning myself, as I have no particular clue about the cause. @huggingface/datasets-server feel free to pick it up.

severo avatar Mar 14 '24 10:03 severo

Any updates on this?

ylacombe avatar Apr 05 '24 08:04 ylacombe

Up! It would be very helpful to have this!

ylacombe avatar Apr 15 '24 13:04 ylacombe

Hello, it seems there was an issue with the new 0.12 OPUS format that had to be fixed in libsndfile.

  • See issue: https://github.com/libsndfile/libsndfile/issues/888
  • See fixing PR: https://github.com/libsndfile/libsndfile/pull/926

EDIT: The fix was released in libsndfile-1.2.1 (~~libsndfile-1.2.2~~).

Therefore, I think the only solution on our side would be to force the use of libsndfile-1.2.1 or later (we currently use 1.2.0).

  • The problem is that this version is not officially distributed by the Linux package managers yet.
    • @polinaeterna do you remember if we can ask them to update their distributed version, as we did the last time we had a similar issue? If I remember correctly, I think you contacted them.

albertvillanova avatar Apr 15 '24 13:04 albertvillanova

A painful alternative is to build it from source, as we were doing before (it's in the git history).

severo avatar Apr 15 '24 13:04 severo

Yes, if we have no other option... :disappointed:

albertvillanova avatar Apr 15 '24 13:04 albertvillanova

Maybe we are lucky? :crossed_fingers: https://launchpad.net/ubuntu/+source/libsndfile/1.2.2-1ubuntu5

albertvillanova avatar Apr 15 '24 14:04 albertvillanova

Could we check? @severo I don't know how...

albertvillanova avatar Apr 15 '24 14:04 albertvillanova

Maybe there is nothing to do, because we don't pin the libsndfile1 version: https://github.dev/huggingface/dataset-viewer/blob/c355f4bbbd82a62abbb5a7d927e403659c18ebbd/services/worker/Dockerfile#L19-L20

Maybe add the problematic opus file to the services/worker tests and ensure we can process it, or add it to the e2e tests, where we're sure the installed libsndfile1 package will be the same as in production.

severo avatar Apr 15 '24 14:04 severo

Unfortunately, the problem persists.

albertvillanova avatar Apr 16 '24 06:04 albertvillanova

Here is a snippet from one of my older Dockerfiles, where I had to build libsndfile from source, in case it helps resolve this issue:

FROM nvidia/cuda:11.6.2-cudnn8-devel-ubuntu20.04

# requirements for libsndfile build
RUN apt update && DEBIAN_FRONTEND=noninteractive \
    apt install -qqy pkg-config libtool autoconf autogen automake build-essential libasound2-dev \
    libflac-dev libogg-dev libvorbis-dev libopus-dev libmp3lame-dev \
    libmpg123-dev python

# build libsndfile
RUN git clone https://github.com/libsndfile/libsndfile.git
WORKDIR /libsndfile/
RUN git checkout e5ee50fbda1b9049a45fc65d06c34825feb4f237
RUN autoreconf -vif
RUN ./configure --enable-werror
RUN autoreconf -vif
RUN make
RUN make install

# I do not recall why this was necessary
RUN mkdir /usr/local/lib/python3.8/dist-packages/_soundfile_data/
RUN cp /usr/local/lib/libsndfile.* /usr/local/lib/python3.8/dist-packages/_soundfile_data/

brthor avatar May 01 '24 17:05 brthor

Thanks. It's also what we were doing before:

https://github.com/huggingface/dataset-viewer/blob/a22b5fd967ff3cc0c0d52615dfd73455a73b966d/services/worker/Dockerfile#L16-L33

Should we restore this @polinaeterna @albertvillanova?

severo avatar May 02 '24 11:05 severo

Hey there, any news on this? Let me know if I can help!

ylacombe avatar May 09 '24 12:05 ylacombe

Hi, I can restore the building of libsndfile from source as we did before.

At least until a fixed version is officially distributed by the Linux package managers... See my comment above: https://github.com/huggingface/dataset-viewer/issues/2584#issuecomment-2056921414

albertvillanova avatar May 10 '24 09:05 albertvillanova

After a more detailed investigation and when trying to implement a test, I discovered that the error is not caused by libsndfile1 but by pydub when it calls ffmpeg. See insightful comment by @severo: https://huggingface.co/datasets/parler-tts/mls_eng_10k/discussions/1#65ef6e9d440a5fc3d94a40ad

The reason is:

pydub.exceptions.CouldntDecodeError: Decoding failed. ffmpeg returned error code: 
...
Unknown input format: 'opus'

I am opening another PR with a test that raises that error and with a proposed fix. Let's see if that fixes the issue.
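One plausible way to avoid this error is sketched below. It rests on an assumption about the failure and is not necessarily the fix proposed in the PR: `.opus` files are Ogg containers holding Opus audio, which ffmpeg reads through its `ogg` demuxer, so passing the file extension straight through as the input format fails. The helper name `ffmpeg_input_format` and its mapping table are hypothetical.

```python
import os

# Hypothetical mapping from file extensions to ffmpeg input format names.
# Assumption: ".opus" files are Ogg containers holding Opus audio, which
# ffmpeg reads with its "ogg" demuxer.
EXTENSION_TO_FFMPEG_FORMAT = {
    "opus": "ogg",
}


def ffmpeg_input_format(path):
    """Return the format name to hand to ffmpeg, or None to let ffmpeg probe the file."""
    extension = os.path.splitext(path)[1].lstrip(".").lower()
    if not extension:
        return None
    # Unknown extensions are passed through unchanged.
    return EXTENSION_TO_FFMPEG_FORMAT.get(extension, extension)


if __name__ == "__main__":
    for name in ("clip.opus", "clip.mp3", "clip"):
        print(name, "->", ffmpeg_input_format(name))
    # With pydub installed, the result could be passed along like this:
    #   from pydub import AudioSegment
    #   segment = AudioSegment.from_file(path, format=ffmpeg_input_format(path))
```

Keeping the mapping in one table means other extensions that differ from their ffmpeg format name could be added without touching the call site.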

albertvillanova avatar May 15 '24 15:05 albertvillanova

To get the list of affected datasets in the database:

db.cachedResponsesBlue.aggregate([
  {
    $match: {
      error_code: "RowsPostProcessingError",
      "details.cause_exception": "CouldntDecodeError",
      "details.cause_message": { $regex: "Unknown input format: 'opus'" },
    },
  },
  {
    $group: {
      _id: "$dataset",
    },
  },
]);

I refreshed all of them
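The same query can also be built from Python. This is a sketch under assumptions: the function name `affected_datasets_pipeline` is hypothetical, and the commented usage assumes a pymongo `Database` object named `db` that points at the cache database.

```python
def affected_datasets_pipeline(cause_exception, message_regex=None):
    """Build an aggregation pipeline listing the datasets that hit a given error."""
    match = {
        "error_code": "RowsPostProcessingError",
        "details.cause_exception": cause_exception,
    }
    if message_regex is not None:
        # Narrow the match to error messages that contain this pattern.
        match["details.cause_message"] = {"$regex": message_regex}
    return [
        {"$match": match},
        # One output document per distinct dataset name.
        {"$group": {"_id": "$dataset"}},
    ]


if __name__ == "__main__":
    pipeline = affected_datasets_pipeline(
        "CouldntDecodeError", "Unknown input format: 'opus'"
    )
    print(pipeline)
    # Assumed usage with pymongo, where `db` is a pymongo Database:
    #   names = [doc["_id"] for doc in db["cachedResponsesBlue"].aggregate(pipeline)]
```

Omitting `message_regex` matches every `CouldntDecodeError`, whatever the message.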

severo avatar May 16 '24 10:05 severo

Finally, I refreshed all of:

{
      error_code: "RowsPostProcessingError",
      "details.cause_exception": "CouldntDecodeError",
}

which is a bit more than 500 datasets. Nearly all the errors have been fixed (though the refresh is still in progress). A few datasets still fail with "Invalid data found when processing input".

severo avatar May 16 '24 11:05 severo