opus decoding error
see https://huggingface.co/datasets/stable-speech/mls_eng_10k/discussions/1#65ef6e9d440a5fc3d94a40ad
To fix this, maybe we should pin libsndfile (the library used by soundfile) to >=1.0.31, the first version that supported opus, like we do in the datasets library.
cc @severo
(I didn't even know that we rely on ffmpeg. If I remember correctly, it wasn't working well with opus a year ago.)
previous issue: https://github.com/huggingface/datasets-server/issues/194
See also https://github.com/huggingface/datasets-server/pull/608/commits/dadb8f207d826ce4d37f0e51ebaaeaa474ab0379, where we replaced compilation of libsndfile1 with installation from the repositories, since the distributed version at that point supported opus.
Currently, the workers have:
I have no name!@prod-datasets-server-worker-medium-cbf4c8846-4vm5r:/src/services/worker$ apt-cache show libsndfile1
Package: libsndfile1
Status: install ok installed
Priority: optional
Section: libs
Installed-Size: 574
Maintainer: Debian Multimedia Maintainers <[email protected]>
Architecture: amd64
Multi-Arch: same
Source: libsndfile
Version: 1.2.0-1
Depends: libc6 (>= 2.33), libflac12 (>= 1.3.0), libmp3lame0 (>= 3.100), libmpg123-0 (>= 1.28.0), libogg0 (>= 1.3.0), libopus0 (>= 1.1), libvorbis0a (>= 1.2.3), libvorbisenc2 (>= 1.1.2)
Description: Library for reading/writing audio files
libsndfile is a library of C routines for reading and writing files containing
sampled audio data.
.
Various versions of WAV (integer, floating point, GSM, and compressed formats);
Microsoft PCM, A-law and u-law formats; AIFF, AIFC and RIFX; various AU/SND
formats (Sun/NeXT, Dec AU, G721 and G723 ADPCM); RAW header-less PCM files;
Amiga IFF/8SVX/16SV PCM files; Ensoniq PARIS (.PAF); Apple's Core Audio Format
(CAF) and others.
Description-md5: 67b723b50c9aa944fba48e79d51e9d5c
Homepage: http://www.mega-nerd.com/libsndfile/
and in their python interpreter:
>>> IS_OPUS_SUPPORTED = importlib.util.find_spec("soundfile") is not None and version.parse(
... importlib.import_module("soundfile").__libsndfile_version__
... ) >= version.parse("1.0.31")
>>> IS_OPUS_SUPPORTED
True
So: it does not seem to be an issue with the versions of soundfile / libsndfile1.
I'm unassigning myself, as I have no particular clue about it. @huggingface/datasets-server feel free to pick it up.
Any updates on this?
Up! It would be very helpful to have this!
Hello, it seems there was an issue with the new 0.12 OPUS format that had to be fixed in libsndfile.
- See issue: https://github.com/libsndfile/libsndfile/issues/888
- See fixing PR: https://github.com/libsndfile/libsndfile/pull/926
EDIT: The fix was released in libsndfile-1.2.1 (~~libsndfile-1.2.2~~).
Therefore, I think the only solution on our side would be to force the use of libsndfile-1.2.2 (we currently use 1.2.0).
- The problem is that this version is not officially distributed by the Linux package managers yet.
- @polinaeterna do you remember if we can ask them to update their distributed version, as we did the last time we had a similar issue? If I remember correctly, I think you contacted them.
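To check whether a given environment already ships a libsndfile that contains the fix, a version comparison along the following lines might help. This is only a minimal sketch: `parse_version`, `has_opus_fix` and `MIN_FIXED_VERSION` are hypothetical names, and the 1.2.1 threshold is taken from the EDIT above. The input string could come from `soundfile.__libsndfile_version__` or from the `Version:` field reported by `apt-cache show libsndfile1`.

```python
import re

# Assumption: the fix landed in libsndfile 1.2.1, as stated in the EDIT above.
MIN_FIXED_VERSION = (1, 2, 1)


def parse_version(text: str) -> tuple[int, ...]:
    """Extract the leading dotted numeric part, e.g. '1.2.0-1' -> (1, 2, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def has_opus_fix(libsndfile_version: str) -> bool:
    """Return True if the version is at least MIN_FIXED_VERSION."""
    return parse_version(libsndfile_version) >= MIN_FIXED_VERSION


if __name__ == "__main__":
    # Version strings that appear in this thread.
    for candidate in ("1.2.0-1", "1.2.1", "1.2.2-1ubuntu5"):
        print(candidate, has_opus_fix(candidate))
```

Debian-style suffixes such as `-1` or `-1ubuntu5` are deliberately ignored here, since only the upstream version matters for this check.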
A painful alternative is to build it from source, as we were doing before (it's in the git history)
Yes, if we have no other option... :disappointed:
Maybe we are lucky? :crossed_fingers: https://launchpad.net/ubuntu/+source/libsndfile/1.2.2-1ubuntu5
Could we check? @severo I don't know how...
Maybe there is nothing to do, because we don't pin the libsndfile1 version: https://github.dev/huggingface/dataset-viewer/blob/c355f4bbbd82a62abbb5a7d927e403659c18ebbd/services/worker/Dockerfile#L19-L20
Maybe add the problematic opus file to the services/worker tests, and ensure we can process it
or in the e2e, where we're sure the installed libsndfile1 package will be the same in production
Unfortunately, the problem persists.
A snippet from one of my older dockerfiles where I had to build libsndfile from source in case it helps resolve this issue:
FROM nvidia/cuda:11.6.2-cudnn8-devel-ubuntu20.04
# requirements for libsndfile build
RUN apt update && DEBIAN_FRONTEND=noninteractive \
    apt install -qqy pkg-config libtool autoconf autogen automake build-essential libasound2-dev \
    libflac-dev libogg-dev libvorbis-dev libopus-dev libmp3lame-dev \
    libmpg123-dev python
# build libsndfile
RUN git clone https://github.com/libsndfile/libsndfile.git
WORKDIR /libsndfile/
RUN git checkout e5ee50fbda1b9049a45fc65d06c34825feb4f237
RUN autoreconf -vif
RUN ./configure --enable-werror
RUN autoreconf -vif
RUN make
RUN make install
# I do not recall why this was necessary
RUN mkdir /usr/local/lib/python3.8/dist-packages/_soundfile_data/
RUN cp /usr/local/lib/libsndfile.* /usr/local/lib/python3.8/dist-packages/_soundfile_data/
Thanks. It's also what we were doing before:
https://github.com/huggingface/dataset-viewer/blob/a22b5fd967ff3cc0c0d52615dfd73455a73b966d/services/worker/Dockerfile#L16-L33
Should we restore this @polinaeterna @albertvillanova?
Hey there, any news on this? let me know if I can help!
Hi, I can restore the build of libsndfile from source, as we did before.
At least until the fixed version is officially distributed by the Linux package managers... See my comment above: https://github.com/huggingface/dataset-viewer/issues/2584#issuecomment-2056921414
After a more detailed investigation and when trying to implement a test, I discovered that the error is not caused by libsndfile1 but by pydub when it calls ffmpeg. See insightful comment by @severo: https://huggingface.co/datasets/parler-tts/mls_eng_10k/discussions/1#65ef6e9d440a5fc3d94a40ad
The reason is:
pydub.exceptions.CouldntDecodeError: Decoding failed. ffmpeg returned error code:
...
Unknown input format: 'opus'
I am opening another PR with a test that raises that error and with a proposed fix. Let's see if that fixes the issue.
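The `Unknown input format: 'opus'` message suggests that the file extension was handed to ffmpeg as the input format name. One possible workaround is to translate the extension into a format name that ffmpeg accepts for input before calling pydub. The sketch below is an assumption about what such a fix could look like, not the content of the PR; `ffmpeg_input_format` and `EXTENSION_TO_FFMPEG_FORMAT` are hypothetical names.

```python
from pathlib import PurePosixPath

# Assumption: ffmpeg has no input format called "opus"; Opus audio is usually
# stored in an Ogg container, which ffmpeg reads with its "ogg" demuxer.
EXTENSION_TO_FFMPEG_FORMAT = {"opus": "ogg"}


def ffmpeg_input_format(filename: str) -> str:
    """Return the format name to pass to ffmpeg for the given file name."""
    extension = PurePosixPath(filename).suffix.lstrip(".").lower()
    if not extension:
        raise ValueError(f"cannot infer an audio format from {filename!r}")
    # Extensions without a special mapping are passed through unchanged.
    return EXTENSION_TO_FFMPEG_FORMAT.get(extension, extension)


if __name__ == "__main__":
    print(ffmpeg_input_format("clip.opus"))
    print(ffmpeg_input_format("clip.mp3"))
```

The returned value could then be passed as the `format` argument of `pydub.AudioSegment.from_file`.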
To get the list of affected datasets in the database:
db.cachedResponsesBlue.aggregate([
  {
    $match: {
      error_code: "RowsPostProcessingError",
      "details.cause_exception": "CouldntDecodeError",
      "details.cause_message": { $regex: "Unknown input format: 'opus'" },
    },
  },
  {
    $group: {
      _id: "$dataset",
    },
  },
]);
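For readers less familiar with the aggregation syntax, the selection logic of the pipeline above can be restated in plain Python. This is only an illustration over in-memory dictionaries, not how the database is actually queried; `affected_datasets` and the sample entries are hypothetical.

```python
import re
from typing import Any, Iterable

OPUS_ERROR = re.compile(r"Unknown input format: 'opus'")


def affected_datasets(entries: Iterable[dict[str, Any]]) -> list[str]:
    """Mimic the $match and $group stages: distinct datasets hit by the opus error."""
    datasets = set()
    for entry in entries:
        details = entry.get("details") or {}
        if (
            entry.get("error_code") == "RowsPostProcessingError"
            and details.get("cause_exception") == "CouldntDecodeError"
            and OPUS_ERROR.search(details.get("cause_message") or "")
        ):
            datasets.add(entry["dataset"])
    # Sorting is an addition for readability; $group does not guarantee an order.
    return sorted(datasets)


if __name__ == "__main__":
    # Hypothetical cache entries, shaped like the fields used in the query.
    sample = [
        {
            "dataset": "user/opus-dataset",
            "error_code": "RowsPostProcessingError",
            "details": {
                "cause_exception": "CouldntDecodeError",
                "cause_message": "Decoding failed. Unknown input format: 'opus'",
            },
        },
        {"dataset": "user/ok-dataset", "error_code": None, "details": {}},
    ]
    print(affected_datasets(sample))
```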
I refreshed all of them
Finally, I refreshed all the entries matching:
{
  error_code: "RowsPostProcessingError",
  "details.cause_exception": "CouldntDecodeError",
}
which is a bit more than 500 datasets. Nearly all the errors have been fixed (though the refresh is still in progress). Some "Invalid data found when processing input" errors remain.