[v0.12] torchaudio.info reports num_frames=0 for MP3
🐛 Describe the bug
First, download a WAV and an MP3 file:
wget https://filesamples.com/samples/audio/wav/sample3.wav
wget https://filesamples.com/samples/audio/mp3/sample3.mp3
Here is a short repro:
import torchaudio
# try reading number of frames
# wav is fine
num_frames = torchaudio.info("sample3.wav").num_frames
assert num_frames == 4664587, num_frames
# opening mp3 works
audio, sr = torchaudio.load("sample3.mp3")
print(audio.shape, sr)
# fetching mp3 info fails silently
num_frames = torchaudio.info("sample3.mp3").num_frames
assert num_frames > 0, num_frames
Running this results in:
/home/chris/tmp/torchaudio-issue/venv/lib/python3.8/site-packages/torchaudio/compliance/kaldi.py:22: UserWarning: Failed to initialize NumPy: numpy.core.multiarray failed to import (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:68.)
EPSILON = torch.tensor(torch.finfo(torch.float).eps)
torch.Size([2, 4664587]) 44100
Traceback (most recent call last):
File "repro.py", line 14, in <module>
assert num_frames > 0, num_frames
AssertionError: 0
Maybe this is a bug introduced with the recent switch to ffmpeg for mp3 handling.
Workaround
An (arguably bad) workaround is to use torchaudio.load and then calculate the number of frames from the shape of the audio tensor. Sadly, for me this takes a lot longer than using torchaudio.info.
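The workaround above could be wrapped so that the slow path only runs when the metadata is unusable. This is a minimal sketch, not part of torchaudio: the helper name `count_frames` and its `info_fn` / `load_fn` parameters are hypothetical, and it assumes `torchaudio.load` returns a tensor shaped (channels, frames), as in the repro output above.

```python
from typing import Callable, Optional


def count_frames(path: str,
                 info_fn: Optional[Callable] = None,
                 load_fn: Optional[Callable] = None) -> int:
    """Return the frame count, decoding the file only if metadata reports 0.

    info_fn / load_fn are hypothetical hooks that default to
    torchaudio.info and torchaudio.load.
    """
    if info_fn is None or load_fn is None:
        # Imported lazily so the helper can also be exercised with stubs.
        import torchaudio
        info_fn = info_fn or torchaudio.info
        load_fn = load_fn or torchaudio.load

    num_frames = info_fn(path).num_frames
    if num_frames > 0:
        # Fast path: the metadata was usable (e.g. WAV).
        return num_frames

    # Slow path: decode the whole file and read the length off the tensor.
    waveform, _sample_rate = load_fn(path)
    return waveform.shape[-1]
```

With torchaudio installed, `count_frames("sample3.mp3")` would fall through to the slow path on 0.12, so the cost of a full decode is still paid for every MP3.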
Versions
Collecting environment information...
PyTorch version: 1.12.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31
Python version: 3.8.10 (default, Mar 15 2022, 12:22:08) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.13.0-41-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 2080 Ti
GPU 1: NVIDIA GeForce RTX 2080 Ti
Nvidia driver version: 470.103.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] torch==1.12.0
[pip3] torchaudio==0.12.0
[conda] Could not collect
ffmpeg -version
ffmpeg version 4.2.7-0ubuntu0.1 Copyright (c) 2000-2022 the FFmpeg developers
built with gcc 9 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
configuration: --prefix=/usr --extra-version=0ubuntu0.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-avresample --disable-filter=resample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librsvg --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-nvenc --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
libavutil 56. 31.100 / 56. 31.100
libavcodec 58. 54.100 / 58. 54.100
libavformat 58. 29.100 / 58. 29.100
libavdevice 58. 8.100 / 58. 8.100
libavfilter 7. 57.100 / 7. 57.100
libavresample 4. 0. 0 / 4. 0. 0
libswscale 5. 5.100 / 5. 5.100
libswresample 3. 5.100 / 3. 5.100
libpostproc 55. 5.100 / 55. 5.100
Hi @iceychris
Unfortunately, this is the expected behavior starting with 0.12.
We swapped the MP3 decoder. As I understand it, an MP3 file does not carry this information in its header. libsox used to report the number of frames by parsing the entire file, but libavformat does not do this and reports 0. The ffmpeg command takes an extra step to estimate the duration of the input audio, but it does not give the exact number of frames. We cannot be sure that ffmpeg's estimate is correct, which is why I could not port this feature. I understand that this is very bad for the entire user community, but I did not have much choice.
Another possible workaround is for us to export the duration (say, in seconds) and ask you to migrate to StreamReader-based metadata fetching.
I know this kind of BC-breaking change causes terrible UX, and I wanted to avoid it as much as possible, but there was little I could control on this one.
I just ran into the same issue when testing lhotse against the latest PyTorch in https://github.com/lhotse-speech/lhotse/pull/764. I don't mind handling MP3 with StreamReader as a special case. Would the duration info you mentioned be more reliable than ffmpeg estimation? And if yes, could you show an example how to fetch it?
Playing devil's advocate for MP3: this has always been an issue with MP3 encoders. They add different amounts of padding, so it is still partly luck whether the decoder can remove the padding completely. IMO it's better to prepare users with a hint in the documentation.
Would it be possible, if the input is an MP3, to have torchaudio.info internally call torchaudio.load on the file to get its metadata?
I think there should be a warning as well. The typical use of torchaudio in my case (and possibly for many others) is to grab audio lengths, the number of frames, and/or the sample rate quickly; loading the file will definitely slow that down.
num_frames is also 0 for an Opus file loaded through BytesIO. To reproduce:
Download
wget https://filesamples.com/samples/audio/opus/sample3.opus
Run
import torchaudio
import io
torchaudio.utils.sox_utils.set_buffer_size(16000)
num_frames = torchaudio.info("sample3.opus").num_frames
num_frames_file_obj = torchaudio.info(io.BytesIO(open("sample3.opus", "rb").read())).num_frames
print(num_frames, num_frames_file_obj)
# 5077102, 0
Note that #2740 addresses this by having info load and scan the file object to compute the frame count. It affects only MP3 files and formats not handled by sox. If this behavior is not desired, one can use StreamReader directly to fetch the info instead.
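For readers who go the StreamReader route, frames can also be counted by decoding in chunks, so the whole waveform never has to sit in memory. This is a rough sketch under assumptions: the helper name `stream_frame_count` is hypothetical, `reader` is expected to behave like `torchaudio.io.StreamReader` from the 0.12-era API (`add_basic_audio_stream`, `stream`), and each chunk is assumed to be shaped (frames, channels). It still decodes the whole file, so it is not a fast metadata lookup.

```python
def stream_frame_count(reader, frames_per_chunk: int = 4096) -> int:
    """Count audio frames by decoding chunk by chunk.

    `reader` is assumed to behave like torchaudio.io.StreamReader.
    """
    # Register one audio output stream; no resampling is requested.
    reader.add_basic_audio_stream(frames_per_chunk=frames_per_chunk)

    total = 0
    for (chunk,) in reader.stream():
        # Assumed chunk layout: (frames, channels). The last chunk may be short.
        total += chunk.shape[0]
    return total
```

With torchaudio installed, this would be called as `stream_frame_count(torchaudio.io.StreamReader("sample3.mp3"))`. Whether the result matches `torchaudio.load` exactly depends on how the decoder handles encoder padding, so it is worth checking against a known file first.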
I wrote an audio file in GSM format using
torchaudio.backend.sox_io_backend.save(
filepath="audio.gsm",
src=signal,
sample_rate=8000,
channels_first=True,
format="gsm",
)
Both torchaudio.info and StreamReader.get_src_stream_info return num_frames=0.