[v0.12] torchaudio.info reports num_frames=0 for MP3

Open iceychris opened this issue 2 years ago • 8 comments

🐛 Describe the bug

First, download a WAV and an MP3 file:

wget https://filesamples.com/samples/audio/wav/sample3.wav
wget https://filesamples.com/samples/audio/mp3/sample3.mp3

Here is a short repro:

import torchaudio

# try reading number of frames
# wav is fine
num_frames = torchaudio.info("sample3.wav").num_frames
assert num_frames == 4664587, num_frames

# opening mp3 works
audio, sr = torchaudio.load("sample3.mp3")
print(audio.shape, sr)

# fetching mp3 info fails silently
num_frames = torchaudio.info("sample3.mp3").num_frames
assert num_frames > 0, num_frames

Running this results in:

/home/chris/tmp/torchaudio-issue/venv/lib/python3.8/site-packages/torchaudio/compliance/kaldi.py:22: UserWarning: Failed to initialize NumPy: numpy.core.multiarray failed to import (Triggered internally at  ../torch/csrc/utils/tensor_numpy.cpp:68.)
  EPSILON = torch.tensor(torch.finfo(torch.float).eps)
torch.Size([2, 4664587]) 44100
Traceback (most recent call last):
  File "repro.py", line 14, in <module>
    assert num_frames > 0, num_frames
AssertionError: 0

Maybe this is a bug introduced with the recent switch to ffmpeg for mp3 handling.

Workaround

An (arguably bad) workaround is to use torchaudio.load and then calculate the number of frames from the shape of the returned audio tensor.

Sadly this takes a lot longer than using torchaudio.info for me.
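
The workaround above can be sketched roughly as follows. The helper name `count_frames_by_decoding` and its `load` parameter are hypothetical, added only so the sketch can be exercised without an audio file. By default it calls `torchaudio.load`, which returns a `(waveform, sample_rate)` pair with the waveform shaped `(channels, frames)`.

```python
def count_frames_by_decoding(path, load=None):
    """Return (num_frames, sample_rate) by decoding the whole file.

    `load` is a hypothetical hook for testing; it defaults to torchaudio.load.
    """
    if load is None:
        # Imported lazily so the helper itself does not require torchaudio.
        import torchaudio

        load = torchaudio.load
    # With the default channels_first=True, the last dimension is the frame axis.
    waveform, sample_rate = load(path)
    return waveform.shape[-1], sample_rate
```

Because the whole file is decoded, the cost grows with the length of the audio, which matches the slowdown described above.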

Versions

Collecting environment information...
PyTorch version: 1.12.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.8.10 (default, Mar 15 2022, 12:22:08)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.13.0-41-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: NVIDIA GeForce RTX 2080 Ti
GPU 1: NVIDIA GeForce RTX 2080 Ti

Nvidia driver version: 470.103.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] torch==1.12.0
[pip3] torchaudio==0.12.0
[conda] Could not collect

ffmpeg -version

ffmpeg version 4.2.7-0ubuntu0.1 Copyright (c) 2000-2022 the FFmpeg developers
built with gcc 9 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
configuration: --prefix=/usr --extra-version=0ubuntu0.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-avresample --disable-filter=resample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librsvg --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-nvenc --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
libavutil      56. 31.100 / 56. 31.100
libavcodec     58. 54.100 / 58. 54.100
libavformat    58. 29.100 / 58. 29.100
libavdevice    58.  8.100 / 58.  8.100
libavfilter     7. 57.100 /  7. 57.100
libavresample   4.  0.  0 /  4.  0.  0
libswscale      5.  5.100 /  5.  5.100
libswresample   3.  5.100 /  3.  5.100
libpostproc    55.  5.100 / 55.  5.100

iceychris avatar Jun 30 '22 18:06 iceychris

Hi @iceychris

Unfortunately, this is the behavior starting with 0.12.

We swapped the MP3 decoder. As I understand it, the MP3 format does not store the frame count in a header. libsox used to report the number of frames by parsing the entire file, but libavformat does not do this and reports 0. The ffmpeg command takes an extra step to estimate the duration of the input audio, but it does not give the exact number of frames. We cannot be sure that ffmpeg's estimate is correct, which is why I could not port that feature. I understand that this is very bad for the whole user community, but I did not have much choice.

Another possible workaround is for us to export the duration (say, in seconds) and ask you to migrate to StreamReader-based metadata fetching.

I know this kind of BC-breaking change makes for terrible UX, and I wanted to avoid it as much as possible, but there is little I could control on this one.
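
To illustrate why a frame count requires reading the whole file, here is a minimal pure-Python sketch that walks MPEG frame headers one by one. It is not libsox's implementation. It assumes MPEG-1 Layer III only, skips a leading ID3v2 tag, and all function names are hypothetical. The result counts every frame (including any Xing/Info frame) at 1152 samples each and does not remove encoder delay or padding, so it can differ from the decoded length.

```python
# Bitrate table (kbps) and sample rates for MPEG-1 Layer III.
MPEG1_L3_BITRATES_KBPS = [0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320]
MPEG1_SAMPLE_RATES = [44100, 48000, 32000]
SAMPLES_PER_FRAME = 1152  # per channel, MPEG-1 Layer III


def skip_id3v2(data):
    """Return the offset just past a leading ID3v2 tag, or 0 if there is none."""
    if len(data) >= 10 and data[:3] == b"ID3":
        # The tag size is a 28-bit "synchsafe" integer (7 bits per byte).
        size = (data[6] << 21) | (data[7] << 14) | (data[8] << 7) | data[9]
        footer = 10 if data[5] & 0x10 else 0
        return 10 + size + footer
    return 0


def frame_length(header):
    """Return the byte length of an MPEG-1 Layer III frame, or None."""
    b0, b1, b2 = header[0], header[1], header[2]
    if b0 != 0xFF or (b1 & 0xE0) != 0xE0:
        return None  # no 11-bit sync word
    if (b1 >> 3) & 0x3 != 0x3 or (b1 >> 1) & 0x3 != 0x1:
        return None  # not MPEG-1, or not Layer III
    bitrate_idx = b2 >> 4
    rate_idx = (b2 >> 2) & 0x3
    if bitrate_idx in (0, 15) or rate_idx == 3:
        return None  # free-format or reserved values are not handled here
    padding = (b2 >> 1) & 0x1
    bitrate = MPEG1_L3_BITRATES_KBPS[bitrate_idx] * 1000
    return 144 * bitrate // MPEG1_SAMPLE_RATES[rate_idx] + padding


def count_mp3_samples(data):
    """Count frames by hopping from header to header through all of `data`."""
    pos = skip_id3v2(data)
    frames = 0
    while pos + 4 <= len(data):
        length = frame_length(data[pos:pos + 4])
        if length is None:
            pos += 1  # not a frame header: resynchronise byte by byte
            continue
        frames += 1
        pos += length
    return frames * SAMPLES_PER_FRAME
```

Since each frame's length depends on its own header, there is no way to reach the last frame without visiting every one before it, which is the cost libavformat avoids by reporting 0.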

mthrok avatar Jun 30 '22 18:06 mthrok

I just ran into the same issue when testing lhotse against the latest PyTorch in https://github.com/lhotse-speech/lhotse/pull/764. I don't mind handling MP3 with StreamReader as a special case. Would the duration info you mentioned be more reliable than ffmpeg's estimate? And if so, could you show an example of how to fetch it?

pzelasko avatar Jul 06 '22 22:07 pzelasko

Playing devil's advocate for MP3: this has always been an issue. MP3 encoders add different amounts of padding, so it is still partly luck whether the decoder can remove the padding completely. IMO it's better to prepare users with a hint in the documentation.

faroit avatar Jul 07 '22 08:07 faroit

Would it be possible, if the input is an MP3, to have torchaudio.info internally call torchaudio.load on the file to get its metadata?

lukasschmit avatar Jul 09 '22 23:07 lukasschmit

> would it be possible if the input is an mp3 to have torchaudio.info internally call torchaudio.load the file to get its metadata?

I think there should be a warning as well. The typical use of torchaudio in my case (and possibly for many others) is to grab audio lengths, frame counts, and/or sample rates quickly; loading the file will definitely slow that down.
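
A user-side version of this idea might look like the sketch below: try `torchaudio.info` first and only decode the file, with a warning, when the reported frame count is 0. The function name `num_frames_with_fallback` and the `info`/`load` hooks are hypothetical and exist only to make the sketch testable; this is not how torchaudio itself behaves.

```python
import warnings


def num_frames_with_fallback(path, info=None, load=None):
    """Return the frame count, decoding the file only if info() reports 0."""
    if info is None or load is None:
        # Imported lazily; the hooks default to the real torchaudio functions.
        import torchaudio

        info = info or torchaudio.info
        load = load or torchaudio.load

    num_frames = info(path).num_frames
    if num_frames > 0:
        return num_frames  # fast path: metadata was available

    warnings.warn(
        f"info() reported num_frames=0 for {path!r}; "
        "decoding the whole file instead, which is slow.",
        stacklevel=2,
    )
    waveform, _ = load(path)
    return waveform.shape[-1]
```

The warning keeps the slow path visible, so callers who only need quick metadata can decide to skip such files rather than pay for a full decode.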

ZurabDz avatar Jul 10 '22 19:07 ZurabDz

num_frames is also 0 for Opus when the file is loaded through a BytesIO object. To reproduce:

Download

wget https://filesamples.com/samples/audio/opus/sample3.opus

Run

import torchaudio
import io

torchaudio.utils.sox_utils.set_buffer_size(16000)

num_frames = torchaudio.info("sample3.opus").num_frames

num_frames_file_obj = torchaudio.info(io.BytesIO(open("sample3.opus", "rb").read())).num_frames

print(num_frames, num_frames_file_obj)
# 5077102, 0

scarecrow1123 avatar Aug 23 '22 10:08 scarecrow1123

Note that #2740 addresses this by having info load and scan the file object to compute the frame count. It affects only mp3 files and those of formats not handled by sox. If this behavior is not desired, one can use StreamReader directly instead for fetching the info.
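
For readers who prefer to use StreamReader directly, one way to count frames without holding the whole waveform in memory is to stream fixed-size chunks and sum their lengths. This is a sketch, assuming the `torchaudio.io.StreamReader` API of the 0.12 era (`add_basic_audio_stream` and `stream`); the function name and the `reader_factory` hook are hypothetical, and later torchaudio releases may differ.

```python
def count_frames_streaming(src, frames_per_chunk=4096, reader_factory=None):
    """Count audio frames by decoding `src` chunk by chunk."""
    if reader_factory is None:
        # Imported lazily; requires a torchaudio build with FFmpeg support.
        from torchaudio.io import StreamReader

        reader_factory = StreamReader

    reader = reader_factory(src)
    # One output stream, kept at the source sample rate.
    reader.add_basic_audio_stream(frames_per_chunk=frames_per_chunk)

    total = 0
    for (chunk,) in reader.stream():
        if chunk is not None:
            total += chunk.shape[0]  # audio chunks are shaped (frames, channels)
    return total
```

This still decodes the entire file, so it is no faster than loading it, but peak memory stays bounded by the chunk size.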

hwangjeff avatar Oct 07 '22 17:10 hwangjeff

I wrote an audio file in GSM format using:

    torchaudio.backend.sox_io_backend.save(
        filepath="audio.gsm",
        src=signal,
        sample_rate=8000,
        channels_first=True,
        format="gsm",
    )

Both torchaudio.info and StreamReader.get_src_stream_info return num_frames=0.

DanTremonti avatar Jul 21 '23 09:07 DanTremonti