torchaudio.io._compat.load_audio_fileobj returns shorter waveform than torchaudio.load for certain FLAC files
🐛 Describe the bug
When the same audio file is loaded with torchaudio.load and with torchaudio.io._compat.load_audio_fileobj, the resulting waveform tensors have different lengths.
I can reproduce this with one audio file from the LibriSpeech dataset. Note that the issue only occurs with some FLAC files. Here is the code to reproduce it:
import torchaudio
waveform_load, _ = torchaudio.load("LibriSpeech/train-other-500/5350/205002/5350-205002-0014.flac")
print(waveform_load.shape)
with open("LibriSpeech/train-other-500/5350/205002/5350-205002-0014.flac", "rb") as f:
    waveform_file, _ = torchaudio.io._compat.load_audio_fileobj(f)
print(waveform_file.shape)
The output is:
torch.Size([1, 194160])
torch.Size([1, 184320])
Versions
PyTorch version: 1.13.0.dev20220812
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 12.5.1 (x86_64)
GCC version: Could not collect
Clang version: 13.1.6 (clang-1316.0.21.2.5)
CMake version: version 3.22.1
Libc version: N/A
Python version: 3.10.4 (main, Mar 31 2022, 03:38:35) [Clang 12.0.0 ] (64-bit runtime)
Python platform: macOS-10.16-x86_64-i386-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.22.4
[pip3] pytorch-sphinx-theme==0.0.24
[pip3] torch==1.13.0.dev20220812
[pip3] torchaudio==0.13.0a0+a42b266
[pip3] torchroomacoustics==0.0.1
[conda] blas 1.0 mkl
[conda] mkl 2021.2.0 pypi_0 pypi
[conda] numpy 1.22.4 pypi_0 pypi
[conda] pytorch 1.13.0.dev20220718 py3.10_0 pytorch-nightly
[conda] pytorch-sphinx-theme 0.0.24 dev_0
What values do tools like ffprobe and soxi report?
Note that torchaudio.io._compat.load_audio_fileobj is not a publicly documented method, so its behavior is subject to change at any time.
ffprobe shows
Input #0, flac, from '5350-205002-0014.flac':
Duration: 00:00:12.14, start: 0.000000, bitrate: 153 kb/s
Stream #0:0: Audio: flac, 16000 Hz, mono, s16
soxi shows:
Input File : '5350-205002-0014.flac'
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:00:12.13 = 194160 samples ~ 910.125 CDDA sectors
File Size : 233k
Bit Rate : 154k
Sample Encoding: 16-bit FLAC
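As a quick sanity check on the soxi figures above, the sample count, duration, and CDDA sector count are mutually consistent. The snippet below is only a sketch of that arithmetic: the variable names are illustrative, and the 75 sectors-per-second constant is the standard CD audio rate rather than something stated in this thread.

```python
from fractions import Fraction

# Values copied from the soxi output above.
num_samples = 194160
sample_rate = 16000

# Standard CD audio rate (not taken from this thread).
CDDA_SECTORS_PER_SECOND = 75

# Exact rational arithmetic avoids floating-point noise in the intermediate steps.
duration = Fraction(num_samples, sample_rate)
duration_seconds = float(duration)
cdda_sectors = float(duration * CDDA_SECTORS_PER_SECOND)

print(duration_seconds)  # 12.135
print(cdda_sectors)      # 910.125
```

So the 12.13 from soxi and the 12.14 from ffprobe both appear to describe the same 12.135 s, presumably differing only in how the last digit is rounded. Both tools agree with the 194160 samples returned by torchaudio.load.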
Can you have ffprobe report the number of frames, using -count_frames?
Hmm, it doesn't output the exact frame count:
ffprobe version 4.2.2 Copyright (c) 2007-2019 the FFmpeg developers
built with clang version 4.0.1 (tags/RELEASE_401/final)
configuration: --prefix=/miniconda3/envs/torch --cc=x86_64-apple-darwin13.4.0-clang --disable-doc --enable-avresample --enable-gmp --enable-hardcoded-tables --enable-libfreetype --enable-libvpx --enable-pthreads --enable-libopus --enable-postproc --enable-pic --enable-pthreads --enable-shared --enable-static --enable-version3 --enable-zlib --enable-libmp3lame --disable-nonfree --enable-gpl --enable-gnutls --disable-openssl --enable-libopenh264 --enable-libx264
libavutil 56. 31.100 / 56. 31.100
libavcodec 58. 54.100 / 58. 54.100
libavformat 58. 29.100 / 58. 29.100
libavdevice 58. 8.100 / 58. 8.100
libavfilter 7. 57.100 / 7. 57.100
libavresample 4. 0. 0 / 4. 0. 0
libswscale 5. 5.100 / 5. 5.100
libswresample 3. 5.100 / 3. 5.100
libpostproc 55. 5.100 / 55. 5.100
Input #0, flac, from '5350-205002-0014.flac':
Duration: 00:00:12.14, start: 0.000000, bitrate: 153 kb/s
Stream #0:0: Audio: flac, 16000 Hz, mono, s16
Running ffprobe -select_streams a -show_streams 5350-205002-0014.flac gives the following info:
Input #0, flac, from '5350-205002-0014.flac':
Duration: 00:00:12.14, start: 0.000000, bitrate: 153 kb/s
Stream #0:0: Audio: flac, 16000 Hz, mono, s16
[STREAM]
index=0
codec_name=flac
codec_long_name=FLAC (Free Lossless Audio Codec)
profile=unknown
codec_type=audio
codec_time_base=1/16000
codec_tag_string=[0][0][0][0]
codec_tag=0x0000
sample_fmt=s16
sample_rate=16000
channels=1
channel_layout=mono
bits_per_sample=0
id=N/A
r_frame_rate=0/0
avg_frame_rate=0/0
time_base=1/16000
start_pts=0
start_time=0.000000
duration_ts=194160
duration=12.135000
bit_rate=N/A
max_bit_rate=N/A
bits_per_raw_sample=16
nb_frames=N/A
nb_read_frames=N/A
nb_read_packets=N/A
DISPOSITION:default=0
DISPOSITION:dub=0
DISPOSITION:original=0
DISPOSITION:comment=0
DISPOSITION:lyrics=0
DISPOSITION:karaoke=0
DISPOSITION:forced=0
DISPOSITION:hearing_impaired=0
DISPOSITION:visual_impaired=0
DISPOSITION:clean_effects=0
DISPOSITION:attached_pic=0
DISPOSITION:timed_thumbnails=0
[/STREAM]
I don't remember exactly, but IIRC you need to explicitly instruct ffprobe to sum up the number of frames somehow.
Perhaps try something from https://stackoverflow.com/questions/2017843/fetch-frame-count-with-ffmpeg ?
Not sure if I'm using the command correctly, but when I run ffprobe -v error -select_streams a:0 -count_packets -show_entries stream=nb_read_packets -of csv=p=0 5350-205002-0014.flac I get only 48 as the output...
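The 48 may be less surprising than it looks, since -count_packets counts FLAC packets rather than samples. Here is a rough sketch of the arithmetic, assuming (this is not confirmed anywhere in the thread) that the file uses a fixed FLAC block size of 4096 samples per packet:

```python
import math

# Assumption, not confirmed in this thread: fixed FLAC block size in samples.
BLOCK_SIZE = 4096

expected_samples = 194160  # torchaudio.load / soxi / duration_ts
loaded_samples = 184320    # torchaudio.io._compat.load_audio_fileobj

# Number of packets needed to hold the whole file (the last one is partial).
expected_packets = math.ceil(expected_samples / BLOCK_SIZE)

# How many whole blocks the short waveform corresponds to.
loaded_blocks, leftover = divmod(loaded_samples, BLOCK_SIZE)

# Size of the final, partial block and the total number of missing samples.
last_block_samples = expected_samples - (expected_packets - 1) * BLOCK_SIZE
missing_samples = expected_samples - loaded_samples

print(expected_packets)            # 48
print(loaded_blocks, leftover)     # 45 0
print(last_block_samples)          # 1648
print(missing_samples)             # 9840
```

Under that assumption, 48 packets matches the full 194160 samples, and the short waveform is exactly 45 whole blocks. The 9840 missing samples would then be the last three packets: two full blocks of 4096 plus the final partial block of 1648.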
Addressed by #2810. After changing the buffer_size from 4096 to 4100, the shape of the waveform is correct (torch.Size([1, 194160])). I will change it to 8000 in case the buffer is still not large enough.