audio torchaudio.io._compat.load_audio_fileobj returns shorter waveform than torchaudio.load for certain FLAC files

🐛 Describe the bug

When using the two loading methods on the same audio file, the lengths of the waveform tensors are different.

I can reproduce this issue with one audio file from the LibriSpeech dataset. Note that the issue only occurs with some FLAC files. Here is the code to reproduce it:

import torchaudio
# Load by path
waveform_load, _ = torchaudio.load("LibriSpeech/train-other-500/5350/205002/5350-205002-0014.flac")
print(waveform_load.shape)
# Load the same file through a file object
with open("LibriSpeech/train-other-500/5350/205002/5350-205002-0014.flac", "rb") as f:
    waveform_file, _ = torchaudio.io._compat.load_audio_fileobj(f)
print(waveform_file.shape)

The output is:

torch.Size([1, 194160])
torch.Size([1, 184320])
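
For scale, the gap between the two shapes can be expressed in samples and in seconds. This is a small sketch, assuming the 16000 Hz sample rate that ffprobe and soxi report for this file later in the thread. The helper name `length_gap` is hypothetical and not part of torchaudio.

```python
def length_gap(full_len: int, short_len: int, sample_rate: int) -> tuple[int, float]:
    """Return (missing samples, missing seconds) between two waveform lengths."""
    missing = full_len - short_len
    return missing, missing / sample_rate


# Lengths taken from the two shapes printed above; 16000 Hz is assumed
# from the ffprobe/soxi reports for the same file.
missing_samples, missing_seconds = length_gap(194160, 184320, 16000)
print(missing_samples, missing_seconds)
```

So the file-object path drops 9840 samples, a bit more than half a second of audio at the end of the file.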

Versions

PyTorch version: 1.13.0.dev20220812
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 12.5.1 (x86_64)
GCC version: Could not collect
Clang version: 13.1.6 (clang-1316.0.21.2.5)
CMake version: version 3.22.1
Libc version: N/A

Python version: 3.10.4 (main, Mar 31 2022, 03:38:35) [Clang 12.0.0 ] (64-bit runtime)
Python platform: macOS-10.16-x86_64-i386-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.22.4
[pip3] pytorch-sphinx-theme==0.0.24
[pip3] torch==1.13.0.dev20220812
[pip3] torchaudio==0.13.0a0+a42b266
[pip3] torchroomacoustics==0.0.1
[conda] blas 1.0 mkl
[conda] mkl 2021.2.0 pypi_0 pypi
[conda] numpy 1.22.4 pypi_0 pypi
[conda] pytorch 1.13.0.dev20220718 py3.10_0 pytorch-nightly
[conda] pytorch-sphinx-theme 0.0.24 dev_0
[conda] torch 1.13.0.dev20220812 pypi_0 pypi
[conda] torchaudio 0.13.0a0+a42b266 dev_0
[conda] torchroomacoustics 0.0.1 pypi_0 pypi

Sep 16 '22 12:09 nateanl

What values do tools like ffprobe and soxi report?

Sep 16 '22 12:09 mthrok

Note that torchaudio.io._compat.load_audio_fileobj is not a publicly documented method, so its behavior is subject to change at any time.

Sep 16 '22 12:09 mthrok

ffprobe shows:

Input #0, flac, from '5350-205002-0014.flac':
  Duration: 00:00:12.14, start: 0.000000, bitrate: 153 kb/s
    Stream #0:0: Audio: flac, 16000 Hz, mono, s16

soxi shows:

Input File     : '5350-205002-0014.flac'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:00:12.13 = 194160 samples ~ 910.125 CDDA sectors
File Size      : 233k
Bit Rate       : 154k
Sample Encoding: 16-bit FLAC

Sep 16 '22 12:09 nateanl

Can you have ffprobe report the number of frames? -count_frames

Sep 16 '22 13:09 mthrok

Hmm, it doesn't output the exact frame count:

ffprobe version 4.2.2 Copyright (c) 2007-2019 the FFmpeg developers
  built with clang version 4.0.1 (tags/RELEASE_401/final)
  configuration: --prefix=/miniconda3/envs/torch --cc=x86_64-apple-darwin13.4.0-clang --disable-doc --enable-avresample --enable-gmp --enable-hardcoded-tables --enable-libfreetype --enable-libvpx --enable-pthreads --enable-libopus --enable-postproc --enable-pic --enable-pthreads --enable-shared --enable-static --enable-version3 --enable-zlib --enable-libmp3lame --disable-nonfree --enable-gpl --enable-gnutls --disable-openssl --enable-libopenh264 --enable-libx264
  libavutil      56. 31.100 / 56. 31.100
  libavcodec     58. 54.100 / 58. 54.100
  libavformat    58. 29.100 / 58. 29.100
  libavdevice    58.  8.100 / 58.  8.100
  libavfilter     7. 57.100 /  7. 57.100
  libavresample   4.  0.  0 /  4.  0.  0
  libswscale      5.  5.100 /  5.  5.100
  libswresample   3.  5.100 /  3.  5.100
  libpostproc    55.  5.100 / 55.  5.100
Input #0, flac, from '5350-205002-0014.flac':
  Duration: 00:00:12.14, start: 0.000000, bitrate: 153 kb/s
    Stream #0:0: Audio: flac, 16000 Hz, mono, s16

Sep 16 '22 13:09 nateanl

Running the command ffprobe -select_streams a -show_streams 5350-205002-0014.flac gives the following info:

Input #0, flac, from '5350-205002-0014.flac':
  Duration: 00:00:12.14, start: 0.000000, bitrate: 153 kb/s
    Stream #0:0: Audio: flac, 16000 Hz, mono, s16
[STREAM]
index=0
codec_name=flac
codec_long_name=FLAC (Free Lossless Audio Codec)
profile=unknown
codec_type=audio
codec_time_base=1/16000
codec_tag_string=[0][0][0][0]
codec_tag=0x0000
sample_fmt=s16
sample_rate=16000
channels=1
channel_layout=mono
bits_per_sample=0
id=N/A
r_frame_rate=0/0
avg_frame_rate=0/0
time_base=1/16000
start_pts=0
start_time=0.000000
duration_ts=194160
duration=12.135000
bit_rate=N/A
max_bit_rate=N/A
bits_per_raw_sample=16
nb_frames=N/A
nb_read_frames=N/A
nb_read_packets=N/A
DISPOSITION:default=0
DISPOSITION:dub=0
DISPOSITION:original=0
DISPOSITION:comment=0
DISPOSITION:lyrics=0
DISPOSITION:karaoke=0
DISPOSITION:forced=0
DISPOSITION:hearing_impaired=0
DISPOSITION:visual_impaired=0
DISPOSITION:clean_effects=0
DISPOSITION:attached_pic=0
DISPOSITION:timed_thumbnails=0
[/STREAM]
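
Although nb_frames is N/A, the duration_ts and time_base fields in the output above already imply a sample count. Here is a minimal sketch of that arithmetic, with the values copied from the stream dump. The variable names are only illustrative.

```python
from fractions import Fraction

# Values copied from the ffprobe -show_streams output above.
duration_ts = 194160             # duration in time_base units
time_base = Fraction(1, 16000)   # seconds per unit
sample_rate = 16000              # Hz

# Exact rational arithmetic avoids float rounding in the comparison.
duration_seconds = duration_ts * time_base
num_samples = duration_seconds * sample_rate

print(float(duration_seconds), int(num_samples))
```

This agrees with the 194160 samples reported by soxi and with the length returned by torchaudio.load.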

Sep 18 '22 01:09 nateanl

I don't remember exactly, but IIRC you need to explicitly instruct ffprobe to sum up the number of frames somehow.

Perhaps try something from https://stackoverflow.com/questions/2017843/fetch-frame-count-with-ffmpeg ?

Sep 20 '22 13:09 mthrok

Not sure if I used the command correctly, but when I run ffprobe -v error -select_streams a:0 -count_packets -show_entries stream=nb_read_packets -of csv=p=0 5350-205002-0014.flac I get only 48 as the output...
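
One possible reading of the 48 is that it counts FLAC packets (blocks of samples) rather than individual samples. Here is a quick sanity check, assuming a fixed block size of 4096 samples per packet. The file's actual block size was not inspected in this thread, so that value is only an assumption.

```python
import math

ASSUMED_BLOCK_SIZE = 4096  # assumption: not read from the file itself

expected_samples = 194160  # from torchaudio.load / soxi
observed_samples = 184320  # from load_audio_fileobj

# Packets needed to hold the full file if every packet but the last is full.
packets_needed = math.ceil(expected_samples / ASSUMED_BLOCK_SIZE)

# How many whole blocks the shorter result corresponds to.
full_blocks, remainder = divmod(observed_samples, ASSUMED_BLOCK_SIZE)

print(packets_needed, full_blocks, remainder)
```

Under that assumption, 48 packets is consistent with 194160 samples, and the shorter 184320-sample result is exactly 45 full blocks, which would mean the last three packets were not decoded.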

Oct 24 '22 16:10 nateanl

Addressed by #2810. After changing the buffer_size from 4096 to 4100, the shape of the waveform is correct (torch.Size([1, 194160])). I will change it to 8000 in case the buffer is not large enough.

Nov 18 '22 21:11 nateanl