New `ffmpeg` backend changes samples when saving WAVE
🐛 Describe the bug
A snippet to reproduce the error is provided below. Passing backend="sox" or backend="soundfile" to torchaudio.save makes the issue go away.
import os
from tempfile import NamedTemporaryFile
os.environ["TORCHAUDIO_USE_BACKEND_DISPATCHER"] = "1"
import torch
import torchaudio
torch.manual_seed(0)
noise = torch.rand(1, 32000, dtype=torch.float32)
with NamedTemporaryFile(suffix=".wav") as f:
    torchaudio.save(f.name, noise, sample_rate=16000)
    f.flush()
    f.seek(0)
    noise_load, _ = torchaudio.load(f)
torch.testing.assert_close(noise_load, noise)
Output:
Traceback (most recent call last):
  File "/Users/pzelasko/Library/Application Support/JetBrains/PyCharm2023.1/scratches/scratch_12.py", line 19, in <module>
    torch.testing.assert_close(noise_load, noise)
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1511, in assert_close
    raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!
Mismatched elements: 9760 / 32000 (30.5%)
Greatest absolute difference: 1.52587890625e-05 at index (0, 134) (up to 1e-05 allowed)
Greatest relative difference: 1.0 at index (0, 24308) (up to 1.3e-06 allowed)
Versions
Collecting environment information...
PyTorch version: 2.0.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 13.3.1 (arm64)
GCC version: Could not collect
Clang version: 14.0.3 (clang-1403.0.22.14.1)
CMake version: version 3.25.0
Libc version: N/A
Python version: 3.10.4 (main, Mar 31 2022, 03:37:37) [Clang 12.0.0 ] (64-bit runtime)
Python platform: macOS-13.3.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU: Apple M1 Max
Versions of relevant libraries:
[pip3] flake8==5.0.4
[pip3] k2==1.23.4.dev20230412+cpu.torch2.0.0
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.5
[pip3] torch==2.0.0
[pip3] torchaudio==2.0.0
[pip3] torchvision==0.15.0
[conda] k2 1.23.4.dev20230412+cpu.torch2.0.0 pypi_0 pypi
[conda] numpy 1.23.5 py310hb93e574_0
[conda] numpy-base 1.23.5 py310haf87e8b_0
[conda] pytorch 2.0.0 py3.10_0 pytorch
[conda] torch 1.12.1 pypi_0 pypi
[conda] torchaudio 2.0.0 py310_cpu pytorch
[conda] torchvision 0.15.0 py310_cpu pytorch
I think what is happening is that for WAV, ffmpeg defaults to int16, so the samples are quantized on save and the test sees a discrepancy, but that discrepancy is at most on the order of 1e-5.
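To make the size of that discrepancy concrete, here is a minimal sketch of an int16 round trip in plain Python. It does not call torchaudio or FFmpeg. It assumes the conversion scales by 32768 and rounds to the nearest integer, which may differ from the real encoder in rounding or clipping details. The names `quantize_int16` and `dequantize_int16` are hypothetical helpers introduced only for illustration.

```python
import random

INT16_SCALE = 32768  # 2**15; assumed float <-> int16 scale factor


def quantize_int16(x: float) -> int:
    """Round a float sample to the nearest int16 code, clipping to the valid range."""
    code = round(x * INT16_SCALE)
    return max(-32768, min(32767, code))


def dequantize_int16(code: int) -> float:
    """Map an int16 code back to a float sample."""
    return code / INT16_SCALE


random.seed(0)
# Uniform noise kept slightly below full scale so clipping never triggers.
samples = [random.random() * 0.999 for _ in range(32000)]
round_trip = [dequantize_int16(quantize_int16(s)) for s in samples]

max_abs_err = max(abs(a - b) for a, b in zip(samples, round_trip))
half_step = 2.0 ** -16  # half of one int16 quantization step (1 / 32768)

print(f"max abs error: {max_abs_err:.6e}")
print(f"half step:     {half_step:.6e}")
```

Under these assumptions the error is bounded by half a quantization step, 2**-16 = 1.52587890625e-05, which is the greatest absolute difference in the traceback above and sits just over the 1e-05 absolute tolerance that assert_close reported.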
This is due to how the underlying implementation, StreamReader, works. It picks the default precision for the format, and that default is governed by FFmpeg's own mechanism.
It is possible to make the behavior match the previous backends, but I think there was user feedback that int16 is better, since that is what the vast majority of audio systems expect and many do not understand other precisions.
One reason the existing backends picked the matching precision is to preserve the data exactly as precisely as the model returned it, for the sake of scientific computation.
What do you think? @pzelasko @hwangjeff
Good insight! I was able to validate that you're right by replacing noise generation like this:
INT16MAX = 32768
noise = torch.randint(-INT16MAX, INT16MAX - 1, (1, 32000))
noise = noise / INT16MAX
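The same idea can be sketched without torchaudio, using Python's standard `wave` module as a stand-in for a 16-bit PCM WAV writer. This is an illustration under assumptions, not the FFmpeg code path: samples that already lie on the int16 grid survive a save and load unchanged. Unlike the snippet above, `random.randint` includes its upper bound, so the full int16 range is covered here.

```python
import array
import io
import random
import wave

INT16_SCALE = 32768  # 2**15

random.seed(0)
# Samples that are exact multiples of 1/32768, i.e. already on the int16 grid.
codes = [random.randint(-INT16_SCALE, INT16_SCALE - 1) for _ in range(32000)]
grid_samples = [c / INT16_SCALE for c in codes]

# Write a mono, 16-bit, 16 kHz WAV into an in-memory buffer.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as writer:
    writer.setnchannels(1)
    writer.setsampwidth(2)  # 2 bytes per sample = int16
    writer.setframerate(16000)
    pcm = array.array("h", (round(s * INT16_SCALE) for s in grid_samples))
    writer.writeframes(pcm.tobytes())

# Read the frames back and convert them to floats again.
buffer.seek(0)
with wave.open(buffer, "rb") as reader:
    raw = reader.readframes(reader.getnframes())
loaded = array.array("h")
loaded.frombytes(raw)
loaded_samples = [c / INT16_SCALE for c in loaded]

print(loaded_samples == grid_samples)
```

Because every value is a multiple of a power of two, scaling by 32768 in either direction is exact in floating point, so the comparison can use strict equality rather than a tolerance.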
I think it makes sense: it's the most common format, and people rarely need actual float32 precision when saving files. I only found out because some of Lhotse's unit tests for correct save->load behavior failed when moving to ffmpeg, but they used artificial data anyway.
In that case you might want to update the documentation here:
https://github.com/pytorch/audio/blob/151ac4d85007c7e14d9ece9023a2c2b4b0cc6a40/torchaudio/_backend/utils.py#L533-L550