Loading audio files from archives
🐛 Describe the bug
I've been playing around with torchdata as a replacement for the webdataset library. My main use-case is reading data from network-attached file systems (such as ceph), which implies streaming from e.g. .tar files, which is something webdataset is designed for.
The code below uses the following relative file layout (data.zip):
├── file
│   ├── 19-198-0000.flac
│   └── 19-198-0000.wav
├── tar
│   ├── flac.tar
│   └── wav.tar
└── zip
    ├── flac.zip
    └── wav.zip
Where each .zip or .tar archive contains respectively the 19-198-0000.flac or 19-198-0000.wav file taken from the LibriSpeech dataset.
From my reading of the documentation, this seems to be the easiest way to read from the archive:
import torchaudio.backend.sox_io_backend as tab
from torchdata.datapipes.iter import (
    FileLister,
    FileOpener,
    TarArchiveLoader,
    ZipArchiveLoader,
    Mapper,
)

def audio_stream_to_tensor(element):
    path, stream = element
    audio_tensor, sample_rate = tab.load(stream)
    return audio_tensor

dp = FileLister(".", masks=["wav.tar"], recursive=True)
dp = FileOpener(dp, mode="b")
dp = TarArchiveLoader(dp, mode="r")
dp = Mapper(dp, audio_stream_to_tensor)
for x in dp:
    print(x)  # tensor([[0.0044, 0.0033, 0.0031, ..., 0.0047, 0.0060, 0.0060]])
This works :)! However, it fails when we try to read the flac.tar
dp = FileLister(".", masks=["flac.tar"], recursive=True)
dp = FileOpener(dp, mode="b")
dp = TarArchiveLoader(dp, mode="r")
dp = Mapper(dp, audio_stream_to_tensor)
for x in dp:
    print(x)
formats: can't open input file `': FLAC ERROR whilst decoding metadata
Traceback (most recent call last):
File "/home/nik/phd/repo/librispeech/playground/example.py", line 35, in <module>
for x in dp:
File "/home/nik/phd/repo/librispeech/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/_typing.py", line 514, in wrap_generator
response = gen.send(None)
File "/home/nik/phd/repo/librispeech/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 116, in __iter__
yield self._apply_fn(data)
File "/home/nik/phd/repo/librispeech/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 81, in _apply_fn
return self.fn(data)
File "/home/nik/phd/repo/librispeech/playground/example.py", line 15, in audio_stream_to_tensor
audio_tensor, sample_rate = tab.load(stream)
File "/home/nik/phd/repo/librispeech/.venv/lib/python3.10/site-packages/torchaudio/backend/sox_io_backend.py", line 220, in load
return _fallback_load_fileobj(filepath, frame_offset, num_frames, normalize, channels_first, format)
File "/home/nik/phd/repo/librispeech/.venv/lib/python3.10/site-packages/torchaudio/io/_compat.py", line 109, in load_audio_fileobj
s = torchaudio._torchaudio_ffmpeg.StreamReaderFileObj(src, format, None, 4096)
RuntimeError: Failed to open the input "StreamWrapper<<ExFileObject name='./tar/flac.tar'>>" (Invalid data found when processing input).
This exception is thrown by __iter__ of MapperIterDataPipe(datapipe=TarArchiveLoaderIterDataPipe, fn=audio_stream_to_tensor, input_col=None, output_col=None)
Similarly for ZipArchiveLoader, reading from wav.zip works, while flac.zip returns a similar error:
formats: can't open input file `': FLAC ERROR whilst decoding metadata
Traceback (most recent call last):
File "/home/nik/phd/repo/librispeech/playground/example.py", line 35, in <module>
for x in dp:
File "/home/nik/phd/repo/librispeech/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/_typing.py", line 514, in wrap_generator
response = gen.send(None)
File "/home/nik/phd/repo/librispeech/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 116, in __iter__
yield self._apply_fn(data)
File "/home/nik/phd/repo/librispeech/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 81, in _apply_fn
return self.fn(data)
File "/home/nik/phd/repo/librispeech/playground/example.py", line 15, in audio_stream_to_tensor
audio_tensor, sample_rate = tab.load(stream)
File "/home/nik/phd/repo/librispeech/.venv/lib/python3.10/site-packages/torchaudio/backend/sox_io_backend.py", line 220, in load
return _fallback_load_fileobj(filepath, frame_offset, num_frames, normalize, channels_first, format)
File "/home/nik/phd/repo/librispeech/.venv/lib/python3.10/site-packages/torchaudio/io/_compat.py", line 109, in load_audio_fileobj
s = torchaudio._torchaudio_ffmpeg.StreamReaderFileObj(src, format, None, 4096)
RuntimeError: Failed to open the input "StreamWrapper<<zipfile.ZipExtFile name='19-198-0000.flac' mode='r' compress_type=deflate>>" (Invalid data found when processing input).
This exception is thrown by __iter__ of MapperIterDataPipe(datapipe=ZipArchiveLoaderIterDataPipe, fn=audio_stream_to_tensor, input_col=None, output_col=None)
Moreover, adding torchaudio.info to the map function also leads to the same issue for .wav files:
def audio_stream_to_tensor_and_meta(element):
    path, stream = element
    meta = tab.info(stream)
    print(meta)
    audio_tensor, sample_rate = tab.load(stream)
    return audio_tensor, meta

dp = FileLister(".", masks=["wav.tar"], recursive=True)
dp = FileOpener(dp, mode="b")
dp = TarArchiveLoader(dp, mode="r")
dp = Mapper(dp, audio_stream_to_tensor_and_meta)
for x in dp:
    print(x)
AudioMetaData(sample_rate=16000, num_frames=31440, num_channels=1, bits_per_sample=16, encoding=PCM_S)
formats: can't determine type of file `'
Traceback (most recent call last):
File "/home/nik/phd/repo/librispeech/playground/example.py", line 36, in <module>
for x in dp:
File "/home/nik/phd/repo/librispeech/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/_typing.py", line 514, in wrap_generator
response = gen.send(None)
File "/home/nik/phd/repo/librispeech/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 116, in __iter__
yield self._apply_fn(data)
File "/home/nik/phd/repo/librispeech/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 81, in _apply_fn
return self.fn(data)
File "/home/nik/phd/repo/librispeech/playground/example.py", line 25, in audio_stream_to_tensor_and_meta
audio_tensor, sample_rate = tab.load(stream)
File "/home/nik/phd/repo/librispeech/.venv/lib/python3.10/site-packages/torchaudio/backend/sox_io_backend.py", line 220, in load
return _fallback_load_fileobj(filepath, frame_offset, num_frames, normalize, channels_first, format)
File "/home/nik/phd/repo/librispeech/.venv/lib/python3.10/site-packages/torchaudio/io/_compat.py", line 109, in load_audio_fileobj
s = torchaudio._torchaudio_ffmpeg.StreamReaderFileObj(src, format, None, 4096)
RuntimeError: Failed to open the input "StreamWrapper<<ExFileObject name='./tar/wav.tar'>>" (Invalid data found when processing input).
This exception is thrown by __iter__ of MapperIterDataPipe(datapipe=TarArchiveLoaderIterDataPipe, fn=audio_stream_to_tensor_and_meta, input_col=None, output_col=None)
So I assume that the issues stem from the fact that the stream provided by torchdata is not seekable, or at least the buffer is not large enough?
Versions
PyTorch version: 1.12.1+cu102 Is debug build: False CUDA used to build PyTorch: 10.2 ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.5 LTS (x86_64) GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 Clang version: Could not collect CMake version: version 3.16.3 Libc version: glibc-2.31
Python version: 3.10.4 (main, Apr 20 2022, 11:26:44) [GCC 9.4.0] (64-bit runtime) Python platform: Linux-5.15.0-46-generic-x86_64-with-glibc2.31 Is CUDA available: True CUDA runtime version: 11.5.119 GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3070 Nvidia driver version: 495.29.05 cuDNN version: Could not collect HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True
Versions of relevant libraries: [pip3] mypy-extensions==0.4.3 [pip3] numpy==1.23.2 [pip3] torch==1.12.1 [pip3] torchaudio==0.12.1 [pip3] torchdata==0.4.1 [conda] Could not collect
I've also tried the soundfile backend. Soundfile can read the .flac file correctly from the stream, but it fails when we call info() on the stream before load().
RuntimeError: Failed to open the input "StreamWrapper<<zipfile.ZipExtFile name='19-198-0000.flac' mode='r' compress_type=deflate>>" (Invalid data found when processing input).
Based on the traceback, I think this comes down to what input type torchaudio expects. It would help us to understand how tab.load works: does it support loading inner file streams from a tar archive? cc: @mthrok
Regarding your comment about seekability: the tar file stream, at least, should be seekable, so I assume that is not the root cause.
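The seekability question can be checked with the standard library alone. The sketch below is not taken from the thread: it builds a tiny tar and zip archive in memory (the payload is placeholder bytes, not real audio) and asks the member streams whether they are seekable. It inspects only the raw `tarfile` and `zipfile` objects; whether the `StreamWrapper` that torchdata puts around them behaves identically is an assumption that would need checking separately.

```python
import io
import tarfile
import zipfile

# Placeholder payload: the FLAC magic marker followed by zero bytes (not real audio).
payload = b"fLaC" + bytes(64)

# Build a tiny tar archive in memory.
tar_buf = io.BytesIO()
with tarfile.open(fileobj=tar_buf, mode="w") as tf:
    info = tarfile.TarInfo(name="19-198-0000.flac")
    info.size = len(payload)
    tf.addfile(info, io.BytesIO(payload))
tar_buf.seek(0)

# Build a tiny deflate-compressed zip archive in memory.
zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("19-198-0000.flac", payload)
zip_buf.seek(0)

# Open the tar member, consume some bytes, rewind, and read the header again.
with tarfile.open(fileobj=tar_buf, mode="r") as tf:
    member = tf.extractfile("19-198-0000.flac")
    tar_seekable = member.seekable()
    member.read(8)
    member.seek(0)
    tar_rewound = member.read(4)

# Same check for the zip member.
with zipfile.ZipFile(zip_buf) as zf:
    with zf.open("19-198-0000.flac") as member:
        zip_seekable = member.seekable()
        member.read(8)
        member.seek(0)
        zip_rewound = member.read(4)

print("tar:", tar_seekable, tar_rewound)
print("zip:", zip_seekable, zip_rewound)
```

If both streams report as seekable and rewind correctly, a missing `seek` capability is unlikely to explain the errors by itself.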
As a workaround, could you read data from the opened file stream directly before sending to tab.load?
def audio_stream_to_tensor_and_meta(element):
    path, stream = element
    data = b"".join(stream)
    meta = tab.info(data)
    audio_tensor, sample_rate = tab.load(data)
    return audio_tensor, meta
Thanks for your comment.
As a workaround, could you read data from the opened file stream directly before sending to tab.load?
Your code sample throws the following errors:
(for wav)
Traceback (most recent call last):
File "/home/nik/phd/repo/data_utility/playground/example.py", line 39, in <module>
for x in dp:
File "/home/nik/phd/repo/data_utility/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/_typing.py", line 514, in wrap_generator
response = gen.send(None)
File "/home/nik/phd/repo/data_utility/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 116, in __iter__
yield self._apply_fn(data)
File "/home/nik/phd/repo/data_utility/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 81, in _apply_fn
return self.fn(data)
File "/home/nik/phd/repo/data_utility/playground/example.py", line 17, in audio_stream_to_tensor
audio_tensor, sample_rate = tab.load(data)
File "/home/nik/phd/repo/data_utility/.venv/lib/python3.10/site-packages/torchaudio/backend/sox_io_backend.py", line 227, in load
return _fallback_load(filepath, frame_offset, num_frames, normalize, channels_first, format)
File "/home/nik/phd/repo/data_utility/.venv/lib/python3.10/site-packages/torchaudio/io/_compat.py", line 97, in load_audio
s = torch.classes.torchaudio.ffmpeg_StreamReader(src, format, None)
RuntimeError
(for flac)
Traceback (most recent call last):
File "/home/nik/phd/repo/data_utility/playground/example.py", line 39, in <module>
for x in dp:
File "/home/nik/phd/repo/data_utility/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/_typing.py", line 514, in wrap_generator
response = gen.send(None)
File "/home/nik/phd/repo/data_utility/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 116, in __iter__
yield self._apply_fn(data)
File "/home/nik/phd/repo/data_utility/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 81, in _apply_fn
return self.fn(data)
File "/home/nik/phd/repo/data_utility/playground/example.py", line 17, in audio_stream_to_tensor
audio_tensor, sample_rate = tab.load(data)
File "/home/nik/phd/repo/data_utility/.venv/lib/python3.10/site-packages/torchaudio/backend/sox_io_backend.py", line 227, in load
return _fallback_load(filepath, frame_offset, num_frames, normalize, channels_first, format)
File "/home/nik/phd/repo/data_utility/.venv/lib/python3.10/site-packages/torchaudio/io/_compat.py", line 97, in load_audio
s = torch.classes.torchaudio.ffmpeg_StreamReader(src, format, None)
RuntimeError: Failed to open the input "fLaC
This exception is thrown by __iter__ of MapperIterDataPipe(datapipe=TarArchiveLoaderIterDataPipe, fn=audio_stream_to_tensor, input_col=None, output_col=None)
However, simply using stream.seek(0) between tab.info() and tab.load() solves the issue for both TarArchiveLoader and ZipArchiveLoader. Is this something which is worth documenting?
Moreover, loading .flac files remains an issue for the sox_io backend. But I guess that now seems to be an issue related to torchaudio?
However, simply using stream.seek(0) between tab.info() and tab.load() solves the issue for both TarArchiveLoader and ZipArchiveLoader. Is this something which is worth documenting?
info consumes some bytes from the file-like object, so calling load afterwards fails unless the position of the input file object is reset first.
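The effect described here can be illustrated without torchaudio. In the sketch below, `looks_like_flac` is a hypothetical stand-in for a decoder's format probe, not a torchaudio function: it reads the first four bytes and compares them with the `fLaC` marker that FLAC files begin with. Calling it twice on the same stream shows how the second call starts from wherever the first one stopped.

```python
import io

FLAC_MAGIC = b"fLaC"

def looks_like_flac(stream):
    """Hypothetical probe: read 4 bytes from the current position and compare."""
    return stream.read(4) == FLAC_MAGIC

# Placeholder stream: the FLAC marker followed by zero bytes (not real audio).
stream = io.BytesIO(FLAC_MAGIC + bytes(32))

first_probe = looks_like_flac(stream)   # reads from offset 0
second_probe = looks_like_flac(stream)  # reads from offset 4, so the marker is missed
stream.seek(0)                          # reset the position, as suggested in the thread
third_probe = looks_like_flac(stream)   # reads from offset 0 again

print(first_probe, second_probe, third_probe)
```

The same pattern applies to any pair of readers sharing one file-like object: whoever reads second needs the position rewound first.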
Moreover, loading .flac files remains an issue for the sox_io backend. But I guess that now seems to be an issue related to torchaudio?
There have been recent reports about loading FLAC from file-like objects. I haven't looked into the details yet, but in the meantime I think an ffmpeg-based solution could work. Can you tell me what happens if you replace the load function with torchaudio.io._compat.load_audio_fileobj?
Replacing load with torchaudio.io._compat.load_audio_fileobj results in the flac stream correctly loading.
Similarly, replacing info with torchaudio.io._compat.info_audio_fileobj(stream, format='flac') results in the flac stream info loading.
AudioMetaData(sample_rate=16000, num_frames=0, num_channels=1, bits_per_sample=16, encoding=FLAC)
However, num_frames=0 is incorrect.
Using info(stream, format='flac') does work, but also gives an error (and num_frames=0 is wrong):
def audio_stream_to_tensor_and_meta(element):
    path, stream = element
    meta = torchaudio.info(stream, format='flac')
    stream.seek(0)
    audio_tensor, sample_rate = torchaudio.io._compat.load_audio_fileobj(stream)
    return audio_tensor, meta
formats: can't open input file `': FLAC ERROR whilst decoding metadata
tensor([[0.0044, 0.0033, 0.0031, ..., 0.0047, 0.0060, 0.0060]])
AudioMetaData(sample_rate=16000, num_frames=0, num_channels=1, bits_per_sample=16, encoding=FLAC)
Using only info(stream), without format="flac":
formats: can't open input file `': FLAC ERROR whilst decoding metadata
Traceback (most recent call last):
File "/home/nik/phd/repo/data_utility/playground/example.py", line 41, in <module>
for x in dp:
File "/home/nik/phd/repo/data_utility/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/_typing.py", line 514, in wrap_generator
response = gen.send(None)
File "/home/nik/phd/repo/data_utility/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 116, in __iter__
yield self._apply_fn(data)
File "/home/nik/phd/repo/data_utility/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 81, in _apply_fn
return self.fn(data)
File "/home/nik/phd/repo/data_utility/playground/example.py", line 26, in audio_stream_to_tensor_and_meta
meta = torchaudio.info(stream)
File "/home/nik/phd/repo/data_utility/.venv/lib/python3.10/site-packages/torchaudio/backend/sox_io_backend.py", line 99, in info
return _fallback_info_fileobj(filepath, format)
File "/home/nik/phd/repo/data_utility/.venv/lib/python3.10/site-packages/torchaudio/io/_compat.py", line 35, in info_audio_fileobj
s = torchaudio._torchaudio_ffmpeg.StreamReaderFileObj(src, format, None, 4096)
RuntimeError: Failed to open the input "StreamWrapper<<ExFileObject name='./tar/flac.tar'>>" (Invalid data found when processing input).
This exception is thrown by __iter__ of MapperIterDataPipe(datapipe=TarArchiveLoaderIterDataPipe, fn=audio_stream_to_tensor_and_meta, input_col=None, output_col=None)
FFMPEG output of the file:
$ ffmpeg -i playground/file/19-198-0000.flac
...
Input #0, flac, from 'playground/file/19-198-0000.flac':
Duration: 00:00:01.97, start: 0.000000, bitrate: 177 kb/s
Stream #0:0: Audio: flac, 16000 Hz, mono, s16
Reading from the file directly:
torchaudio.info("19-198-0000.flac")
AudioMetaData(sample_rate=16000, num_frames=31440, num_channels=1, bits_per_sample=16, encoding=FLAC)
Maybe you can try stream.file_obj.read() to get the bytes:
def audio_stream_to_tensor_and_meta(element):
    path, stream = element
    stream = stream.file_obj.read()
    ...
    return audio_tensor, meta
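One caveat with handing raw bytes to the loader: in the earlier attempt that passed `b"".join(stream)`, the traceback goes through `_fallback_load(filepath, ...)` rather than `_fallback_load_fileobj`, which suggests the bytes may have been treated as a path instead of as file content. A possible way around this, sketched below under that assumption, is to wrap the bytes in an `io.BytesIO` so the decoder still receives a seekable file-like object. The helper name `to_seekable_buffer` is hypothetical, the payload is placeholder data, and the actual decoder call is left out because its behaviour on this input has not been verified here.

```python
import io
import tarfile

def to_seekable_buffer(stream):
    """Read the whole stream into memory and return it as a rewound BytesIO."""
    return io.BytesIO(stream.read())

# Placeholder archive: one member holding the FLAC marker plus zero bytes.
payload = b"fLaC" + bytes(64)
archive = io.BytesIO()
with tarfile.open(fileobj=archive, mode="w") as tf:
    info = tarfile.TarInfo(name="19-198-0000.flac")
    info.size = len(payload)
    tf.addfile(info, io.BytesIO(payload))
archive.seek(0)

with tarfile.open(fileobj=archive, mode="r") as tf:
    member = tf.extractfile("19-198-0000.flac")
    buffer = to_seekable_buffer(member)

# The buffer outlives the archive and can be read, rewound, and read again.
header_first = buffer.read(4)
buffer.seek(0)
header_second = buffer.read(4)
buffer.seek(0)

print(header_first, header_second, len(buffer.getvalue()))
```

In the pipeline from this thread, the buffer would take the place of `stream` in the mapped function, with `seek(0)` still needed between an info call and a load call. The trade-off is that each file is held fully in memory, which is fine for short utterances but less so for long recordings.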