[discussion] Batched CPU/GPU audio decoding / encoding
🚀 The feature
GPU audio decoding, at least for some codecs, would enable wider use of compressed audio for training ASR models.
Maybe some neural codecs (I think Google open-sourced a few) would be more amenable to batched GPU decoding and to integration directly into models.
Hi
I am not aware of a GPU codec library for audio. Do you know one?
I'm also not aware of one, but maybe Lyra / SoundStream / LPCNet could be implemented on GPU (except maybe the entropy coding). Also, even just some recommended codecs / settings optimized for fast decoding with DataLoader in a DDP regime would be beneficial for the community (e.g. how to configure things so that there is no thread oversubscription, etc.) — something like the sketch below.
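For concreteness, here is a minimal sketch of the kind of setting I mean: capping intra-op threads in each DataLoader worker so that workers do not oversubscribe the CPU. All numbers and paths are illustrative, not recommendations.

```python
import torch
import torchaudio
from torch.utils.data import DataLoader, Dataset

class WavDataset(Dataset):
    """Toy dataset that decodes one file per item."""
    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        waveform, sample_rate = torchaudio.load(self.paths[idx])
        return waveform

def worker_init_fn(worker_id):
    # Each worker is already a separate process; keeping it single-threaded
    # avoids num_workers * OMP_NUM_THREADS oversubscription.
    torch.set_num_threads(1)

loader = DataLoader(
    WavDataset(["a.wav", "b.wav"]),   # hypothetical file list
    batch_size=8,
    num_workers=4,
    worker_init_fn=worker_init_fn,
    collate_fn=lambda batch: batch,   # waveforms differ in length; keep a list
)
```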
We do not have a good recommendation for fast loading at the moment. Part of the reason is that the primary focus of the library has been providing domain-specific features that PyTorch does not provide natively, so we were focusing on the components rather than the pipeline. However, this is changing as the library matures and the user base grows. We do acknowledge the demand for complex yet efficient data loading, and we are aware that we lack an integrated view of these components.
Having said that, there are a couple of things I am considering for efficient data loading (some of them are very random thoughts). tl;dr: I am thinking that re-designing the whole data loading experience will give more choices of solutions. It seems to me that libffcv, which you mentioned in #1994, takes a similar approach.
- The use of a faster decoding library. This has already been suggested. Idea-wise it is straightforward, but engineering-wise it is a huge effort, as we need to think about multiple platforms and packaging, so I have not gotten around to working on it. According to @faroit's benchmark (https://github.com/faroit/python_audio_loading_benchmark), libsndfile is faster for loading WAV files (see the first sketch after this list). I also check from time to time whether NVIDIA provides codecs for GPUs, like they have for video and JPEG, but I do not think such a thing will happen.
- fp16 https://github.com/pytorch/audio/issues/2097
- Non-blocking loading: https://github.com/pytorch/audio/issues/1628. To improve the wall-time experience, I think performing decoding in the background while minimizing data transfer / memory copies is critical. The best way to achieve that is still unclear; it could be a surrogate object for the tensor, a decoder class implementation, or something else (the second sketch after this list shows one possible shape of the idea).
- ~The use of torchdata to reimplement dataset~ [torchdata project has been halted] ~Another key aspect is that even if non-blocking loading is available, if the Tensor is required immediately, the client code will be blocked on it. So I am wondering if the use of torchdata could help pushing the decoding to background while the client code is doing something other in foreground. However, from what I know, the nice feature torchdata provides is the composability/reusability of dataset implementation, I am not sure.~
- Use of a file-backed mmap cache for decoded data? People often mention libraries like Apache Arrow for faster data access; notably, Hugging Face datasets uses it and advertises how fast it is. I do not think the columnar format itself would help, but fast access to memory-mapped data could. I am still researching this. I do not think mmap-ing the binary data before decoding would help overall performance, so maybe, in a training situation, using it to cache decoded waveforms could help (third sketch below). This needs a PoC, but it is just an idea at the moment.
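Regarding the first bullet, a minimal sketch of what "use a faster decoding library" could look like, going through libsndfile via the python-soundfile binding (assumed installed; this is not a torchaudio API):

```python
import soundfile as sf   # thin binding over libsndfile
import torch

def load_wav_libsndfile(path):
    # always_2d=True yields a (frames, channels) array even for mono files
    data, sample_rate = sf.read(path, dtype="float32", always_2d=True)
    # transpose to torchaudio's (channels, frames) convention
    return torch.from_numpy(data).t(), sample_rate
```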
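For the non-blocking bullet, one possible shape of the "surrogate object" idea, using a plain Future as the stand-in for the tensor (an illustration, not a proposed API):

```python
from concurrent.futures import ThreadPoolExecutor
import torchaudio

_pool = ThreadPoolExecutor(max_workers=4)

def load_async(path):
    # Returns immediately; decoding proceeds in a background thread.
    return _pool.submit(torchaudio.load, path)

future = load_async("a.wav")              # hypothetical path
# ... client code does other work in the foreground ...
waveform, sample_rate = future.result()   # blocks only here
```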
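And for the last bullet, a sketch of the waveform-caching idea: decode once, persist the raw samples, and memory-map them on later epochs. Paths are illustrative; this is the PoC mentioned above, not existing functionality.

```python
import os
import numpy as np
import torch
import torchaudio

def cached_waveform(path, cache_path):
    # cache_path is assumed to end in ".npy"
    if not os.path.exists(cache_path):
        waveform, _ = torchaudio.load(path)
        np.save(cache_path, waveform.numpy())
    # mmap_mode="r" maps the cached file instead of reading it eagerly;
    # the resulting tensor shares the read-only mapping.
    return torch.from_numpy(np.load(cache_path, mmap_mode="r"))
```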
Regarding parallelized (intra-file, for large files) audio decoding:
- https://github.com/enzo1982/freac/issues/505
- https://github.com/xiph/opus/issues/289
- https://github.com/Harinlen/GPUraku
It might be possible to have parallelized GPU decoding of some lightly-compressed files (e.g. FLAC or some other relatively simple audio codec designed for fast, branchless, parallelizable decoding).
It might also be good to integrate some of Facebook's neural codec code into torchaudio to widen exposure and usage, as neural codecs are the most amenable to fast GPU-based decoding :)
Also, batched parallelized audio reading/decoding could be used to speed up simple high-level methods like whisper's model.transcribe(['audio1.opus', 'audio2.opus', ...]): https://github.com/openai/whisper/discussions/662#discussioncomment-7524821. Probably the right way to do this would be to always return a NestedTensor as output (and to allow finer control if out= is provided). It might also be interesting to support some sort of background-processing mode that returns a LazyTensor-style output immediately. A rough sketch of the batched-read part is below.
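This is a sketch, not a proposal for a concrete API; torch.nested is still a prototype, so treat the packing step as illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
import torch
import torchaudio

def load_batch(paths, num_threads=8):
    # Decode files concurrently; each result is a (channels, frames) tensor.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        waveforms = [w for w, _ in pool.map(torchaudio.load, paths)]
    # Pack the variable-length waveforms without padding.
    return torch.nested.nested_tensor(waveforms)
```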
Regarding WAV loading, I don't think it can go beyond simply reading from disk in a single large chunk, as python's wave package or scipy's scipy.io.wavfile.read do - including, btw, mmap, which in some cases may allow amortizing the memory-access or disk-reading cost. I think PyTorch needs a similar builtin simple function for dealing with such simple file formats (WAV/PPM/OBJ/CSV etc.).
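For reference, the scipy call mentioned above already exposes the mmap variant (the path here is hypothetical):

```python
import numpy as np
import scipy.io.wavfile
import torch

# mmap=True maps the file's data chunk instead of eagerly copying it
sample_rate, data = scipy.io.wavfile.read("audio.wav", mmap=True)
# the mapping is read-only; copy once a writable tensor is needed
waveform = torch.from_numpy(np.array(data))
```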
I think one needs a C/C++ thread pool library to implement true batch decoding. I have some ideas, but I feel that it is not a good fit for torchaudio or other domain libraries.
Maybe some standard OpenMP threading would cut it?
Might be a better fit for this new i/o package :)
> Maybe some standard OpenMP threading would cut it?
PyTorch uses OpenMP as well, so I think it is better to have separate parallelism so that the two can be configured independently. I also feel it would be better to have different parallelism for the I/O-bound part (file access and networking) and the CPU-bound part (decoding); a toy sketch is below.
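A toy illustration of that split: an oversized thread pool for the I/O-bound stage and a small process pool for the CPU-bound decode, neither touching PyTorch's own OpenMP thread count. The pool sizes and the file-like torchaudio.load usage are assumptions, not recommendations.

```python
import io
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import torchaudio

io_pool = ThreadPoolExecutor(max_workers=16)    # mostly waiting on disk/network
cpu_pool = ProcessPoolExecutor(max_workers=4)   # roughly one per core
# (spawn-based platforms need the usual `if __name__ == "__main__":` guard)

def _decode(raw: bytes):
    # file-like input may need an explicit format= with some backends
    return torchaudio.load(io.BytesIO(raw))

def load(path):
    # stage 1 (I/O-bound): fetch the compressed bytes
    raw = io_pool.submit(lambda: open(path, "rb").read()).result()
    # stage 2 (CPU-bound): decode in a separate, independently sized pool
    return cpu_pool.submit(_decode, raw).result()
```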
> Might be a better fit for this new i/o package :)
The new i/o package under discussion is upstream of the existing domain libraries. I have a feeling that such a serious project would be better started outside of the existing context.