audio_utils improvements
What does this PR do?
Recently the audio_utils.py file was added to Transformers to provide shared functions for audio processing such as STFT. This PR aims to clean up the code and make the API more robust.
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [x] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [x] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
I cleaned up hertz_to_mel and mel_to_hertz a bit:
- more consistent doc comments
- both support single float inputs as well as numpy arrays
- simplified the formulas so they're no longer a literal copy of the librosa code and don't do pointless calculations
Since I think this implementation was based on librosa, we should also give them credit.
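To make the float-or-array behavior concrete, here is a minimal sketch using the standard HTK mel-scale formulas; this is only an illustration, not the actual audio_utils implementation (which may differ in details, e.g. Slaney-scale support):

```python
import numpy as np

# Standard HTK mel-scale formulas, shown only to illustrate accepting either
# a single float or a numpy array; not the actual audio_utils code.
def hertz_to_mel(freq):
    return 2595.0 * np.log10(1.0 + np.asarray(freq, dtype=float) / 700.0)

def mel_to_hertz(mels):
    return 700.0 * (10.0 ** (np.asarray(mels, dtype=float) / 2595.0) - 1.0)

print(hertz_to_mel(1000.0))                     # single float in, single value out
print(hertz_to_mel(np.array([100.0, 4000.0])))  # numpy array in, numpy array out
```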
I rewrote power_to_db and added amplitude_to_db. They still work like the librosa versions but with argument names that make more sense to me.
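For reference, this is the general shape of the conversion; the argument names below are my assumption based on the description above, not necessarily the final API:

```python
import numpy as np

# Illustrative sketch of a power-to-dB conversion: dB = 10 * log10(power / reference),
# with a floor to avoid log(0) and an optional dynamic-range clamp.
def power_to_db(power, reference=1.0, min_value=1e-10, db_range=None):
    power = np.asarray(power, dtype=float)
    log_spec = 10.0 * np.log10(np.maximum(power, min_value))
    log_spec -= 10.0 * np.log10(max(reference, min_value))
    if db_range is not None:
        # keep only the top `db_range` dB of dynamic range
        log_spec = np.maximum(log_spec, log_spec.max() - db_range)
    return log_spec

# amplitude_to_db follows the same pattern with a factor of 20 instead of 10,
# since power is the square of amplitude.
```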
Renamed get_mel_filter_banks to mel_filter_bank. Mostly renamed arguments and variables and cleaned up the doc comments, so that the naming is more in line with the rest of Transformers, e.g. num_frequency_bins instead of nb_frequency_bins.
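A hypothetical usage example following that num_* naming convention (argument names and values here are assumptions for illustration; check the docstring for the exact signature and defaults):

```python
from transformers.audio_utils import mel_filter_bank

# Hypothetical call; argument names follow the num_* convention described above.
mel_filters = mel_filter_bank(
    num_frequency_bins=257,   # (n_fft // 2) + 1 for an n_fft of 512
    num_mel_filters=80,
    min_frequency=0.0,
    max_frequency=8000.0,
    sampling_rate=16000,
)
print(mel_filters.shape)  # 2-D array mapping FFT frequency bins to mel bins
```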
Pushed significant changes to the STFT code:
- Removed `fram_wave`; this is really an implementation detail that should happen inside the STFT.
- The new `stft` gives the same results as librosa and torchaudio for the same options. It's 25% faster than the previous implementation, mostly due to using `rfft` instead of `fft` (since the input is always real-valued, not complex). See the sketch after this list for the general idea.
- librosa is still faster since it uses a bunch of tricks under the hood to avoid memory copies etc.; we can slowly work towards matching that speed (not super important to do immediately, since the new `stft` is already faster than what we had before).
- No batching yet.
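For the general idea, here is a minimal framing-plus-rfft sketch; this only illustrates the technique, not the PR's actual implementation, and the function name and parameters are made up for the example:

```python
import numpy as np

# Naive STFT: slice the waveform into overlapping frames, apply a window,
# then take the FFT of each frame.
def naive_stft(waveform, frame_length=400, hop_length=160, fft_length=512):
    window = np.hanning(frame_length)
    num_frames = 1 + (len(waveform) - frame_length) // hop_length
    frames = np.stack(
        [waveform[i * hop_length : i * hop_length + frame_length] for i in range(num_frames)]
    )
    # rfft suffices because the input is real-valued; it computes only the
    # non-redundant half of the spectrum, which is where the speedup comes from.
    return np.fft.rfft(frames * window, n=fft_length)  # (num_frames, fft_length // 2 + 1)
```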
I will be replacing the other hand-rolled STFTs with this soon (also in this PR).
None of the changes I made are set in stone — feel free to discuss things like the argument names, the shapes of the returned tensors, and so on.
Replaced the hand-rolled STFT in the different models with the one from audio_utils:
- CLAP
- M-CTC-T
- SpeechT5
- TVLT
- Whisper
Did not do audio_spectrogram_transformer and speech_to_text. These use ta_kaldi.fbank, which is simple enough and faster than audio_utils. If we want to get rid of torchaudio completely, we could also replace these.
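For context, this is the kind of ta_kaldi.fbank call those feature extractors rely on; the parameter values here are illustrative, not the models' actual settings:

```python
import torch
import torchaudio.compliance.kaldi as ta_kaldi

waveform = torch.randn(1, 16000)  # one second of dummy audio at 16 kHz, shape (channels, time)
features = ta_kaldi.fbank(waveform, num_mel_bins=80, sample_frequency=16000.0)
print(features.shape)  # (num_frames, num_mel_bins)
```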
@sanchit-gandhi @ArthurZucker I think this is ready for review now. Feel free to look at this with a critical eye!
The STFT code is currently written for ease of understanding and flexibility, not speed, although it does outperform the previous methods we were using.
@sanchit-gandhi @ArthurZucker Are you OK with the PR in its current state? Then I can ask a core maintainer for a final review.
Took a second look through and the changes LGTM @hollance!
If everyone's happy with it, feel free to merge (I don't have rights).