audio_utils improvements
What does this PR do?
Recently the audio_utils.py file was added to Transformers to provide shared functions for audio processing such as STFT. This PR aims to clean up the code and make the API more robust.
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [x] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [x] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
I cleaned up hertz_to_mel and mel_to_hertz a bit:
- more consistent doc comments
- both support single float inputs as well as numpy arrays
- simplified the formulas so they're no longer a literal copy of the librosa code and don't do pointless calculations
Since I think this implementation was based on librosa, we should also give them credit.
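To make the float-or-array behavior concrete, here is a minimal sketch using the standard HTK mel-scale formulas; this is only an illustration, not the actual audio_utils implementation (which may differ in details, e.g. Slaney-scale support):

```python
import numpy as np

# Standard HTK mel-scale formulas, shown only to illustrate accepting either
# a single float or a numpy array; not the actual audio_utils code.
def hertz_to_mel(freq):
    return 2595.0 * np.log10(1.0 + np.asarray(freq, dtype=float) / 700.0)

def mel_to_hertz(mels):
    return 700.0 * (10.0 ** (np.asarray(mels, dtype=float) / 2595.0) - 1.0)

print(hertz_to_mel(1000.0))                     # single float in, single value out
print(hertz_to_mel(np.array([100.0, 4000.0])))  # numpy array in, numpy array out
```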
I rewrote power_to_db and added amplitude_to_db. They still work like the librosa versions but with argument names that make more sense to me.
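For reference, this is the general shape of the conversion; the argument names below are my assumption based on the description above, not necessarily the final API:

```python
import numpy as np

# Illustrative sketch of a power-to-dB conversion: dB = 10 * log10(power / reference),
# with a floor to avoid log(0) and an optional dynamic-range clamp.
def power_to_db(power, reference=1.0, min_value=1e-10, db_range=None):
    power = np.asarray(power, dtype=float)
    log_spec = 10.0 * np.log10(np.maximum(power, min_value))
    log_spec -= 10.0 * np.log10(max(reference, min_value))
    if db_range is not None:
        # keep only the top `db_range` dB of dynamic range
        log_spec = np.maximum(log_spec, log_spec.max() - db_range)
    return log_spec

# amplitude_to_db follows the same pattern with a factor of 20 instead of 10,
# since power is the square of amplitude.
```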
Renamed get_mel_filter_banks to mel_filter_bank. Mostly renamed arguments and variables and cleaned up the doc comments, so that the naming is more in line with the rest of Transformers, e.g. num_frequency_bins instead of nb_frequency_bins.
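A hypothetical usage example following that num_* naming convention (argument names and values here are assumptions for illustration; check the docstring for the exact signature and defaults):

```python
from transformers.audio_utils import mel_filter_bank

# Hypothetical call; argument names follow the num_* convention described above.
mel_filters = mel_filter_bank(
    num_frequency_bins=257,   # (n_fft // 2) + 1 for an n_fft of 512
    num_mel_filters=80,
    min_frequency=0.0,
    max_frequency=8000.0,
    sampling_rate=16000,
)
print(mel_filters.shape)  # 2-D array mapping FFT frequency bins to mel bins
```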
Pushed significant changes to the STFT code:
- Removed `fram_wave`; this is really an implementation detail that should happen inside the STFT.
- The new `stft` gives the same results as librosa and torchaudio for the same options. It's 25% faster than the previous implementation, mostly due to using `rfft` instead of `fft` (since the input is always real-valued, not complex). See the sketch after this list for the general idea.
- librosa is still faster since it uses a bunch of tricks under the hood to avoid memory copies etc.; we can slowly work towards matching that speed (not super important to do immediately, since the new `stft` is already faster than what we had before).
- No batching yet.
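For the general idea, here is a minimal framing-plus-rfft sketch; this only illustrates the technique, not the PR's actual implementation, and the function name and parameters are made up for the example:

```python
import numpy as np

# Naive STFT: slice the waveform into overlapping frames, apply a window,
# then take the FFT of each frame.
def naive_stft(waveform, frame_length=400, hop_length=160, fft_length=512):
    window = np.hanning(frame_length)
    num_frames = 1 + (len(waveform) - frame_length) // hop_length
    frames = np.stack(
        [waveform[i * hop_length : i * hop_length + frame_length] for i in range(num_frames)]
    )
    # rfft suffices because the input is real-valued; it computes only the
    # non-redundant half of the spectrum, which is where the speedup comes from.
    return np.fft.rfft(frames * window, n=fft_length)  # (num_frames, fft_length // 2 + 1)
```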
I will be replacing the other hand-rolled STFTs with this soon (also in this PR).
None of the changes I made are set in stone — feel free to discuss things like the argument names, the shapes of the returned tensors, and so on.
Replaced the hand-rolled STFT in the different models with the one from audio_utils:
- CLAP
- M-CTC-T
- SpeechT5
- TVLT
- Whisper
Did not do audio_spectrogram_transformer and speech_to_text. These use ta_kaldi.fbank, which is simple enough and faster than audio_utils. If we want to get rid of torchaudio completely, we could also replace these.
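For context, this is the kind of ta_kaldi.fbank call those feature extractors rely on; the parameter values here are illustrative, not the models' actual settings:

```python
import torch
import torchaudio.compliance.kaldi as ta_kaldi

waveform = torch.randn(1, 16000)  # one second of dummy audio at 16 kHz, shape (channels, time)
features = ta_kaldi.fbank(waveform, num_mel_bins=80, sample_frequency=16000.0)
print(features.shape)  # (num_frames, num_mel_bins)
```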
@sanchit-gandhi @ArthurZucker I think this is ready for review now. Feel free to look at this with a critical eye!
The STFT code is currently written for ease of understanding and flexibility, not speed, although it does outperform the previous methods we were using.
@sanchit-gandhi @ArthurZucker Are you OK with the PR in its current state? Then I can ask a core maintainer for a final review.
Took a second look through and the changes LGTM @hollance!
If everyone's happy with it, feel free to merge (I don't have rights).