Pre-emphasis and its variants?
🚀 The feature
I would like to have pre-emphasis filtering available as an audio processing step, and I could possibly add it myself.
Motivation, pitch
Pre-emphasis boosts the energy in the high frequencies relative to the low ones, which matters most for voiced segments. It is beneficial at least for speaker verification, and I believe for other tasks as well.
Alternatives
Going further, there are other linear/nonlinear filtering and normalization operations that could be integrated, most of which can be sourced from other audio toolkits like librosa. But I think we should focus on pre-emphasis in torchaudio.transforms and torchaudio.functional first.
Additional context
No response
Hi @underdogliu, thanks for the suggestion! We don't have objections against supporting pre-emphasis, but were wondering if you could elaborate a bit more on what you're referring to for the variants, and if there's any existing implementation/paper/references you can link regarding this?
@carolineechen Sorry for the late reply. Been bothered with many things in parallel.
So pre-emphasis is nothing but a first-order time-domain FIR filter: y[n] = x[n] - alpha * x[n-1]. By "variants" I mean other types of filters that could be used to flatten the spectrum. But of course, we can just start with a minimal version. The final decision is yours.
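To make the minimal version concrete, here is a rough sketch in plain torch. The function name `pre_emphasis`, the default alpha of 0.97, and the choice to treat the sample before the first one as zero are my assumptions for illustration, not an existing torchaudio API.

```python
import torch


def pre_emphasis(waveform: torch.Tensor, alpha: float = 0.97) -> torch.Tensor:
    """Hypothetical helper: y[n] = x[n] - alpha * x[n-1] along the last (time) axis.

    Assumes x[-1] = 0, so the first sample passes through unchanged.
    """
    out = waveform.clone()
    # Subtract the scaled previous sample from every sample except the first.
    out[..., 1:] -= alpha * waveform[..., :-1]
    return out


# A constant (DC) signal is attenuated after the first sample.
x = torch.tensor([1.0, 1.0, 1.0, 1.0])
y = pre_emphasis(x, alpha=0.5)  # values: 1.0, 0.5, 0.5, 0.5
```

The same function works on batched input of shape (..., time) because it only indexes the last axis.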
One reference: https://mini.dcs.shef.ac.uk/wp-content/papercite-data/pdf/loweimi_nolisp13.pdf
@underdogliu got it, yea I think adding standard pre-emphasis to torchaudio transforms and functional (under filtering) could be a good starting point! Is this something you're interested in working on?
also quick question, would we need to add a corresponding de-emphasis function for this to be useful, or is that not necessary or already handled by torchaudio's deemph_biquad function?
Yeah, if necessary I am happy to spend some time developing it while getting more familiar with how torchaudio works. Of course, such a first-order time-domain FIR filter can be regarded as a special case of the biquad function (b_0=1, b_1=-alpha, a_0=1, all other coefficients zero).
Speaking of that function, I also have a question that may be naive: when I was checking it, I found that most of the simple computations are done with Python's math module instead of torch. Is that because we are handling scalars? I am not sure that still works when we want to make certain parameters learnable (analogous to PCEN and learnable STFT).
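To illustrate the concern: values computed with math are plain Python floats, so autograd cannot track them. If alpha is kept as a tensor parameter and only torch operations touch it, gradients do flow. The module below is a hypothetical sketch (the class name `LearnablePreEmphasis` and the initial value are made up for illustration), not something that exists in torchaudio.

```python
import torch
from torch import nn


class LearnablePreEmphasis(nn.Module):
    """Hypothetical module: pre-emphasis with a trainable coefficient."""

    def __init__(self, init_alpha: float = 0.97):
        super().__init__()
        # Stored as a Parameter so an optimizer can update it.
        self.alpha = nn.Parameter(torch.tensor(init_alpha))

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # Shift right by one sample along time, filling the first slot with 0.
        prev = nn.functional.pad(waveform, (1, 0))[..., :-1]
        return waveform - self.alpha * prev


layer = LearnablePreEmphasis()
x = torch.randn(2, 16000)
loss = layer(x).pow(2).mean()  # dummy loss, only to produce a gradient
loss.backward()
print(layer.alpha.grad)  # a scalar tensor rather than None
```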
@underdogliu a good start might be adopting https://github.com/csteinmetz1/auraloss/blob/main/auraloss/perceptual.py#L39
I hope we will be able to implement the pre-emphasis filtering with torchaudio.functional.lfilter. Can somebody please comment on this?
addressed in #2871