
torchaudio.compliance.kaldi.fbank dither is different from kaldi

Open yzwu2017 opened this issue 3 years ago • 4 comments

🐛 Describe the bug

The function `torchaudio.compliance.kaldi.fbank` has a `dither` option. It calls `_get_window()`, where dithering adds random noise to `strided_input`. The noise is generated with a Box-Muller-style expression, but `x.log()` and `x` are computed from the same uniform random tensor, so the resulting `rand_gauss` is not Gaussian distributed.

In the Kaldi source code, by contrast, the dither noise is genuinely Gaussian distributed.
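To make the discrepancy concrete, here is a small sketch (NumPy stands in for torch so it runs standalone; the names `flawed` and `normal` are illustrative, not torchaudio's actual code) comparing the same-uniform variant against a true standard normal sample:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Variant matching the reported bug: the SAME uniform draw u feeds both
# the log (radius) and cos (angle) parts of the Box-Muller formula.
u = rng.uniform(1e-10, 1.0, size=n)  # keep u > 0 so log() is finite
flawed = np.sqrt(-2.0 * np.log(u)) * np.cos(2.0 * np.pi * u)

# Reference: a genuine standard normal sample.
normal = rng.standard_normal(n)

# Wherever cos(2*pi*u) is negative we have u >= 0.25, so the magnitude is
# capped at sqrt(-2*log(0.25)) ~ 1.67: the flawed variant has NO left tail.
print(flawed.min())   # bounded below, around -1.67
print(normal.min())   # well below -4 for a sample this large
```

The missing left tail is the clearest symptom: a true standard normal sample of a million points reaches far below -4, while the coupled-uniform variant can never go below about -1.67.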

Versions

v0.12.1 (stable release)

yzwu2017 avatar Aug 18 '22 07:08 yzwu2017

I can try to submit a PR for this, if it's still open.

Also, @yzwu2017, I'm assuming `RandGauss` here is a standard normal? A link to the Kaldi source file referenced here would be helpful as well.

ravicodelabs avatar Aug 19 '22 14:08 ravicodelabs

Update: It looks like RandGauss here corresponds to the following, so I believe it should be a standard normal.

```cpp
inline float RandGauss(struct RandomState* state = NULL) {
  return static_cast<float>(sqrtf (-2 * Log(RandUniform(state)))
                            * cosf(2*M_PI*RandUniform(state)));
}
```

source: https://github.com/kaldi-asr/kaldi/blob/cbed4ff688a172a7f765493d24771c1bd57dcd20/src/base/kaldi-math.h
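For reference, here is my own Python port of that expression (not Kaldi code): with two *independent* uniform draws, the Box-Muller transform does produce a standard normal, consistent with the assumption above.

```python
import numpy as np

def rand_gauss(rng: np.random.Generator, n: int) -> np.ndarray:
    """Box-Muller transform in the style of Kaldi's RandGauss: two
    INDEPENDENT uniform draws, one for the radius and one for the angle."""
    u1 = rng.uniform(1e-12, 1.0, size=n)  # keep u1 > 0 so log() is finite
    u2 = rng.uniform(0.0, 1.0, size=n)
    return np.sqrt(-2.0 * np.log(u1)) * np.cos(2.0 * np.pi * u2)

samples = rand_gauss(np.random.default_rng(0), 1_000_000)
print(samples.mean(), samples.std())  # close to 0 and 1
```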

Also for context, it looks like the previously referenced `Dither` code is from here:

ravicodelabs avatar Aug 19 '22 14:08 ravicodelabs

It looks like the distribution is skewed due to torch.max(epsilon, ...). How about using torch.randn directly?

mthrok avatar Aug 19 '22 21:08 mthrok

I see the bug now, nice catch @yzwu2017!

I would agree that using `torch.randn` would be a good solution - it should give results consistent with Kaldi and is also easier to understand. I can submit a PR for this.

As a side note: I don't think the distribution would be skewed, though. In particular, I think the `max(epsilon, ...)` was put there to avoid generating a uniform variate that is exactly $0$, which would cause issues afterward when the logarithm is taken. Of course, this is irrelevant if we use `torch.randn` instead of the Box-Muller transform, which makes that a good solution on this count too, I think.
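A sketch of what the fix could look like (using NumPy's `standard_normal` as a stand-in so it runs standalone; `apply_dither` is a hypothetical helper, and in torchaudio's `_get_window` the equivalent would be noise drawn with `torch.randn` and scaled by `dither`):

```python
import numpy as np

def apply_dither(frames: np.ndarray, dither: float,
                 rng: np.random.Generator) -> np.ndarray:
    """Add Gaussian dither by drawing the noise directly from a standard
    normal, instead of a hand-rolled Box-Muller with a shared uniform."""
    if dither == 0.0:
        return frames
    return frames + dither * rng.standard_normal(frames.shape)

frames = np.zeros((3, 400))  # 3 frames of 400 samples, silence
dithered = apply_dither(frames, 1.0, np.random.default_rng(0))
print(dithered.std())  # about 1.0, i.e. noise at the dither scale
```

Drawing from the normal distribution directly also sidesteps the `max(epsilon, ...)` guard entirely, since there is no logarithm to protect.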

ravicodelabs avatar Aug 20 '22 19:08 ravicodelabs