Fbank features are different from Kaldi Fbank

Open jooan84 opened this issue 5 years ago • 11 comments

🐛 Bug

The output of the fbank feature calculations differs from that of kaldi.

To Reproduce

Steps to reproduce the behavior:

Using the following parameters, or even the defaults:

 torchaudio.compliance.kaldi.fbank(waveform, blackman_coeff=0.42, channel=-1, dither=1.0, energy_floor=0.0, frame_length=25.0, frame_shift=10.0, high_freq=0.0, htk_compat=True, low_freq=20.0, min_duration=0.0, num_mel_bins=40, preemphasis_coefficient=0.97, raw_energy=True, remove_dc_offset=True, round_to_power_of_two=True, sample_frequency=16000.0, snip_edges=True, subtract_mean=False, use_energy=False, use_log_fbank=True,use_power=True, vtln_high=-500.0, vtln_low=100.0, vtln_warp=1.0, window_type='hamming')[0]

produce this output:

tensor([-0.7616, -0.4791,  0.2155,  0.7661,  2.0723,  1.4565,  2.9888,  3.2548,
         1.8460,  3.5807,  3.8290,  4.1785,  4.6776,  4.5801,  5.3610,  4.4910,
         5.1519,  5.3534,  5.2783,  5.6159,  6.0689,  5.5961,  5.8068,  5.0957,
         6.5200,  6.9314,  6.1741,  7.0430,  7.9394,  8.2380,  8.7115,  8.4105,
         8.3154,  8.2186,  7.9444,  8.4468,  8.4293,  8.9476,  9.1008,  9.2495])

with compute_fbank_feats of Kaldi

tensor([12.9911, 12.9795, 12.9127, 13.6171, 13.7416, 15.1579, 15.1996, 14.9468,
        14.1368, 14.8717, 14.8265, 13.8715, 15.2716, 15.0743, 15.2439, 15.3904,
        13.9460, 13.5932, 14.0038, 14.8721, 13.9944, 15.8337, 14.8682, 13.8247,
        15.0769, 15.1141, 15.1482, 14.7864, 13.6259, 14.4092, 14.1771, 13.6139,
        13.8014, 12.5796,  9.1051,  8.3382,  8.3738,  8.7829,  9.2973,  9.4913])

jooan84 avatar Jan 10 '20 15:01 jooan84

  • Can you provide the kaldi command you used?
  • Can you provide a sample file so we can reproduce?
  • Note that you are using dither=1.0 which adds dither.
  • See also #332.
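
On the dither point, here is a rough sketch of why dither=1.0 matters far more for a waveform in [-1, 1] than for one in the int16 value range. It assumes dither is added as zero-mean Gaussian noise whose standard deviation equals the dither value; the toy sine wave and the helper `snr_db` are illustrative only and are not part of torchaudio or Kaldi.

```python
import math

def snr_db(samples, dither):
    """Signal-to-dither ratio in dB, assuming Gaussian dither with std == dither."""
    signal_power = sum(x * x for x in samples) / len(samples)
    return 10 * math.log10(signal_power / dither ** 2)

# One second of a 440 Hz tone at 16 kHz, in the float range [-1, 1].
n = 16000
unit_scale = [0.5 * math.sin(2 * math.pi * 440 * i / 16000) for i in range(n)]
# The same tone in the int16 value range, kept as float.
int16_scale = [x * 2 ** 15 for x in unit_scale]

snr_unit = snr_db(unit_scale, 1.0)    # negative: dither is louder than the signal
snr_int16 = snr_db(int16_scale, 1.0)  # large and positive: dither is negligible
print(snr_unit, snr_int16)
```

Under that assumption, comparing outputs with dither=0.0 on both sides removes the randomness and makes any remaining difference easier to diagnose.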

vincentqb avatar Jan 13 '20 16:01 vincentqb

I looked into this, and it took a while to figure out why.

When you use the fbank function, you need to normalize the audio, and for that you need to use the torchaudio.load_wav function instead of torchaudio.load.

See my test or existing test.

This is extremely subtle.

mthrok avatar Apr 17 '20 23:04 mthrok

@mthrok - should we add documentation about this or otherwise try to prevent this issue coming up again in the future? I'm surprised we have a need for a separate load_wav to begin with.

cpuhrsch avatar Apr 29 '20 04:04 cpuhrsch

I second @cpuhrsch: I'm also surprised that torchaudio.load does not work here.

vincentqb avatar Apr 29 '20 05:04 vincentqb

I don't believe we should rely on load_wav to fix this issue.

vincentqb avatar May 01 '20 21:05 vincentqb

edit: After some testing, it seems that to get the closest match one has to skip normalisation and instead multiply by 2**15?

@mthrok normalising audio does not help for me, code:

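    # Imports inferred from the aliases used below (assumed, not in the original post).
    import kaldi_io
    import soundfile as sf
    import torch as to
    from torchaudio.compliance.kaldi import fbank

    data, fs = sf.read('/idiap/resource/database/LibriSpeech/train-clean-360/100/121669/100-121669-0000.flac')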
    data = to.from_numpy(data).float()
    data /= data.max()
    f = fbank(data.unsqueeze(0), num_mel_bins=40, low_freq=40, high_freq=7600)

    kaldi_feats = None
    for uttid, m in kaldi_io.read_mat_scp('scp:feats.scp'):
        kaldi_feats = m
    print(uttid)
    print(kaldi_feats[:2])
    print(f[:2])

output

    100-121669-0000-1
    [[ 8.129056   7.732553   7.6204824  6.776312   7.437045   8.823427
   8.736998   8.304144   8.411314   8.19662    6.130655   8.646175
   9.119083   9.085771   8.314858   9.277414   9.7172785  9.830122
   9.228786   9.078177   9.063866   9.667826   8.975353   9.46149
   9.655378   9.932469   9.935007  10.056624   9.357061  10.264997
  10.36901   10.563572  10.689384  11.149243  11.518983  10.866757
  10.359279  10.542366  11.021458  10.561819 ]
 [ 8.081877   7.8777122  6.87261    8.406      9.237014   8.542725
   7.0748315  7.555811   8.742043   9.1879     7.651375   7.56339
   8.07299    9.343008   9.155113   9.235215   9.285145   9.729772
   9.2692585  9.870285  10.123455   9.58822    9.321457   9.46149
   9.285657   9.631441  11.042232  10.012186   9.731838   9.504875
  10.895826  10.652676  10.899666  10.996901  10.666897  11.006931
  10.998066  11.225334  11.071218  10.741457 ]]
tensor([[-10.5861, -10.9795, -11.1278, -11.9309, -11.2997,  -9.8805,  -9.8985,
         -10.3205, -10.3428, -10.5305, -12.9941, -10.0486,  -9.5567,  -9.6991,
         -10.3325,  -9.3442,  -8.9814,  -8.8237,  -9.3472,  -9.6113,  -9.7424,
          -8.9508,  -9.7846,  -9.3923,  -8.8430,  -8.8997,  -8.7163,  -8.5314,
          -9.2710,  -8.6714,  -8.3952,  -8.3978,  -8.0870,  -7.5590,  -7.4100,
          -7.9227,  -8.4362,  -8.7195,  -8.0624,  -8.5884],
        [-10.5894, -10.7786, -11.8293, -10.2971,  -9.4618, -10.1934, -11.7973,
         -11.3098, -10.0636,  -9.5083, -10.8814, -11.2168, -10.6213,  -9.4451,
          -9.5788,  -9.5073,  -9.5189,  -8.9797,  -9.5143,  -8.6416,  -8.4359,
          -9.1466,  -9.2892,  -9.3173,  -9.4014,  -9.2642,  -7.6490,  -8.6838,
          -9.0432,  -9.5034,  -7.9339,  -7.9784,  -7.9248,  -7.8987,  -8.2526,
          -8.2896,  -8.0052,  -7.9586,  -8.1519,  -8.1042]])

~~Also in my opinion if this is an important requirement then the function should check that the max is equal to 1 and warn otherwise.~~

Btw I don't think it's good to make the assumption of normalising audio as you can't do this in a realtime setting.

RuABraun avatar Jan 04 '21 21:01 RuABraun

Hi @RuABraun

As you figured out, normalization here means dtype conversion, that is, float (with value range [-1, 1]) to int16 (with value range [-32,768, 32,767]).
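
To make the size of the effect concrete, here is a minimal sketch of why the two value ranges give log energies that differ by a roughly constant offset. It assumes dither is off and power (not magnitude) spectra are used. Scaling the waveform by c scales the power spectrum by c², so each log energy shifts by 2·ln(c). A single DFT bin stands in for one log-mel value; the toy signal and the helper `log_power_bin` are illustrative and are not torchaudio code.

```python
import cmath
import math

def log_power_bin(samples, k):
    """Natural log of the power in DFT bin k (a stand-in for one log-mel value)."""
    n = len(samples)
    coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples))
    return math.log(abs(coeff) ** 2)

# Toy signal in the float range [-1, 1]: four cycles of a sine over 64 samples.
n = 64
unit_scale = [0.5 * math.sin(2 * math.pi * 4 * i / n) for i in range(n)]
# The same signal in the int16 value range, kept as float.
int16_scale = [x * 2 ** 15 for x in unit_scale]

offset = log_power_bin(int16_scale, 4) - log_power_bin(unit_scale, 4)
expected = 2 * math.log(2 ** 15)  # about 20.79
print(offset, expected)
```

The gaps reported earlier in this thread are of this order but not exactly this value, which is plausible when dither is on or when the signal was rescaled with data.max() rather than left in its native range.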

According to my recent talk with @cpuhrsch, this fbank feature is not intended to match Kaldi's implementation precisely.

I found that our test suite for this function, which I thought was covering this, was not sufficient, and that the function does not match Kaldi's result.

I personally think that it is more confusing to have a module named compliance which is implicitly not meant to match. Also, we are getting rid of the load_wav function, so we do need to change things around the compliance.kaldi module.

To lower the maintenance cost, I am in favor of building Kaldi and binding to it, which guarantees that all the Kaldi-related features match Kaldi's results exactly, but that opinion is not getting support from anyone.

A similar issue is raised in #328.

mthrok avatar Jan 05 '21 15:01 mthrok

Thank you for the explanation! :)

RuABraun avatar Jan 05 '21 17:01 RuABraun

We also had the same problem two days ago with the setting subtract_mean = False. We compared the results of torchaudio's fbank and Kaldi's compute-fbank-feats line by line. The differences came from the scale of the input values. It is really confusing that the input to torchaudio's fbank should be floating-point numbers in the range [-32,768., 32,767.] (not float in [-1., 1.] or int16 in [-32,768, 32,767]). We fixed the problem by loading a 16-bit .wav file with dtype='int16' and converting the sample values to float directly, without any normalization. For example, we converted the value -3 to -3.0. After the fix, the results are:

err between kaldi and torchaudio res (1.2798111e-07, 1.177518e-05, 7.390976e-05)
kaldi res: tensor([[ 7.7390,  6.6414,  5.9847,  ..., 11.2153, 10.8115, 10.6624],
        [ 8.3844,  7.3069,  5.6935,  ..., 11.3059, 11.9750, 11.1324],
        [ 5.9230,  4.5791,  6.4441,  ..., 11.5842, 12.4497, 11.9442],
        ...,
        [ 8.3075,  7.4419,  6.3531,  ...,  8.7440,  9.0616,  8.9001],
        [ 7.9940,  7.1240,  4.0873,  ...,  8.4048,  8.6729,  9.1240],
        [ 8.7946,  6.5140,  6.0803,  ...,  8.8812,  8.5578,  8.0560]])

torchaudio res tensor([[ 7.7390,  6.6414,  5.9847,  ..., 11.2153, 10.8115, 10.6624],
        [ 8.3844,  7.3069,  5.6935,  ..., 11.3059, 11.9750, 11.1324],
        [ 5.9230,  4.5791,  6.4441,  ..., 11.5842, 12.4497, 11.9442],
        ...,
        [ 8.3075,  7.4419,  6.3531,  ...,  8.7440,  9.0616,  8.9001],
        [ 7.9940,  7.1240,  4.0873,  ...,  8.4048,  8.6729,  9.1240],
        [ 8.7946,  6.5140,  6.0803,  ...,  8.8812,  8.5578,  8.0560]])
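
The loading step described in this comment can be sketched as follows. This is not the poster's code, which was not shared. It uses only the standard-library `wave` module for the int16 read, and the helper names `load_int16_as_float` and `kaldi_style_fbank` are made up for illustration. Setting `dither=0.0` is an assumption made here so that the comparison is deterministic.

```python
import array
import os
import struct
import sys
import tempfile
import wave

def load_int16_as_float(path):
    """Read a mono 16-bit PCM WAV and return float samples, e.g. -3 -> -3.0."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("expected mono 16-bit PCM")
        sample_rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    samples = array.array("h", raw)  # signed 16-bit integers
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    return [float(s) for s in samples], sample_rate

def kaldi_style_fbank(samples, sample_rate, num_mel_bins=40):
    """Pass the unnormalized samples to torchaudio (requires torch and torchaudio)."""
    import torch
    import torchaudio
    waveform = torch.tensor(samples, dtype=torch.float32).unsqueeze(0)  # (1, time)
    return torchaudio.compliance.kaldi.fbank(
        waveform,
        num_mel_bins=num_mel_bins,
        sample_frequency=float(sample_rate),
        dither=0.0,  # assumption: dither disabled on both sides of the comparison
    )

# Round-trip check of the loader on a tiny file written with the standard library.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.wav")
    with wave.open(path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(16000)
        out.writeframes(struct.pack("<4h", -3, 0, 32767, -32768))
    samples, sample_rate = load_int16_as_float(path)
print(samples, sample_rate)
```

With the samples kept in this range, `kaldi_style_fbank(samples, sample_rate)` is the call to compare against compute-fbank-feats; Kaldi would need dither disabled as well for the numbers to line up.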

njusq avatar Dec 16 '21 02:12 njusq

Can you please share your code? it will be very useful!

Wonder1905 avatar Mar 29 '22 07:03 Wonder1905

Can you please share your code? it will be very useful!

@BattashB

Something like this

waveform, sample_rate = torchaudio.load(<file>)  # waveform is float32, value range [-1, 1]
waveform = waveform * (1 << 15)  # convert the value range to [-32,768., 32,767.]

mthrok avatar Mar 29 '22 17:03 mthrok