
Wav2vec2.0 pretrained model gives different emission results for different input batch sizes.

Open yuekaizhang opened this issue 2 years ago • 12 comments

🐛 Describe the bug

I am trying to modify example/asr/librispeech_ctc_decode/inference.py to run in batch mode.

Here is my script: https://gist.github.com/yuekaizhang/f20904cfaf23e457a744f08ea19ce18e#file-inference_bug-py-L55

However, I found that the WER degrades substantially with larger batch sizes, from 8% (batch size 2) to ~20% (batch size 32). I checked the emission tensors and found that their values also change with batch_size.

To reproduce:

python3 inference_bug.py     --librispeech_path ./librispeech/     --split test-other     --model WAV2VEC2_ASR_BASE_960H     --beam-size 10     --lm-weight 1.74     --word-score 0.52
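For reference, here is a minimal self-contained sketch (independent of the gist; random tensors stand in for real LibriSpeech audio) that exposes the same discrepancy between zero-padded batched inference and single-sample inference:

```python
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model().eval()

# Two utterances of different lengths, zero-padded to a common length.
wav_a = torch.randn(16000 * 5)  # stand-ins for real audio
wav_b = torch.randn(16000 * 8)
batch = torch.zeros(2, 16000 * 8)
batch[0, : wav_a.numel()] = wav_a
batch[1] = wav_b

with torch.inference_mode():
    emission_single, _ = model(wav_a.unsqueeze(0))  # batch size 1, no padding
    emission_batched, _ = model(batch)              # batch size 2, zero-padded

# The leading frames cover the same audio, yet the values differ once zero
# padding is present in the batch.
n = emission_single.shape[1]
print(torch.allclose(emission_single[0], emission_batched[0, :n], atol=1e-4))
```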

Versions

Collecting environment information...
PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.25.2
Libc version: glibc-2.31

Python version: 3.8.10 (default, Nov 14 2022, 12:59:47) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-89-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: Tesla V100-DGXS-32GB
GPU 1: Tesla V100-DGXS-32GB
GPU 2: Tesla V100-DGXS-32GB
GPU 3: Tesla V100-DGXS-32GB

Nvidia driver version: 470.57.02
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.7.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
Stepping: 1
CPU MHz: 2647.738
CPU max MHz: 3600.0000
CPU min MHz: 1200.0000
BogoMIPS: 4397.28
Virtualization: VT-x
L1d cache: 640 KiB
L1i cache: 640 KiB
L2 cache: 5 MiB
L3 cache: 50 MiB
NUMA node0 CPU(s): 0-39
Vulnerability Itlb multihit: KVM: Vulnerable
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT vulnerable
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d

Versions of relevant libraries:
[pip3] k2==1.23.3.dev20230105+cuda11.7.torch1.13.1
[pip3] numpy==1.22.4
[pip3] torch==1.13.1
[pip3] torchaudio==2.0.0a0+3267c7e
[conda] Could not collect

yuekaizhang avatar Feb 23 '23 13:02 yuekaizhang

Hi @yuekaizhang, my guess is that the zero padding in the collate function introduces the discrepancy. To verify it, maybe you can use batch_size 1, pad each sample to the same length 16000 * 40, and compute the WER again.
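A minimal sketch of that check, assuming `model` and a (1, time) `waveform` tensor as in the inference script (the `pad_to_fixed` helper is hypothetical):

```python
import torch
import torch.nn.functional as F

TARGET_LEN = 16000 * 40  # pad every utterance to 40 s, as suggested above

def pad_to_fixed(waveform: torch.Tensor) -> torch.Tensor:
    """Zero-pad a (1, time) waveform to TARGET_LEN samples."""
    return F.pad(waveform, (0, TARGET_LEN - waveform.shape[1]))

# Run inference one utterance at a time (batch_size 1), each padded to 40 s.
# If the WER degrades the same way, the zero padding itself is the culprit.
with torch.inference_mode():
    emission, _ = model(pad_to_fixed(waveform))
```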

nateanl avatar Feb 23 '23 13:02 nateanl

> Hi @yuekaizhang, my guess is that the zero padding in the collate function introduces the discrepancy. To verify it, maybe you can use batch_size 1, pad each sample to the same length 16000 * 40, and compute the WER again.

Yes, you're right. I tried batch_size 1 with every sample padded to the same length 16000 * 40, and the results were wrong. I was wondering what value I should pad with if my input wavs have different lengths.

yuekaizhang avatar Feb 23 '23 13:02 yuekaizhang

Another way is to remove the output of the padded zeros from the emission. You can compute the mapping from waveform length to emission frame length, and drop the frames produced by the padded zeros before CTC decoding. The self-supervised learning training recipe has code to estimate this mapping that you can use as a reference: https://github.com/pytorch/audio/blob/main/examples/self_supervised_learning/data_modules/_utils.py#L360

Also, you need to make sure all waveform lengths are shorter than 40 seconds, to avoid losing frames in the collate function.
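A sketch of that mapping, assuming the checkpoint uses wav2vec2's standard convolutional feature-extractor configuration (the `emission_length` helper below is illustrative, not part of torchaudio):

```python
import torch

# Kernel sizes and strides of wav2vec2's default convolutional feature
# extractor (assumption: the standard base configuration).
CONV_LAYERS = [(10, 5)] + [(3, 2)] * 4 + [(2, 2)] * 2

def emission_length(num_samples: torch.Tensor) -> torch.Tensor:
    """Map raw waveform lengths (in samples) to emission frame counts."""
    lengths = num_samples
    for kernel, stride in CONV_LAYERS:
        lengths = torch.div(lengths - kernel, stride, rounding_mode="floor") + 1
    return lengths

# Trim the frames produced by the zero padding before CTC decoding, e.g.:
# valid = emission_length(true_wav_lens)
# hyps = [decoder(emission[i : i + 1, : valid[i]]) for i in range(batch_size)]
```

Note that torchaudio's Wav2Vec2Model.forward also accepts a lengths argument and returns the corresponding valid emission lengths, which gives the same mapping without hand-coding the convolution arithmetic.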

nateanl avatar Feb 23 '23 14:02 nateanl

> Another way is to remove the output of the padded zeros from the emission. You can compute the mapping from waveform length to emission frame length, and drop the frames produced by the padded zeros before CTC decoding. The self-supervised learning training recipe has code to estimate this mapping that you can use as a reference: https://github.com/pytorch/audio/blob/main/examples/self_supervised_learning/data_modules/_utils.py#L360
>
> Also, you need to make sure all waveform lengths are shorter than 40 seconds, to avoid losing frames in the collate function.

I think removing frames on the emission side doesn't work. In the screenshot below, the decoding batch size is 1 and the input wavs are all padded to 40 s, yet I still get some empty results. Your suggestion might work if the problem were only extra letters appended to the tail of the correct prediction.

[screenshot: batch-size-1 decoding output with 40 s padding, showing empty hypotheses]

yuekaizhang avatar Feb 23 '23 14:02 yuekaizhang

Also, for the empty results: if I apply torch.softmax to the emission and check every column, I find that the blank probability is above 0.95 for every frame. So my guess is that the emission is totally wrong when the input wavs are zero-padded.
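The check looks roughly like this (a sketch; it assumes the CTC blank token sits at index 0 of the vocabulary, which matches the label ordering of torchaudio's ASR bundles):

```python
import torch

probs = torch.softmax(emission, dim=-1)     # (batch, frames, vocab)
blank_probs = probs[..., 0]                 # blank probability per frame
print((blank_probs > 0.95).float().mean())  # close to 1.0 for broken utterances
```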

yuekaizhang avatar Feb 23 '23 14:02 yuekaizhang

There is a known issue with normalization: https://github.com/pytorch/audio/issues/2242

mthrok avatar Feb 23 '23 14:02 mthrok

Thanks, I see. The workaround could be to always set wav_lens = torch.tensor(waveforms.shape[1]).repeat(batch_size), then use @nateanl's method of removing frames on the emission side by calculating the real emission_lengths manually. Correct me if I am wrong.
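In code, the proposed workaround would look something like this (a sketch reusing the hypothetical emission_length helper from the earlier comment; true_sample_lens holds the unpadded lengths):

```python
import torch

batch_size, padded_samples = waveforms.shape

# Tell the model that every sample spans the full padded length...
wav_lens = torch.tensor(padded_samples).repeat(batch_size)
with torch.inference_mode():
    emission, _ = model(waveforms, wav_lens)

# ...then trim each emission to its real frame count before decoding.
true_frames = emission_length(true_sample_lens)
emissions = [emission[i, : true_frames[i]] for i in range(batch_size)]
```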

yuekaizhang avatar Feb 23 '23 14:02 yuekaizhang

As @mthrok pointed out, the only solution is setting batch_size to 1, to avoid the zero padding affecting the group normalization.

nateanl avatar Feb 23 '23 14:02 nateanl

We might be able to use MaskedTensor here.

cc @cpuhrsch

mthrok avatar Feb 23 '23 14:02 mthrok

I think NestedTensor might help too.

cpuhrsch avatar Feb 23 '23 15:02 cpuhrsch

cc @hwangjeff who was interested in this as well

xiaohui-zhang avatar Feb 28 '23 20:02 xiaohui-zhang