DeepSpeaker-pytorch

Minor issue with avg pooling

Open hash2430 opened this issue 5 years ago • 1 comments

I could be wrong, since I normally don't do speech verification. But in my case, training fails at the affine transform that matches the embedding dimension after average pooling. The line "x = self.model.fc(x)" raises: RuntimeError: size mismatch, m1: [512 x 1024], m2: [2048 x 512] at /opt/conda/conda-bld/pytorch_1550813258230/work/aten/src/THC/generic/THCTensorMathBlas.cu:266

I think this is because avgpool is supposed to act on the temporal dimension by design, whereas in the committed version of the code the average pooling is done over the frequency dimension. avg_pool2d is supposed to give [F, T] = [4, 2] => [4, 1], but instead it gives [4, 2] => [1, 2]. Thus the dimension after torch.view is half of what the model.fc layer expects (1024 instead of 2048).
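The shape arithmetic behind that mismatch can be checked without running the model. The following is only a sketch in plain Python: the channel count of 512 is inferred from the error message (2048 / 4), the kernel sizes are assumptions since the committed pooling code is not quoted here, and the helper names are made up for illustration. It uses the standard output-size rule for pooling with no padding and stride equal to the kernel size.

```python
def pooled_size(size, kernel, stride=None):
    """Output length along one axis for pooling with no padding (floor mode)."""
    stride = kernel if stride is None else stride
    return (size - kernel) // stride + 1


def flattened_features(channels, freq, time, kernel):
    """Per-sample feature count after pooling a [freq, time] map and flattening."""
    k_freq, k_time = kernel
    return channels * pooled_size(freq, k_freq) * pooled_size(time, k_time)


CHANNELS = 512          # assumption: inferred from the error message (2048 / 4)
FREQ, TIME = 4, 2       # the [F, T] = [4, 2] example from the issue
FC_IN_FEATURES = 2048   # m2 is [2048 x 512], so fc expects 2048 inputs

# Hypothetical kernels: one collapses the frequency axis, the other the time axis.
over_freq = flattened_features(CHANNELS, FREQ, TIME, kernel=(FREQ, 1))  # [4, 2] -> [1, 2]
over_time = flattened_features(CHANNELS, FREQ, TIME, kernel=(1, TIME))  # [4, 2] -> [4, 1]

print("pool over frequency:", over_freq, "matches fc:", over_freq == FC_IN_FEATURES)
print("pool over time:     ", over_time, "matches fc:", over_time == FC_IN_FEATURES)
```

Under these assumptions, pooling over frequency leaves 1024 features per sample, which is the second dimension of m1 in the error, while pooling over time leaves the 2048 features that fc expects.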

[image]

So I suggest the following for myResNet.__init__():

[image]

Again, I'm no expert in speech verification. If anybody has another idea on how to fix this bug, please let me know.

hash2430, Sep 18 '19 06:09

@hash2430 Hello! I found the same problem when reading the code. Have you tried average pooling over the temporal dimension, and did the performance improve?

fangmq, Mar 10 '22 09:03