DeepSpeaker-pytorch
Minor issue with avg pooling
I could be wrong, since I normally don't do speech verification.
But in my case, running the training fails with the following error at the affine transform that maps the avg-pooled features to the embedding dimension.
"x = self.model.fc(x)" gives
RuntimeError: size mismatch, m1: [512 x 1024], m2: [2048 x 512] at /opt/conda/conda-bld/pytorch_1550813258230/work/aten/src/THC/generic/THCTensorMathBlas.cu:266
I think this is because the avg pooling is supposed to be over the temporal dimension by design, but in the committed version of the code it is done over the frequency dimension. The 2D avg pooling is supposed to give [F, T] = [4, 2] => [4, 1], but instead it gives [4, 2] => [1, 2]. As a result, the feature dimension after the view() call is half of what the model.fc layer expects (1024 instead of 2048).
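To make the shape arithmetic concrete, here is a small sketch with a toy tensor. It assumes a (batch, channels, freq, time) layout with 512 channels and a batch of 8; the kernel sizes and variable names are illustrative and not taken from the repository.

```python
import torch
import torch.nn.functional as F

# Toy feature map; the (batch, channels, freq, time) layout is an assumption.
x = torch.randn(8, 512, 4, 2)

# Pooling over the frequency axis collapses F: (4, 2) -> (1, 2).
freq_pooled = F.avg_pool2d(x, kernel_size=(4, 1))
# Pooling over the time axis collapses T: (4, 2) -> (4, 1).
time_pooled = F.avg_pool2d(x, kernel_size=(1, 2))

# Flatten everything except the batch dimension.
freq_flat = freq_pooled.view(x.size(0), -1)  # 512 * 1 * 2 = 1024 features
time_flat = time_pooled.view(x.size(0), -1)  # 512 * 4 * 1 = 2048 features

# A linear layer expecting 2048 inputs, like the m2: [2048 x 512] in the error.
fc = torch.nn.Linear(2048, 512)
out = fc(time_flat)  # works; fc(freq_flat) would raise a size mismatch

print(freq_flat.shape, time_flat.shape, out.shape)
```

Only the time-pooled tensor has the 2048 features that the linear layer expects, which matches the m1: [512 x 1024] vs m2: [2048 x 512] mismatch reported above.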
So I suggest
for myResNet.__init__()
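As a rough sketch only, and not necessarily the change proposed in this post, pooling over time can be set up in a model constructor as below. `TinyEmbedder` is a hypothetical stand-in for the repository's model; the layer sizes and the (batch, channels, freq, time) layout are assumptions.

```python
import torch
import torch.nn as nn


class TinyEmbedder(nn.Module):
    """Hypothetical stand-in for the repo's model; names and sizes are assumptions."""

    def __init__(self, channels=512, freq_bins=4, embedding_size=512):
        super().__init__()
        # Keep the frequency axis as-is (None) and average the time axis down to 1.
        self.avgpool = nn.AdaptiveAvgPool2d((None, 1))
        self.fc = nn.Linear(channels * freq_bins, embedding_size)

    def forward(self, x):
        # x: (batch, channels, freq, time), the layout assumed in this sketch.
        x = self.avgpool(x)        # -> (batch, channels, freq, 1)
        x = x.view(x.size(0), -1)  # -> (batch, channels * freq)
        return self.fc(x)


model = TinyEmbedder()
short = model(torch.randn(3, 512, 4, 2))  # 2 time steps
long = model(torch.randn(3, 512, 4, 7))   # 7 time steps
print(short.shape, long.shape)
```

Using adaptive pooling here is a design choice for the sketch: the input to the linear layer stays at channels * freq_bins regardless of how many time steps the utterance has.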
Again, I'm no expert in speech verification. If anybody has another idea on how to fix this bug, please let me know.
@hash2430 Hello! I found the same problem when reading the code. Have you tried avg pooling over the temporal dimension, and did the performance improve?