PerceptualAudio_Pytorch
issue with forward loop in the models.py
Hi Adrien, I was just going through your codebase and found something I wanted to confirm. In models.py, line 139, you take the CE loss between the output of the Classification network and the actual label. Is that correct? In my model, I take the softmax of the outputs of the Classification network and then take the CE loss with the actual labels. I just wanted to make sure you follow the same regime. Does that make sense?
Thanks! Pranay
Hi Pranay,
Thank you for looking into it and for all the useful feedback so far. I will update the code and upload a pretrained model with the final test scores I got. I will let you know when it is done; I will try to do that by tomorrow.
About your question on line 139: loss = self.CE(pred, labels.squeeze())
pred is the raw output of the classifier, unbounded (before softmax). self.CE is PyTorch's nn.CrossEntropyLoss: "This criterion combines nn.LogSoftmax() and nn.NLLLoss() in one single class."
labels is the ground-truth binary human rating.
Correct me if I am wrong, but this should follow the training regime you apply in your paper?
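To illustrate (a quick sketch with made-up shapes, not the exact code from models.py): since nn.CrossEntropyLoss applies the log-softmax internally, it gets the raw logits; taking the softmax first and then applying nn.CrossEntropyLoss would apply the softmax twice.

```python
import torch
import torch.nn as nn

batch, n_classes = 4, 2
pred = torch.randn(batch, n_classes)               # raw classifier output (logits), unbounded
labels = torch.randint(0, n_classes, (batch, 1))   # ground-truth binary ratings, shape [batch, 1]

# nn.CrossEntropyLoss combines LogSoftmax and NLLLoss, so it takes the raw logits.
loss = nn.CrossEntropyLoss()(pred, labels.squeeze())

# Equivalent formulation with the (log-)softmax written out explicitly:
loss_explicit = nn.NLLLoss()(torch.log_softmax(pred, dim=1), labels.squeeze())
assert torch.allclose(loss, loss_explicit)
```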
Thanks ! Adrien
I updated the code and tried to clarify in the readme some parameters I added.
pretrain.py should allow loading one of 3 models that were trained on different subsets, with final test losses close to ~0.55; although that is borderline with respect to the target loss you recommended, I did not manage to get below it so far!
If you spot any mistakes or weird behavior, or run into issues running the code, please let me know and I will fix it.
Thank you for trying it out; hopefully we end up with a well-working PyTorch version! Adrien
Hi Adrien, Thanks for the comments. I am not too familiar with PyTorch, so your comment above makes complete sense; that is what I was trying to refer to in my paper. Thanks for the clarification! As for comments on the codebase: first, thanks a lot for putting it together, I really appreciate the effort you put in. I found one weird behaviour when using bs (batch_size = 1). If you look at line 143 of models.py, you can write torch.squeeze(labels, -1), which squeezes only that one dimension. Everything else looks good to me so far. I also ran the code on my sample inputs (in my repo, under the sample_audio folder), and those results seemed to make sense as well. I will definitely investigate this codebase further, but as of now everything looks great! Would it be okay with you if I post a link to your GitHub repo on my GitHub page, so that people who use PyTorch could give it a try?
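To make the batch_size = 1 case concrete (just a shape sketch, not your actual code):

```python
import torch

labels = torch.zeros(1, 1, dtype=torch.long)   # [batch=1, label=1]

print(labels.squeeze().shape)                  # torch.Size([])  -> every size-1 dim removed, 0-dim tensor
print(torch.squeeze(labels, -1).shape)         # torch.Size([1]) -> only the last dim removed

# nn.CrossEntropyLoss expects a [batch] target for a [batch, classes] prediction,
# so the 0-dim tensor from .squeeze() no longer matches when batch_size == 1.
```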
Thanks, Pranay
Hi Pranay,
Thank you for taking the time to look into the PyTorch code. On my side I am not so familiar with TF, so I do my best when I have to read it, which is not always easy.
I modified models.py, but given how I defined the input shapes it should be equivalent to use squeeze(1) or squeeze(-1). I guess the confusion may come from my odd choice of shaping labels as [batch, label] instead of a flat tensor [batch]? It is a habit: whatever the number of features/labels, I always keep the first dimension for the batch size and the additional dimensions for the sample sizes.
In pretrain.py I actually do a forward "check" with a dummy batch of size 1, but I input a fake label of shape [1,1], and it seems to forward correctly. Or are you still having issues with it?
About the runs you did with the pretrained models I provide: do they seem to indicate that the training is decently successful? I did not implement any of the further evaluations you develop in the paper, so I cannot compare the result quality with your results.
And of course, I am happy to be linked in your official repository! It would be good to have other PyTorch users experiment with this base and maybe train better models than what I got so far.
Thanks, Adrien
Hi Adrien,
I was reviewing the codebase and have one more observation, about models.py line 71: diff.shape[0]/diff.shape[1]. Shouldn't it be diff.shape[1]/diff.shape[2]?
The idea is to take the mean of the feature maps. You use torch.sum and divide by the batch_size and the audio_length, whereas you should be dividing by the audio_length and the channel size, since we are going from [batch_size, audio_length, channels] to [batch_size]. Does that look okay?
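Something like this is what I mean (a rough sketch with dummy shapes; diff stands for the per-layer absolute feature difference, and the "current" line is only an approximation of what models.py does):

```python
import torch

batch_size, audio_length, channels = 4, 16000, 32
diff = torch.rand(batch_size, audio_length, channels)   # |features_ref - features_deg| for one layer

# current behaviour (roughly): sum everything and divide by batch_size * audio_length
current = torch.sum(diff) / (diff.shape[0] * diff.shape[1])

# intended: average each example over time and channels -> one distance per example
dist = torch.sum(diff, dim=(1, 2)) / (diff.shape[1] * diff.shape[2])
print(dist.shape)   # torch.Size([4]), i.e. [batch_size]
# equivalently: dist = diff.mean(dim=(1, 2))
```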
Thanks, Pranay
Hi Pranay,
Thank you for pointing this out. Equation (1) in the paper only divides by the time dimension; it indeed does not make sense that I also divided by the batch size, since that dimension is not reduced, and the CE loss averages over the batch afterwards.
I am correcting models.py line 71 according to your suggestion, so that the average is also taken over the channel dimension.
I am launching another training with this correction to see how it goes.
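For the record, a quick check that the batch averaging indeed happens inside the loss (nn.CrossEntropyLoss defaults to reduction='mean'):

```python
import torch
import torch.nn as nn

pred = torch.randn(4, 2)               # [batch, classes] classifier output
labels = torch.randint(0, 2, (4,))     # [batch] ground-truth ratings

per_pair = nn.CrossEntropyLoss(reduction='none')(pred, labels)   # shape [4], one loss per pair
averaged = nn.CrossEntropyLoss()(pred, labels)                   # default reduction='mean'
assert torch.allclose(averaged, per_pair.mean())
```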
Thanks, Adrien
Hi Pranay,
I updated pretrain.py and replaced the pretrained models with two that were trained with the corrected distance averaging (over the time and channel dimensions).
The models are 'dataset_combined_linear' and 'dataset_combined_linear_tshrink'; both were trained on the combined+linear subsets, and the second one uses a tanhshrink activation on the distance.
Test losses are respectively 0.569 and 0.564; I put the performance details in a comment in pretrain.py. The tanhshrink activation tends to push the distance for similar audio closer to 0 and gives a bigger average ratio between label-0 and label-1 pairs, which maybe indicates that it is useful to have this saturation close to 0 and an increasing gradient as the distance grows .. I am not sure, though; I included both, but that was my "intuition" for adding an activation to the distance.
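In case it is useful, here is roughly what the activation does to the distance (a sketch, not the exact code from models.py):

```python
import torch
import torch.nn.functional as F

dist = torch.linspace(0.0, 3.0, steps=7)   # per-pair distances, always >= 0
shrunk = F.tanhshrink(dist)                # tanhshrink(x) = x - tanh(x)

# Near 0 the output is ~0 and almost flat (saturation for similar pairs),
# while the slope grows towards 1 as the distance increases.
print(torch.stack([dist, shrunk], dim=1))
```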
If you cannot load them, please let me know about any troubles. And if you have time to give them a try, I am happy to hear about it. I am still running a few trainings, to see if I can get a better test CE loss, which is still imperfect ..
Thanks, Adrien