SupContrast
Question about SupConLoss implementation
Hi @HobbitLong. Thank you for the great work. I have one question about your implementation of SupConLoss, as I am comparing it with your paper.
In the supplementary material of your paper, Section 11 ("Effect of Number of Positives"), it is mentioned that the positives are also removed from the denominator of the loss function so they are not treated as negatives. However, on line 89 here, the sum is taken over the whole exp_logits, which still includes the positive samples in the denominator. Should we also mask out the positive samples from exp_logits (a simplified sketch of what I mean is included below), or am I understanding this incorrectly?
Thanks for your help.
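For reference, here is a minimal standalone sketch of how I read that computation (single view per sample, hypothetical variable names; the actual losses.py may differ in details):

```python
import torch

torch.manual_seed(0)
N, temperature = 8, 0.1
features = torch.nn.functional.normalize(torch.randn(N, 16), dim=1)
labels = torch.randint(0, 2, (N,))

logits = features @ features.T / temperature                # anchor-contrast similarities
logits_mask = 1 - torch.eye(N)                              # removes self-contrast only
mask = torch.eq(labels[:, None], labels[None, :]).float() * logits_mask  # positive pairs

# the sum below runs over everything except self, so the other positives of
# each anchor are still inside the denominator
exp_logits = torch.exp(logits) * logits_mask
log_prob = logits - torch.log(exp_logits.sum(1, keepdim=True))

# per-anchor loss: average of -log_prob over the positive positions
loss = -(mask * log_prob).sum(1) / mask.sum(1).clamp(min=1)
```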
Hi, please follow equation 2.
Your implementation does follow equation 2 exactly. Can you then help me understand what is meant by removing the positives from the denominator in supplementary Section 11? Many thanks.
This is a great question, and I stumbled upon the same thing. In equation 2 the denominator sums over all elements of A(i), but A(i) was previously defined simply as the set of augmented samples without the i-th augmented sample. So in the case of, e.g., binary labels, A(i) would still contain samples from both labels. But in the supervised contrastive loss we would then be treating our actual positive examples as negatives, since we use the same definition of A(i).
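For reference, equation (2) of the paper, with $A(i)$ denoting all indices other than $i$ and $P(i) = \{p \in A(i) : \tilde{y}_p = \tilde{y}_i\}$ the positives, reads (as I recall):

$$
\mathcal{L}^{sup}_{out} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)},
$$

so the denominator indeed runs over all of $A(i)$, which still contains the other positives.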
Any updates on this? Should the positive samples be masked out of the denominator sum?
Sadly no. But I think the positive samples need to be masked out, otherwise this would not make much sense. I tried to implement this loss with the positives masked out of the denominator and the loss went down in training, but I still have to tweak the hyperparameters.
I agree we need to mask out the positive samples in the denominator, since that is the motivation for using labels in contrastive learning, right? I also observe a lower training loss after masking out the positives, but I admit that for better downstream-task performance I still need to tweak the hyperparameters a lot, so it's not 100% better all the time. Anyway, maybe leaving them in the denominator creates some label noise that may benefit learning.
Also guys, please check out this recent paper, especially Section 3.3 (UniMoCo: Unsupervised, Semi-Supervised and Full-Supervised Visual Representation Learning, arXiv). They mask out the positives in the denominator and present some comparisons with the SupCon loss.
@vkhoi Hi, is this UniMoCo paper still on arXiv?
Hi, do you have any updates on your experiments?
Hi all, I did not realize there was such a fruitful discussion here.
My personal understanding is that eq. (2) is mathematically right, in that it's just a cross-entropy between two distributions. The target distribution should be 1/|P(i)| for the contrast samples that are positives, and 0 for those that are negatives. Note this target distribution is different from traditional ImageNet supervised learning, where the target is a one-hot distribution, simply because here we have |P(i)| positives, so we have to distribute the total mass of 1 over all |P(i)| of them. If you think of it this way, then it's immediately clear that the predicted distribution should be a softmax normalization over all contrast samples (both positives and negatives), and it's mathematically right to include all positives in the denominator.
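Spelling that out for a single anchor $i$: with target $q(a) = 1/|P(i)|$ for $a \in P(i)$ and $q(a) = 0$ otherwise, and predicted distribution $\hat{p}(a) = \exp(z_i \cdot z_a/\tau) / \sum_{a' \in A(i)} \exp(z_i \cdot z_{a'}/\tau)$, the cross-entropy is

$$
-\sum_{a \in A(i)} q(a)\,\log \hat{p}(a) = \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p/\tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a/\tau)},
$$

which is exactly the per-anchor term of equation (2).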
Practically, when we optimize equation (2), I believe the optimizer realizes that we want to maximize the inner product for positives and minimize it for negatives, whether or not we put the positive pairs in the denominator. So when we remove the other positives, as discussed in this thread, we end up with a different surrogate objective, which may not be as intuitive as equation (2), but such a surrogate still works. The practical difference between this surrogate loss and equation (2) remains to be studied. What I can personally confirm is that two years ago I trained on ImageNet with eq. (2) (combined with a momentum encoder trick), and I got > 79% accuracy. So it's probably safe to keep the positives in the denominator as well.
mark
Hi, I also tried to implement this loss, but I realized that we may need a different mask for each positive, since we only consider one positive in the numerator and denominator at a time. Can we avoid a for loop for such a loss? Otherwise it's not compute-friendly compared with the default implementation.
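For what it's worth, at least for the variant where each positive keeps only itself plus the anchor's negatives in its denominator, I think the for loop can be avoided: the negatives' contribution is identical for every positive of a given anchor, so it can be summed once per row and added back elementwise. A rough sketch under that assumption (single view per sample; all names here are hypothetical, not from the repo):

```python
import torch
import torch.nn.functional as F


def supcon_masked_denominator(features, labels, temperature=0.1):
    """Variant where, for each positive p of anchor i, the denominator contains
    only exp(z_i . z_p / t) plus the negatives of i (other positives masked out).

    features: [N, D] L2-normalized embeddings, labels: [N] integer class labels.
    """
    N = features.shape[0]
    logits = features @ features.T / temperature
    logits = logits - logits.max(dim=1, keepdim=True).values.detach()  # numerical stability

    self_mask = 1 - torch.eye(N, device=features.device)
    same_label = torch.eq(labels[:, None], labels[None, :]).float()
    pos_mask = same_label * self_mask       # positives, excluding self
    neg_mask = 1 - same_label               # negatives (self already dropped below)

    exp_logits = torch.exp(logits) * self_mask
    # the negatives' contribution is the same for every positive of anchor i,
    # so compute it once per row and broadcast -- no per-positive loop needed
    neg_sum = (exp_logits * neg_mask).sum(dim=1, keepdim=True)          # [N, 1]

    # log-prob of pair (i, p): logits_ip - log(exp(logits_ip) + sum over negatives of i)
    log_prob = logits - torch.log(torch.exp(logits) + neg_sum)

    # average over each anchor's positives; skip anchors that have none
    pos_counts = pos_mask.sum(dim=1)
    loss_per_anchor = -(pos_mask * log_prob).sum(dim=1) / pos_counts.clamp(min=1)
    return loss_per_anchor[pos_counts > 0].mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    feats = F.normalize(torch.randn(16, 128), dim=1)
    labels = torch.randint(0, 4, (16,))
    print(supcon_masked_denominator(feats, labels).item())
```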