SupContrast
Question about SupConLoss implementation
Hi @HobbitLong. Thank you for the great work. I have one question about your implementation of SupConLoss, as I am comparing it with your paper.
In the supplementary material of your paper, Section 11 ("Effect of Number of Positives"), it is mentioned that the positives are also removed from the denominator of the loss function so they are not treated as negatives. However, on line 89 here, the sum is taken over the whole exp_logits, which still includes the positive samples in the denominator. Should we also mask out the positive samples from exp_logits (a simplified sketch of what I mean is included below), or am I understanding this incorrectly?
Thanks for your help.
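For reference, here is a minimal standalone sketch of how I read that computation (single view per sample, hypothetical variable names; the actual losses.py may differ in details):

```python
import torch

torch.manual_seed(0)
N, temperature = 8, 0.1
features = torch.nn.functional.normalize(torch.randn(N, 16), dim=1)
labels = torch.randint(0, 2, (N,))

logits = features @ features.T / temperature                # anchor-contrast similarities
logits_mask = 1 - torch.eye(N)                              # removes self-contrast only
mask = torch.eq(labels[:, None], labels[None, :]).float() * logits_mask  # positive pairs

# the sum below runs over everything except self, so the other positives of
# each anchor are still inside the denominator
exp_logits = torch.exp(logits) * logits_mask
log_prob = logits - torch.log(exp_logits.sum(1, keepdim=True))

# per-anchor loss: average of -log_prob over the positive positions
loss = -(mask * log_prob).sum(1) / mask.sum(1).clamp(min=1)
```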
Hi, please follow equation 2.
Your implementation does follow equation 2 exactly. Can you then help me understand what is meant by removing the positives from the denominator in supplementary Section 11? Many thanks.
This is a great question, and I stumbled upon the same thing. In equation 2 the denominator sums over all elements of A(i), but A(i) was previously defined simply as the set of augmented samples without the i-th augmented sample. So in the case of, e.g., binary labels, A(i) would still contain samples from both labels. But in the supervised contrastive loss we would then be treating our actual positive examples as negatives, since we use the same definition of A(i).
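For reference, equation (2) of the paper, with $A(i)$ denoting all indices other than $i$ and $P(i) = \{p \in A(i) : \tilde{y}_p = \tilde{y}_i\}$ the positives, reads (as I recall):

$$
\mathcal{L}^{sup}_{out} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)},
$$

so the denominator indeed runs over all of $A(i)$, which still contains the other positives.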
Any updates on this? Should the positive samples be masked out of the denominator sum?
Sadly no. But I think the positive samples need to be masked out, otherwise this would not make much sense. I tried to implement this loss with the positives masked out of the denominator and the loss went down in training, but I still have to tweak the hyperparameters.
I agree we need to mask out the positive samples in the denominator, since that is the motivation for using labels in contrastive learning, right? I also observe a lower training loss after masking out the positives, but I admit that for better downstream-task performance I still need to tweak the hyperparameters a lot, so it's not 100% better all the time. Anyway, maybe leaving them in the denominator creates some label noise that may benefit learning.
Also guys, please check out this recent paper, especially Section 3.3 (UniMoCo: Unsupervised, Semi-Supervised and Full-Supervised Visual Representation Learning, arXiv). They mask out the positives in the denominator and present some comparisons with the SupCon loss.
@vkhoi Hi, is this UniMoCo paper still on arXiv?
Hi, do you have any updates on your experiments?
Hi all, I did not realize there was such a fruitful discussion here.
My personal understanding is that eq. (2) is mathematically right, in that it's just a cross-entropy between two distributions. The target distribution should be 1/|P(i)| for the contrast samples that are positives, and 0 for those that are negatives. Note this target distribution is different from traditional ImageNet supervised learning, where the target is a one-hot distribution, simply because here we have |P(i)| positives, so we have to distribute the total mass of 1 over all |P(i)| of them. If you think of it this way, then it's immediately clear that the predicted distribution should be a softmax normalization over all contrast samples (both positives and negatives), and it's mathematically right to include all positives in the denominator.
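Spelling that out for a single anchor $i$: with target $q(a) = 1/|P(i)|$ for $a \in P(i)$ and $q(a) = 0$ otherwise, and predicted distribution $\hat{p}(a) = \exp(z_i \cdot z_a/\tau) / \sum_{a' \in A(i)} \exp(z_i \cdot z_{a'}/\tau)$, the cross-entropy is

$$
-\sum_{a \in A(i)} q(a)\,\log \hat{p}(a) = \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p/\tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a/\tau)},
$$

which is exactly the per-anchor term of equation (2).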
Practically, when we optimize equation (2), I believe the optimizer realizes that we want to maximize the inner product for positives and minimize it for negatives, whether or not we put the positive pairs in the denominator. So when we remove the other positives, as discussed in this thread, we end up with a different surrogate objective, which may not be as intuitive as equation (2), but such a surrogate still works. The practical difference between this surrogate loss and equation (2) remains to be studied. What I can personally confirm is that two years ago I trained on ImageNet with eq. (2) (combined with a momentum encoder trick), and I got > 79% accuracy. So it's probably safe to keep the positives in the denominator as well.
mark
Hi, I also tried to implement this loss, but I realized that we may need a different mask for each positive, since we only consider one positive in the numerator and denominator at a time. Can we avoid a for loop for such a loss? Otherwise it's not compute-friendly compared with the default implementation.
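For what it's worth, at least for the variant where each positive keeps only itself plus the anchor's negatives in its denominator, I think the for loop can be avoided: the negatives' contribution is identical for every positive of a given anchor, so it can be summed once per row and added back elementwise. A rough sketch under that assumption (single view per sample; all names here are hypothetical, not from the repo):

```python
import torch
import torch.nn.functional as F


def supcon_masked_denominator(features, labels, temperature=0.1):
    """Variant where, for each positive p of anchor i, the denominator contains
    only exp(z_i . z_p / t) plus the negatives of i (other positives masked out).

    features: [N, D] L2-normalized embeddings, labels: [N] integer class labels.
    """
    N = features.shape[0]
    logits = features @ features.T / temperature
    logits = logits - logits.max(dim=1, keepdim=True).values.detach()  # numerical stability

    self_mask = 1 - torch.eye(N, device=features.device)
    same_label = torch.eq(labels[:, None], labels[None, :]).float()
    pos_mask = same_label * self_mask       # positives, excluding self
    neg_mask = 1 - same_label               # negatives (self already dropped below)

    exp_logits = torch.exp(logits) * self_mask
    # the negatives' contribution is the same for every positive of anchor i,
    # so compute it once per row and broadcast -- no per-positive loop needed
    neg_sum = (exp_logits * neg_mask).sum(dim=1, keepdim=True)          # [N, 1]

    # log-prob of pair (i, p): logits_ip - log(exp(logits_ip) + sum over negatives of i)
    log_prob = logits - torch.log(torch.exp(logits) + neg_sum)

    # average over each anchor's positives; skip anchors that have none
    pos_counts = pos_mask.sum(dim=1)
    loss_per_anchor = -(pos_mask * log_prob).sum(dim=1) / pos_counts.clamp(min=1)
    return loss_per_anchor[pos_counts > 0].mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    feats = F.normalize(torch.randn(16, 128), dim=1)
    labels = torch.randint(0, 4, (16,))
    print(supcon_masked_denominator(feats, labels).item())
```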