
Question about SupConLoss implementation

Open vkhoi opened this issue 4 years ago • 11 comments

Hi @HobbitLong. Thank you for the great work. I have a question about your implementation of SupConLoss, as I am comparing it with the paper.

In the supplementary of your paper, section 11 ("Effect of Number of Positives"), it is mentioned that the positives are also removed from the denominator of the loss function so they are not treated as negatives. However, on line 89 here, the sum is taken over the whole exp_logits, which still includes the positive samples in the denominator. Do we also need to mask out the positive samples from exp_logits, or am I understanding this incorrectly?
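To make the question concrete, here is a minimal, self-contained sketch of the step I mean (the variable names follow my reading of losses.py, but the tensors are dummies and the exact repo code may differ slightly):

```python
import torch

# Minimal sketch of the step in question (not the repo's exact code).
# logits is the [N, N] matrix of scaled similarities z_i . z_a / tau,
# logits_mask removes only the self-contrast term, and mask marks
# same-label (positive) pairs.
N = 4
labels = torch.tensor([0, 0, 1, 1])
logits = torch.randn(N, N)
logits_mask = 1.0 - torch.eye(N)                                   # drop only the anchor itself
mask = (labels[:, None] == labels[None, :]).float() * logits_mask  # other positives

exp_logits = torch.exp(logits) * logits_mask                       # positives are still in this sum
log_prob = logits - torch.log(exp_logits.sum(1, keepdim=True))     # denominator includes positives
mean_log_prob_pos = (mask * log_prob).sum(1) / mask.sum(1)
loss = -mean_log_prob_pos.mean()
```

The denominator here sums over everything except the anchor itself, which is what prompted my question.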

Thanks for your help.

vkhoi avatar Feb 02 '21 17:02 vkhoi

Hi, please follow equation 2.
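For reference, equation 2 (the per-anchor average over positives, with the denominator over all of A(i)) as I recall it; please check the paper for the exact typesetting:

```latex
\mathcal{L}^{sup}_{out}
  = \sum_{i \in I} \frac{-1}{|P(i)|}
    \sum_{p \in P(i)}
    \log \frac{\exp(z_i \cdot z_p / \tau)}
              {\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)}
```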

HobbitLong avatar Feb 02 '21 18:02 HobbitLong

Your implementation matches equation 2 exactly. Could you then help me understand what the supplementary's section 11 means by removing the positives from the denominator? Many thanks.

vkhoi avatar Feb 02 '21 19:02 vkhoi

This is a great question, and I stumbled over the same point. In equation 2, the sum in the denominator runs over all elements of A(i), but A(i) was previously defined as the set of all augmented samples except the i-th one. So with, e.g., binary labels, A(i) still contains samples from both classes, which means the supervised contrastive loss effectively treats our actual positive examples as negatives, since it uses the same definition of A(i).
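To spell out the two index sets as I recall the paper's notation: A(i) excludes only the anchor itself, while P(i) keeps the same-label samples within A(i), so P(i) ⊆ A(i) and the denominator of equation 2 ranges over positives and negatives alike.

```latex
A(i) \equiv I \setminus \{i\},
\qquad
P(i) \equiv \{\, p \in A(i) : \tilde{y}_p = \tilde{y}_i \,\}
```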

desch142 avatar Feb 16 '21 13:02 desch142

Any updates on this? Should the positive samples be masked out of the denominator sum?

dimitrismallis avatar Mar 06 '21 12:03 dimitrismallis

Sadly no. But I think the positive samples need to be masked out, otherwise this would not make much sense. I tried to implement the loss with the positives masked out of the denominator, and the loss went down during training, but I still have to tweak the hyperparameters.

desch142 avatar Mar 13 '21 12:03 desch142

I agree we need to mask out the positive samples in the denominator, since that is the whole point of using labels for contrastive learning, right? I also observe a lower training loss after masking out the positives, but I admit that for better downstream performance I still need to tweak the hyperparameters a lot, so it's not 100% better all the time. Anyway, maybe leaving them in the denominator creates some label noise that benefits learning.
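In equation form, the variant being discussed keeps only the matched positive and the negatives A(i) \ P(i) in each denominator; this is my own write-up of the idea, not a formula from the SupCon paper:

```latex
\mathcal{L}_{masked}
  = \sum_{i \in I} \frac{-1}{|P(i)|}
    \sum_{p \in P(i)}
    \log \frac{\exp(z_i \cdot z_p / \tau)}
              {\exp(z_i \cdot z_p / \tau)
               + \sum_{n \in A(i) \setminus P(i)} \exp(z_i \cdot z_n / \tau)}
```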

Also, please check out this recent paper, especially Sec. 3.3: UniMoCo: Unsupervised, Semi-Supervised and Full-Supervised Visual Representation Learning (arXiv). They mask out the positives in the denominator and present some comparisons with the SupCon loss.

vkhoi avatar Apr 01 '21 20:04 vkhoi

@vkhoi Hi, is this UniMoCo paper still on arXiv?

YangJae96 avatar Apr 17 '22 12:04 YangJae96

Sadly no. But I think the positive samples need to be masked out, otherwise this would not make much sense. I tried to implement the loss with the positives masked out of the denominator, and the loss went down during training, but I still have to tweak the hyperparameters.

Hi, do you have any updates on your experiments?

liangzimei avatar Jul 13 '22 06:07 liangzimei

Hi, all, I did not realize there is fruitful discussion here.

My personal understanding is that eq (2) is mathematically right, in that it is just a cross-entropy between two distributions. The target distribution puts mass 1/|P(i)| on each positive and 0 on each negative. Note that this target is different from traditional ImageNet supervised learning, where the target is a one-hot distribution, simply because here we have |P(i)| positives and have to spread the total mass of 1 over all of them. If you think of it this way, it is immediately clear that the predicted distribution should be a softmax normalization over all samples in A(i) (both positives and negatives), so it is mathematically right to include all positives in the denominator.
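In symbols, one way to write this reading down (my own notation, not the paper's): the per-anchor loss is the cross-entropy between a target distribution q_i that spreads its mass uniformly over the positives and the softmax p_i over all of A(i), and expanding it recovers the per-anchor term of equation 2.

```latex
q_i(a) = \begin{cases} 1/|P(i)| & a \in P(i) \\ 0 & a \in A(i) \setminus P(i) \end{cases},
\qquad
p_i(a) = \frac{\exp(z_i \cdot z_a / \tau)}{\sum_{a' \in A(i)} \exp(z_i \cdot z_{a'} / \tau)},
\qquad
\mathcal{L}_i = -\sum_{a \in A(i)} q_i(a) \log p_i(a)
```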

Practically, when we optimize equation (2), I believe the optimizer realizes that we want to maximize the inner product for positives and minimize it for negatives, whether or not we put the positive pairs in the denominator. So when removing the other positives, as this thread discusses, we end up with a different surrogate objective that may not be as intuitive as equation 2, but such a surrogate still works. The practical difference between this surrogate loss and equation 2 remains to be studied. What I can personally confirm is that two years ago I trained on ImageNet with eq 2 (combined with a momentum encoder trick) and got > 79% accuracy, so it is probably safe to keep the positives in the denominator as well.

HobbitLong avatar Jul 13 '22 07:07 HobbitLong

mark

linshierge avatar Oct 10 '22 08:10 linshierge

Sadly no. But I think the positive samples need to be masked out, otherwise this would not make much sense. I tried to implement the loss with the positives masked out of the denominator, and the loss went down during training, but I still have to tweak the hyperparameters.

Hi, I also tried to implement this loss, but I realized we may need a different mask for each positive, since only one positive appears in the numerator and denominator at a time. Can we avoid a for loop for such a loss? Otherwise it is not compute-friendly compared with the default implementation.
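One way to avoid the loop is broadcasting: precompute each row's negative-only sum once, then add back each positive's own exp term so every (anchor, positive) pair gets its own denominator. A hedged, self-contained sketch (my own code, not the repo's):

```python
import torch

# Loop-free variant with the other positives removed from the denominator:
# for each anchor i and positive p, the denominator is exp(z_i . z_p / tau)
# plus the sum over the negatives of row i only.
N = 4
labels = torch.tensor([0, 0, 1, 1])      # chosen so every anchor has at least one positive
logits = torch.randn(N, N)               # z_i . z_a / tau
logits_mask = 1.0 - torch.eye(N)         # drop self-contrast
same = (labels[:, None] == labels[None, :]).float()
pos_mask = same * logits_mask            # other positives
neg_mask = 1.0 - same                    # negatives (diagonal is already excluded)

exp_logits = torch.exp(logits)
neg_sum = (exp_logits * neg_mask).sum(1, keepdim=True)   # [N, 1], one value per anchor
log_prob = logits - torch.log(exp_logits + neg_sum)      # per-(i, p) denominator via broadcasting
loss = -((pos_mask * log_prob).sum(1) / pos_mask.sum(1)).mean()
```

The extra cost over the default implementation is only the elementwise add and log over an [N, N] tensor, so it stays comparable in compute.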

jlliRUC avatar Apr 16 '23 11:04 jlliRUC