
Ranking Loss Question

Open · griff4692 opened this issue

Hi - Thanks for the great code. I've been trying to re-implement BRIO in my HuggingFace fork, but have been unable to get it to work.

I'm curious what this line in RankingLoss is doing:

TotalLoss = loss_func(score, score, ones)

One possibility is that I haven't yet included the gold reference as part of the ranking loss, which might explain why the contrastive loss is causing the gold-standard MLE loss to rise too high. I will add that, but I was also curious about the function above. Thank you!!

griff4692 avatar Aug 24 '22 14:08 griff4692

I also had a question about

loss_func = torch.nn.MarginRankingLoss(margin * i)

In the paper, it says

is the margin multiplied by the difference in rank between the candidates

It appears that the margin is based solely on the rank (index) of the higher-rated candidate. Is this correct?
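(Editorial note: the semantics in question can be sketched in pure Python. This is an illustrative reconstruction, not the BRIO code itself; the function name and values are hypothetical.)

```python
# Pure-Python sketch of torch.nn.MarginRankingLoss semantics for a
# single pair with target y: loss = max(0, -y * (x1 - x2) + margin).
def margin_ranking_loss(x1, x2, y, margin=0.0):
    return max(0.0, -y * (x1 - x2) + margin)

# In the BRIO loop, candidates are sorted by quality and, for each gap i,
# the pair (score[:-i], score[i:]) is compared with margin = margin * i.
# So the effective margin grows with the rank *difference* i between the
# two candidates, not with the absolute rank of the higher-rated one.
```

For example, with a margin of 0.1 and target y = 1, a pair scored (0.5, 0.2) incurs zero loss, while a pair scored (0.5, 0.45) violates the margin and incurs a loss of 0.05.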

griff4692 avatar Aug 24 '22 15:08 griff4692

Hi, thank you for your interest in our work.

I wanted to note that this loss function is adapted from MatchSum.

For TotalLoss, their explanation is that it is there to avoid the case where some special samples would never enter the following for loop. I always think of it as just a placeholder.
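(Editorial note: under that reading, the overall structure of the ranking loss can be sketched in pure Python. This is a hedged reconstruction of the shape of the BRIO/MatchSum loss, not the actual code; it omits batching and tensors, and assumes `scores` are candidate scores already sorted from best to worst.)

```python
def ranking_loss(scores, margin=0.01):
    def mrl(x1, x2, m):
        # MarginRankingLoss for one pair with target y = 1
        return max(0.0, -(x1 - x2) + m)

    # Placeholder term: comparing `scores` with itself under margin 0
    # always contributes 0, mirroring `loss_func(score, score, ones)`.
    total = sum(mrl(s, s, 0.0) for s in scores)

    # Pairwise terms: for each rank gap i, require the better candidate
    # to beat the worse one by at least margin * i.
    for i in range(1, len(scores)):
        for hi, lo in zip(scores[:-i], scores[i:]):
            total += mrl(hi, lo, margin * i)
    return total
```

With well-separated, correctly ordered scores the loss is zero; tied scores are penalized by the pairwise margin, while the placeholder term always contributes nothing.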

For your second question about the margin, please refer to this thread: https://github.com/yixinL7/SimCLS/issues/6.

Please let me know if you have more questions.

yixinL7 avatar Sep 06 '22 16:09 yixinL7

Ahh thanks Yixin -

Yes, I've noticed it's the same pairwise calculation from MatchSum. I see now regarding TotalLoss -- I just wanted to make sure it was meant to be an empty (zero) calculation.

I'm curious if you have any data comparing this pairwise ranking with other objectives:

Contrastive loss: align positives in decoder latent space (CLIFF)
ConSeq (unlikelihood) loss: CONSEQ

I'm working on a comparison of methods / metrics / positive-negative selection strategies, though not for news summarization. It will be interesting to see whether adjusting the likelihood (as in unlikelihood training and BRIO) is more effective than simply aligning positive decoder states (as in the CLIFF paper and other non-summarization contrastive learning papers).

griff4692 avatar Sep 06 '22 19:09 griff4692

Hi Griffin, I have also found this comparison very interesting! My guess is adjusting the likelihood has a more direct impact on the decoding output than adjusting the latent representation, but I haven't tried to compare them empirically myself. I'm looking forward to seeing your work on this!

yixinL7 avatar Sep 12 '22 03:09 yixinL7