IS-BERT
Why use MI instead of InfoNCE as the loss function?
Hi, since you treat each sentence and its local context representations as positive examples, and all the local context representations from other sentences as negative examples, as is typically done in contrastive learning, why did you choose MI as the loss function instead of a conventional contrastive loss like InfoNCE? Is MI better than InfoNCE in this scenario? Thanks!
Hi, thanks for the interest. Actually, both InfoNCE and JSD can be used for MI estimation. I just found that JSD worked better when I was doing this work.
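For reference, here is a minimal sketch (not the IS-BERT code) of how the two objectives differ when computed over the same critic scores. It assumes a batch where `scores[i, j]` is the critic score between the global representation of sentence `i` and a local context representation from sentence `j`, so positives sit on the diagonal; all names here are illustrative.

```python
import torch
import torch.nn.functional as F

def jsd_mi_loss(scores: torch.Tensor) -> torch.Tensor:
    """Negative of the Jensen-Shannon MI lower bound (Deep InfoMax style):
    E_P[-softplus(-T)] - E_N[softplus(T)], with positives on the diagonal."""
    n = scores.size(0)
    pos_mask = torch.eye(n, dtype=torch.bool, device=scores.device)
    pos = -F.softplus(-scores[pos_mask]).mean()   # expectation over positive pairs
    neg = F.softplus(scores[~pos_mask]).mean()    # expectation over negative pairs
    return -(pos - neg)                           # minimize the negative bound

def infonce_loss(scores: torch.Tensor) -> torch.Tensor:
    """InfoNCE with in-batch negatives: classify the positive on the diagonal."""
    targets = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, targets)

# toy usage: random critic scores for a batch of 8 sentences
scores = torch.randn(8, 8)
print(jsd_mi_loss(scores), infonce_loss(scores))
```

Both maximize agreement between a sentence and its own local contexts against other sentences' contexts; they just correspond to different MI estimators (JSD vs. the NCE-based bound).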