The influence of t (temperature) in the E5 Model paper
Describe Model I am using (UniLM, MiniLM, LayoutLM ...): E5
Hello. I am a student studying sentence similarity.
While reading the paper “Text Embeddings by Weakly-Supervised Contrastive Pre-training”, a question came up: the temperature t is set to 0.01. The SimCSE paper uses 0.05 for the sentence similarity (STS) task, and other papers use 0.02, but this paper uses 0.01. Could you explain what effects are achieved by lowering the temperature?
Hi @daegonYu ,
This is a hyperparameter for tuning. Empirically, we observe that a lower temperature leads to better performance, but it might cause training instability under float16 precision for large models. A lower temperature allows the logits to vary in a wider range and thus has more flexibility.
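For concreteness, here is a minimal sketch of how the temperature typically enters an InfoNCE loss with in-batch negatives. This is an illustration under my own assumptions (the function name and arguments are made up), not the actual E5 training code:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, passage_emb, temperature=0.01):
    # Normalize so the dot product equals cosine similarity in [-1, 1].
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    # Dividing by the temperature stretches the logits to [-1/t, 1/t],
    # e.g. [-100, 100] for t = 0.01.
    logits = q @ p.T / temperature
    # In-batch negatives: the i-th query should match the i-th passage.
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```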
“A lower temperature allows the logits to vary in a wider range and thus has more flexibility.” I interpreted this as saying that the embeddings can learn more diverse representations. But the FAQ at https://huggingface.co/intfloat/multilingual-e5-base says:
3. Why does the cosine similarity scores distribute around 0.7 to 1.0?
This is a known and expected behavior as we use a low temperature 0.01 for InfoNCE contrastive loss.
For text embedding tasks like text retrieval or semantic similarity, what matters is the relative order of the scores instead of the absolute values, so this should not be an issue.
If the embeddings can be expressed over a wider range, I would expect the cosine similarities to also be spread over a wide range, yet they are concentrated between 0.7 and 1.0. These two statements seem contradictory, and I find that hard to understand. Simply put, I wonder why lowering the temperature lets the model learn logits over a wider range.
The logits are calculated with cosine_similarity / t. Therefore, the logits will fall in [-100, 100] with t = 0.01 and [-50, 50] with t=0.02, etc.
However, this does not mean the learned cosine similarity will be in a wider range. On the contrary, the cosine similarity tends to concentrate as the temperature becomes lower.
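To make the two points above concrete, here is a small numeric sketch (the 0.85/0.80 cosine similarities are invented for illustration): with t = 0.01 the softmax over the scaled logits saturates on a tiny cosine gap, so the loss can be nearly minimized without spreading the cosine similarities over a wide range.

```python
import torch
import torch.nn.functional as F

# Hypothetical cosine similarities for one query: positive = 0.85, three negatives = 0.80.
cos_sims = torch.tensor([0.85, 0.80, 0.80, 0.80])

for t in (0.05, 0.01):
    probs = F.softmax(cos_sims / t, dim=0)
    print(f"t={t}: P(positive) = {probs[0].item():.2f}")

# t = 0.05: P(positive) ≈ 0.48 -> the loss still pushes the similarities apart.
# t = 0.01: P(positive) ≈ 0.98 -> a 0.05 cosine gap is already almost enough,
# which is consistent with the learned similarities clustering in a narrow band.
```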
All right, I understand what you said, but why does "the cosine similarity tend to concentrate as the temperature becomes lower"? Can you explain why this happens?