easy-bert
Results difficult to explain
Dear Rob, I do not know whether this is a bug or not, but I am getting strange results, as follows. I compare the embeddings of two words, and the average absolute difference (over the 768 values) is lower for a different word than for a synonym.
I would have expected a lower difference for rich and a greater one for poor. Where am I going wrong? Thank you.
Example 1:
String 1: wealthy
String 2: poor
Embedding 1 Embedding 2 100 * absolute difference
0.21383394 0.23239951 2.0
-0.0073103756 -0.057594057 5.0
0.09099525 0.11997495 3.0
...
Average absolute difference (×100) : 8
Example 2:
String 1: wealthy
String 2: blue
Embedding 1 Embedding 2 100 * absolute difference
0.21383394 0.29995522 9.0
-0.0073103756 -0.19767939 19.0
...
Average absolute difference (×100) : 16
Example 3:
String 1: wealthy
String 2: rich
Embedding 1 Embedding 2 100 * absolute difference
0.21383394 0.14642045 7.0
-0.0073103756 -0.108990476 10.0
0.09099525 0.25123212 16.0
0.069340415 -0.12602457 20.0
...
Average absolute difference (×100) : 11
Example 4:
String 1: wealthy
String 2: black
Embedding 1 Embedding 2 100 * absolute difference
0.21383394 0.22277042 1.0
-0.0073103756 -0.25720397 25.0
0.09099525 0.16640717 8.0
...
Average absolute difference (×100) : 11
Hey,
Are you using the token embeddings or the sequence embeddings in this case?
In my experience, the BERT sequence embeddings in particular (but sometimes also the token embeddings) don't do as good a job in raw distance calculations for semantic similarity as some other models. This is basically just a result of the tasks BERT is trained for and the transformer architecture it uses. Generally you might have better luck with cosine distance, as you won't have to worry about effects from embedding magnitudes.
That said, if you're looking to do this sort of thing (especially with individual words), you might want to look into a different model like Universal Sentence Encoder, ELMo, GloVe, etc. that's designed to better support semantic similarity with simple distance metrics.
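For what it's worth, here's a rough, untested sketch of the cosine comparison using the sequence embeddings (the model path is just the example one from the README):

```java
import com.robrua.nlp.bert.Bert;

public class CosineExample {
    // Cosine similarity between two equal-length embedding vectors.
    // Normalizing by the magnitudes removes the scale effects that
    // raw Euclidean/Manhattan distances are sensitive to.
    static double cosineSimilarity(float[] a, float[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) throws Exception {
        try (Bert bert = Bert.load("com/robrua/nlp/easy-bert/bert-uncased-L-12-H-768-A-12")) {
            float[] a = bert.embedSequence("wealthy");
            float[] b = bert.embedSequence("rich");
            System.out.println("cosine similarity = " + cosineSimilarity(a, b));
        }
    }
}
```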
Hi Rob!
Thanks for your feedback.
I am using the sequence embedding, which returns a float[]. Token embeddings return a float[][], and I don't know what to do with that to calculate a cosine similarity. Any idea?
Following your advice, the cosine similarity does indeed seem more reliable than a Manhattan or Euclidean distance, as the Series 1 examples below show. Note, however, that in Series 1 poor still gets a slightly higher score (similarity) with rich than wealthy does. And it remains unsatisfying in the Series 2 examples, where poor is far closer to wealthy than rich is. Unless the situation improves with token embeddings, I'm not sure that similarity in an n-dimensional space lets me conclude precisely that two words share the same meaning.
I will try other models (which may be more difficult to use from Java, but that is another story).
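For reference, here is roughly how I compute the two distances on the 768-value embeddings (a quick sketch; the cosine similarity is computed as in your snippet above):

```java
public class EmbeddingDistances {
    // Manhattan (L1) distance: sum of absolute coordinate differences.
    static double manhattanDistance(float[] a, float[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    // Euclidean (L2) distance: square root of the sum of squared differences.
    static double euclideanDistance(float[] a, float[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}
```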
===================================================
EXAMPLES - FIRST SERIES
===================================================
Strings 1 & 2 : rich wealthy
Embedding 1 Embedding 2 |Difference|
0.14642045 0.21383394 0.06741349399089813
-0.108990476 -0.0073103756 0.10168009996414185
0.25123212 0.09099525 0.16023686528205872
...
Cosine similarity = 0.8456862351607601
Manhattan distance : 87.27405425067991
Euclidean distance : 4.0416941747361665
===================================================
Strings 1 & 2 : rich poor
Embedding 1 Embedding 2 |Difference|
0.14642045 0.23239951 0.08597905933856964
-0.108990476 -0.057594057 0.05139641836285591
0.25123212 0.11997495 0.13125717639923096
...
Cosine similarity = 0.8495916385797545
Manhattan distance : 87.00668240943924
Euclidean distance : 4.075972602921788
===================================================
Strings 1 & 2 : rich yellow
Embedding 1 Embedding 2 |Difference|
0.14642045 0.22856697 0.0821465253829956
-0.108990476 -0.30353695 0.19454647600650787
0.25123212 0.27586222 0.024630099534988403
...
Cosine similarity = 0.8746363539860872
Manhattan distance : 83.51396567281336
Euclidean distance : 3.9368254652299233
===================================================
Strings 1 & 2 : rich blue
Embedding 1 Embedding 2 |Difference|
0.14642045 0.29995522 0.15353477001190186
-0.108990476 -0.19767939 0.0886889100074768
0.25123212 0.30732605 0.05609393119812012
...
Cosine similarity = 0.8479855400362315
Manhattan distance : 97.23932060459629
Euclidean distance : 4.635928430207843
===================================================
Strings 1 & 2 : rich dumb
Embedding 1 Embedding 2 |Difference|
0.14642045 0.12908244 0.01733800768852234
-0.108990476 -0.031146867 0.07784360647201538
0.25123212 0.24095681 0.010275304317474365
...
Cosine similarity = 0.8809615766086131
Manhattan distance : 77.28109940420836
Euclidean distance : 3.688068948233406
===================================================
EXAMPLES - SECOND SERIES
===================================================
Strings 1 & 2 : wealthy rich
Embedding 1 Embedding 2 |Difference|
0.21383394 0.14642045 0.06741349399089813
-0.0073103756 -0.108990476 0.10168009996414185
0.09099525 0.25123212 0.16023686528205872
...
Cosine similarity = 0.8456862351607601
Manhattan distance : 87.27405425067991
Euclidean distance : 4.0416941747361665
===================================================
Strings 1 & 2 : wealthy poor
Embedding 1 Embedding 2 |Difference|
0.21383394 0.23239951 0.01856556534767151
-0.0073103756 -0.057594057 0.050283681601285934
0.09099525 0.11997495 0.028979696333408356
...
Cosine similarity = 0.9146569176233622
Manhattan distance : 64.79049000190571
Euclidean distance : 2.9332529214490783
===================================================
Strings 1 & 2 : wealthy yellow
Embedding 1 Embedding 2 |Difference|
0.21383394 0.22856697 0.014733031392097473
-0.0073103756 -0.30353695 0.2962265610694885
0.09099525 0.27586222 0.18486696481704712
...
Cosine similarity = 0.7631069329907343
Manhattan distance : 107.96488573867828
Euclidean distance : 5.212087989639763
===================================================
Strings 1 & 2 : wealthy blue
Embedding 1 Embedding 2 |Difference|
0.21383394 0.29995522 0.08612127602100372
-0.0073103756 -0.19767939 0.19036900997161865
0.09099525 0.30732605 0.21633079648017883
...
Cosine similarity = 0.7371959850353489
Manhattan distance : 124.55361186526716
Euclidean distance : 5.906763527768454
===================================================
Strings 1 & 2 : wealthy dumb
Embedding 1 Embedding 2 |Difference|
0.21383394 0.12908244 0.08475150167942047
-0.0073103756 -0.031146867 0.023836491629481316
0.09099525 0.24095681 0.14996156096458435
...
Cosine similarity = 0.7449719286008458
Manhattan distance : 101.83741049654782
Euclidean distance : 5.109428840001488
===================================================