
Results difficult to explain

lbonansbrux opened this issue on Apr 24 '20 · 2 comments

Dear Rob, I do not know whether this is a bug or not, but I am getting strange results, as shown below. I compare the embeddings of two words, and the average absolute difference (over the 768 values) is lower for words with a different meaning than for synonyms.

I would have expected a lower difference for rich and a greater one for poor. Where am I going wrong? Thank you.
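For reference, the "absolute difference" values below are per dimension and multiplied by 100, and the average is taken over all 768 values; roughly what I compute (simplified, untested Java sketch):

```java
// Average absolute per-dimension difference between two embeddings, scaled by 100.
public static double meanAbsDiffTimes100(float[] a, float[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
        sum += Math.abs(a[i] - b[i]);
    }
    return 100.0 * sum / a.length;
}
```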

Example 1:

String 1: wealthy
String 2: poor
Embedding 1	Embedding 2	100 * absolute difference
0.21383394	0.23239951	2.0
-0.0073103756	-0.057594057	5.0
0.09099525	0.11997495	3.0
...
Absolute difference average (× 100) : 8

Example 2:

String 1: wealthy
String 2: blue
Embedding 1	Embedding 2	100 * absolute difference
0.21383394	0.29995522	9.0
-0.0073103756	-0.19767939	19.0
...
Absolute difference average (× 100) : 16

Example 3:

String 1: wealthy
String 2: rich
Embedding 1	Embedding 2	100 * absolute difference
0.21383394	0.14642045	7.0
-0.0073103756	-0.108990476	10.0
0.09099525	0.25123212	16.0
0.069340415	-0.12602457	20.0
...
Absolute difference average (× 100) : 11

Example 4:

String 1: wealthy
String 2: black
Embedding 1	Embedding 2	100 * absolute difference
0.21383394	0.22277042	1.0
-0.0073103756	-0.25720397	25.0
0.09099525	0.16640717	8.0
...
Absolute difference average (× 100) : 11

lbonansbrux commented on Apr 24 '20

Hey,

Are you using the token embeddings or the sequence embeddings in this case?

In my experience, the BERT sequence embeddings in particular (but sometimes also the token embeddings) don't do as good a job in raw distance calculations for semantic similarity as some other models. This is basically just a result of the tasks BERT is trained for and the transformer architecture it uses. Generally you might have better luck with cosine distance, as you won't have to worry about effects from embedding magnitudes.
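If it helps, cosine similarity over two of the float[] embeddings is only a few lines of Java; something like this should do it (untested sketch, the class and method names are just placeholders, not part of easy-bert):

```java
public final class EmbeddingMath {

    // Cosine similarity: dot product divided by the product of the vector norms,
    // so the magnitudes of the two embeddings cancel out of the score.
    public static double cosineSimilarity(float[] a, float[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

A score near 1 means the two vectors point in roughly the same direction, independent of their lengths, which is the part raw Manhattan/Euclidean distances mix up with magnitude.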

That said, if you're looking to do this sort of thing (especially with individual words), you might want to look into a different model like Universal Sentence Encoder, ELMo, GloVe, etc. that's designed to better support semantic similarity w/simple distance metrics.

robrua commented on Apr 25 '20

Hi Rob!

Thanks for your feedback.

I am using the sequence embedding, which returns a float[]. The token embeddings return a float[][], and I don't know what to do with that to calculate a cosine similarity. Any idea?
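Maybe averaging the token vectors per dimension into a single float[] before taking the cosine would work? A rough, untested sketch of what I have in mind (the helper name is just mine, not from easy-bert):

```java
// Mean-pool per-token embeddings (float[tokens][dims]) into one float[dims] vector.
public static float[] meanPool(float[][] tokenEmbeddings) {
    int dims = tokenEmbeddings[0].length;
    float[] pooled = new float[dims];
    for (float[] token : tokenEmbeddings) {
        for (int d = 0; d < dims; d++) {
            pooled[d] += token[d];
        }
    }
    for (int d = 0; d < dims; d++) {
        pooled[d] /= tokenEmbeddings.length;
    }
    return pooled;
}
```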

Following your advice, the cosine similarity indeed seems more reliable than a Manhattan or Euclidean distance, as the Series 1 examples below show. Note, however, that in Series 1 poor still gets a higher similarity score with rich than wealthy does. And the Series 2 examples remain unsatisfying, since poor comes out far closer to wealthy than rich does. Unless the situation improves with token embeddings, I'm not sure that similarity in an n-dimensional space lets me conclude precisely that two words have the same meaning.

I will try other models (which may be more difficult to use from Java, but that is another story).
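For clarity, the Manhattan number below is the sum of the absolute per-dimension differences over the 768 values, and the Euclidean number is the square root of the sum of the squared differences; essentially computed like this (simplified, untested sketch):

```java
// L1 (Manhattan) distance: sum of absolute per-dimension differences.
public static double manhattanDistance(float[] a, float[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
        sum += Math.abs(a[i] - b[i]);
    }
    return sum;
}

// L2 (Euclidean) distance: square root of the sum of squared differences.
public static double euclideanDistance(float[] a, float[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
        double diff = a[i] - b[i];
        sum += diff * diff;
    }
    return Math.sqrt(sum);
}
```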

===================================================
EXAMPLES - FIRST SERIES
===================================================
Strings 1 & 2 : rich	wealthy
Embedding 1	Embedding 2	|Difference|
0.14642045	0.21383394	0.06741349399089813
-0.108990476	-0.0073103756	0.10168009996414185
0.25123212	0.09099525	0.16023686528205872
...
Cosine similarity = 0.8456862351607601
Manhattan distance : 87.27405425067991
Euclidean distance : 4.0416941747361665
===================================================
Strings 1 & 2 : rich	poor
Embedding 1	Embedding 2	|Difference|
0.14642045	0.23239951	0.08597905933856964
-0.108990476	-0.057594057	0.05139641836285591
0.25123212	0.11997495	0.13125717639923096
...
Cosine similarity = 0.8495916385797545
Manhattan distance : 87.00668240943924
Euclidean distance : 4.075972602921788
===================================================
Strings 1 & 2 : rich	yellow
Embedding 1	Embedding 2	|Difference|
0.14642045	0.22856697	0.0821465253829956
-0.108990476	-0.30353695	0.19454647600650787
0.25123212	0.27586222	0.024630099534988403
...
Cosine similarity = 0.8746363539860872
Manhattan distance : 83.51396567281336
Euclidean distance : 3.9368254652299233
===================================================
Strings 1 & 2 : rich	blue
Embedding 1	Embedding 2	|Difference|
0.14642045	0.29995522	0.15353477001190186
-0.108990476	-0.19767939	0.0886889100074768
0.25123212	0.30732605	0.05609393119812012
...
Cosine similarity = 0.8479855400362315
Manhattan distance : 97.23932060459629
Euclidean distance : 4.635928430207843
===================================================
Strings 1 & 2 : rich	dumb
Embedding 1	Embedding 2	|Difference|
0.14642045	0.12908244	0.01733800768852234
-0.108990476	-0.031146867	0.07784360647201538
0.25123212	0.24095681	0.010275304317474365
...
Cosine similarity = 0.8809615766086131
Manhattan distance : 77.28109940420836
Euclidean distance : 3.688068948233406
===================================================
EXAMPLES - SECOND SERIES
===================================================
Strings 1 & 2 : wealthy	rich
Embedding 1	Embedding 2	|Difference|
0.21383394	0.14642045	0.06741349399089813
-0.0073103756	-0.108990476	0.10168009996414185
0.09099525	0.25123212	0.16023686528205872
...
Cosine similarity = 0.8456862351607601
Manhattan distance : 87.27405425067991
Euclidean distance : 4.0416941747361665
===================================================
Strings 1 & 2 : wealthy	poor
Embedding 1	Embedding 2	|Difference|
0.21383394	0.23239951	0.01856556534767151
-0.0073103756	-0.057594057	0.050283681601285934
0.09099525	0.11997495	0.028979696333408356
...
Cosine similarity = 0.9146569176233622
Manhattan distance : 64.79049000190571
Euclidean distance : 2.9332529214490783
===================================================
Strings 1 & 2 : wealthy	yellow
Embedding 1	Embedding 2	|Difference|
0.21383394	0.22856697	0.014733031392097473
-0.0073103756	-0.30353695	0.2962265610694885
0.09099525	0.27586222	0.18486696481704712
...
Cosine similarity = 0.7631069329907343
Manhattan distance : 107.96488573867828
Euclidean distance : 5.212087989639763
===================================================
Strings 1 & 2 : wealthy	blue
Embedding 1	Embedding 2	|Difference|
0.21383394	0.29995522	0.08612127602100372
-0.0073103756	-0.19767939	0.19036900997161865
0.09099525	0.30732605	0.21633079648017883
...
Cosine similarity = 0.7371959850353489
Manhattan distance : 124.55361186526716
Euclidean distance : 5.906763527768454
===================================================
Strings 1 & 2 : wealthy	dumb
Embedding 1	Embedding 2	|Difference|
0.21383394	0.12908244	0.08475150167942047
-0.0073103756	-0.031146867	0.023836491629481316
0.09099525	0.24095681	0.14996156096458435
...
Cosine similarity = 0.7449719286008458
Manhattan distance : 101.83741049654782
Euclidean distance : 5.109428840001488
===================================================

lbonansbrux commented on May 02 '20