instructor-embedding

Not getting exactly the same embedding for different batch sizes

Open kirnap opened this issue 2 years ago • 5 comments

Hi,

I recently discovered that the model.encode method does not give exactly the same embeddings for different batch_size values. However, they are still close when I loosen atol (absolute tolerance). Is this expected behaviour, or is it a bug?

Here is a minimal code snippet to reproduce the conflicting embeddings:



from InstructorEmbedding import INSTRUCTOR
import numpy as np

model = INSTRUCTOR('hkunlp/instructor-xl')

query_instruction = 'Represent the Movie query for retrieving similar movies or tv shows: '
s1 = 'word'

# Four identical (instruction, text) pairs, so every row should get the same embedding.
batch = [[query_instruction, s1],
         [query_instruction, s1],
         [query_instruction, s1],
         [query_instruction, s1]]

# Encode the same batch with different batch_size values.
bbig2 = model.encode(batch, batch_size=2)
bbig4 = model.encode(batch, batch_size=4)
bbig1 = model.encode(batch, batch_size=1)

if not np.allclose(bbig4, bbig1, atol=1e-8):
    print('Different batchsize is not close for 1e-8 absolute tolerance')
if np.allclose(bbig4, bbig1, atol=1e-7):
    print('Different batchsize is close enough for 1e-7 absolute tolerance')

This prints out the following results:

Different batchsize is not close for 1e-8 absolute tolerance
Different batchsize is close enough for 1e-7 absolute tolerance

Thanks in advance!

kirnap avatar Aug 10 '23 16:08 kirnap

Any more findings on this yet?

aditya-y47 avatar Sep 25 '23 04:09 aditya-y47

Not from my end

kirnap avatar Sep 25 '23 10:09 kirnap

This is most likely something to do with the underlying HF transformers package. There has been a lot of finger pointing, but unfortunately no resolution at this point. Relevant GitHub issues:
https://github.com/UKPLab/sentence-transformers/issues/2312
https://github.com/huggingface/transformers/issues/2401
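
To help narrow it down, here is a rough sketch that checks whether the batch-size dependence already shows up with plain transformers mean pooling, without InstructorEmbedding in the loop at all. The model name is just an illustrative small encoder I picked for the check, not the instructor-xl backbone:

# Hypothetical check: does batched vs. unbatched encoding differ with plain transformers?
# The model below is only an illustrative choice for this experiment.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/all-MiniLM-L6-v2"  # assumption: any small encoder works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

sentences = ["word"] * 4

def embed(texts, batch_size):
    """Mean-pool the last hidden state, processing `texts` in chunks of `batch_size`."""
    chunks = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            enc = tokenizer(texts[i:i + batch_size], padding=True, return_tensors="pt")
            out = model(**enc).last_hidden_state          # (batch, seq, hidden)
            mask = enc["attention_mask"].unsqueeze(-1)    # (batch, seq, 1)
            pooled = (out * mask).sum(1) / mask.sum(1)    # masked mean pooling
            chunks.append(pooled)
    return torch.cat(chunks).numpy()

e1 = embed(sentences, batch_size=1)
e4 = embed(sentences, batch_size=4)
print("max abs diff:", np.abs(e1 - e4).max())

If the max abs diff is non-zero here as well, the effect is coming from batched kernels / padding paths in the underlying stack rather than from this repo.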

dkirman-re avatar Nov 20 '23 18:11 dkirman-re

I'm having the same issue. I tried manipulating other things, such as the order and content of the batch, but the only factor that affects this is the batch size.

eyalyoli avatar Jan 24 '24 07:01 eyalyoli

Same here. I'm getting different embeddings for different batch_size values. The embeddings start to differ at about the 7th decimal place.
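
To put a number on how big the gap actually is, a quick check like the following (reusing the bbig1 / bbig4 arrays from the snippet in the issue description) reports the maximum absolute difference and the per-row cosine similarity between the two runs:

# Quantify the discrepancy between the batch_size=1 and batch_size=4 runs.
import numpy as np

diff = np.abs(bbig1 - bbig4)
print("max abs diff:", diff.max())

# Row-wise cosine similarity between the two runs.
num = (bbig1 * bbig4).sum(axis=1)
den = np.linalg.norm(bbig1, axis=1) * np.linalg.norm(bbig4, axis=1)
print("cosine similarity per row:", num / den)

If the cosine similarities come out at ~1.0, the differences are within float32 rounding noise and retrieval rankings are unlikely to be affected.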

ayalaall avatar Jan 24 '24 09:01 ayalaall