Not getting exactly the same embedding for different batch sizes
Hi,
I recently discovered that the model.encode method does not return exactly the same embeddings for different batch_size values. The results are still close when I loosen atol (absolute tolerance) in np.allclose. Is this expected behaviour or a bug?
Here is a minimal code snippet that reproduces the differing embeddings:
import numpy as np
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-xl')
query_instruction = 'Represent the Movie query for retrieving similar movies or tv shows: '
s1 = 'word'
# Four identical [instruction, text] pairs, so every row should yield the same embedding.
batch = [[query_instruction, s1],
         [query_instruction, s1],
         [query_instruction, s1],
         [query_instruction, s1]]
# Encode the same inputs with three different batch sizes.
bbig2 = model.encode(batch, batch_size=2)
bbig4 = model.encode(batch, batch_size=4)
bbig1 = model.encode(batch, batch_size=1)
if not np.allclose(bbig4, bbig1, atol=1e-8):
    print('Different batchsize is not close for 1e-8 absolute tolerance')
if np.allclose(bbig4, bbig1, atol=1e-7):
    print('Different batchsize is close enough for 1e-7 absolute tolerance')
This prints out the following results:
Different batchsize is not close for 1e-8 absolute tolerance
Different batchsize is close enough for 1e-7 absolute tolerance
Thanks in advance!
Any more findings on this yet?
Not from my end
Most likely this comes from the underlying HF transformers package. There has been a lot of finger-pointing, but unfortunately no resolution so far. Relevant GitHub issues: https://github.com/UKPLab/sentence-transformers/issues/2312 https://github.com/huggingface/transformers/issues/2401
I'm having the same issue. I tried changing other things, such as the order and the content of the batch, but the only factor that affects the result is the batch size.
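For anyone who wants to quantify how far apart two sets of embeddings are, rather than only getting a yes/no from np.allclose, here is a minimal sketch. The helper name compare_embeddings is hypothetical, and the arrays are synthetic stand-ins (random float32 values with an assumed dimension of 768 and perturbations of about 1e-7), not real model.encode output.

```python
import numpy as np

def compare_embeddings(a, b):
    """Return (largest absolute difference, smallest row-wise cosine similarity)."""
    # Compute the statistics in float64 so the comparison itself adds no float32 error.
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    max_abs_diff = float(np.max(np.abs(a - b)))
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return max_abs_diff, float(np.min(cos))

# Synthetic stand-ins for two encode() results (assumption: 4 rows, 768 dimensions).
rng = np.random.default_rng(0)
emb_a = rng.standard_normal((4, 768)).astype(np.float32)
noise = rng.uniform(-1e-7, 1e-7, size=emb_a.shape)
emb_b = (emb_a.astype(np.float64) + noise).astype(np.float32)

max_diff, min_cos = compare_embeddings(emb_a, emb_b)
print(f"max abs diff: {max_diff:.3e}, min cosine similarity: {min_cos:.9f}")
```

With real data you would pass the arrays returned for two different batch_size values. If the smallest cosine similarity is indistinguishable from 1, the difference should not matter for retrieval-style use.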
Same here. I'm getting different embeddings for different batch_size values. They start to differ at about the seventh decimal place.
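A difference around the seventh decimal place is about the size of float32 machine epsilon, so one plausible explanation (an assumption, not something confirmed for this model) is that different batch sizes lead to floating-point operations being carried out in a different order. Float32 addition is not associative, which this small NumPy sketch illustrates without involving the model at all:

```python
import numpy as np

# Machine epsilon for float32: the gap between 1.0 and the next representable value.
eps = np.finfo(np.float32).eps
print("float32 eps:", eps)

one = np.float32(1.0)
tiny = np.float32(2.0 ** -24)  # exactly half of eps

# Same three numbers, two different groupings.
left = (one + tiny) + tiny   # each addition rounds back to 1.0
right = one + (tiny + tiny)  # tiny + tiny is exactly eps, so the sum is 1.0 + eps

print("left == right:", left == right)
print("difference:", right - left)
```

If that is what is happening here, the embeddings are equal up to rounding, and comparing them with a tolerance of a few multiples of float32 eps is more meaningful than expecting bit-for-bit equality.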