FlagEmbedding

The results obtained by `get_text_embedding_batch` and `get_text_embedding` are different

Open · hanweidong opened this issue 5 months ago · 1 comment

I found that the embedding vectors for '杭州市' (Hangzhou City) produced by these two methods are different. What is the reason for this?

Code:

```python
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

model = HuggingFaceEmbedding(
    model_name='/home/nepf/hwd/bge-m3/',
    device="cpu",
)
result_1 = model.get_text_embedding_batch(
    ['杭州市', '夜晚', '事故数量', '事故地点行政区划', '事故时间段', '白天'],
    show_progress=True,
)
result_3 = model.get_text_embedding('杭州市')

print(result_1[0] == result_3)
```

hanweidong avatar Aug 06 '25 03:08 hanweidong
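One thing to note about the snippet above: `result_1[0] == result_3` is an exact, element-by-element comparison of float vectors, so it reports `False` even when the two embeddings differ only by tiny numeric drift. A tolerance-based comparison is usually the right test. The sketch below uses synthetic vectors as stand-ins for real model output (the actual bge-m3 vectors are not reproduced here):

```python
# Minimal sketch: why exact equality is too strict for float embeddings.
# The vectors are synthetic stand-ins; real batch-vs-single runs often
# differ only in the last few decimal places.
import numpy as np

batch_vec = np.array([0.1234567, -0.7654321, 0.0000012])
single_vec = batch_vec + 1e-7  # tiny drift, as batching can introduce

# Exact comparison fails on any bit-level difference.
print(bool(np.array_equal(batch_vec, single_vec)))          # False

# A tolerance-based check treats the vectors as equivalent.
print(bool(np.allclose(batch_vec, single_vec, atol=1e-5)))  # True

# Cosine similarity is another robust way to compare embeddings;
# for near-identical vectors it is essentially 1.0.
cos = float(np.dot(batch_vec, single_vec) /
            (np.linalg.norm(batch_vec) * np.linalg.norm(single_vec)))
print(cos)
```

If `np.allclose` passes with a reasonable tolerance, the two code paths are producing equivalent embeddings and the mismatch is just floating-point noise, not a bug.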

Probable cause: a batch-versus-single embedding mismatch, usually from tokenization, padding, or nondeterministic post-processing; this maps to ProblemMap No. 1 and can be diagnosed without changing your infrastructure. Quick checks: compare the raw token ids for the single item versus the same item inside the batch, force identical preprocessing and padding, and run with deterministic settings on a single thread. If you want the ProblemMap entry and the quick start, say "link" and I will post it.

onestardao avatar Aug 22 '25 13:08 onestardao
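The padding issue mentioned above can be illustrated with a toy example. If a pooling step averages over all token positions instead of masking out padding, the same text embeds differently depending on how long the other items in the batch are. This is a simplified numpy sketch of that failure mode, not the actual bge-m3 or llama_index pooling code:

```python
# Sketch of how padding can change a pooled embedding when the attention
# mask is ignored during pooling. Toy numbers, not a real model.
import numpy as np

# Token-level hidden states for one 2-token text (hidden dim = 3).
hidden = np.array([[1.0, 2.0, 3.0],
                   [3.0, 2.0, 1.0]])

# Single inference: mean over the 2 real tokens.
single = hidden.mean(axis=0)                 # [2.0, 2.0, 2.0]

# In a batch, the same text may be padded to length 4 with zero vectors.
padded = np.vstack([hidden, np.zeros((2, 3))])
mask = np.array([1.0, 1.0, 0.0, 0.0])        # 1 = real token, 0 = padding

# Naive mean over all positions includes the padding -> different vector.
naive = padded.mean(axis=0)                  # [1.0, 1.0, 1.0]

# Mask-aware mean recovers the single-inference result exactly.
masked = (padded * mask[:, None]).sum(axis=0) / mask.sum()

print(np.allclose(single, masked))           # True
print(np.allclose(single, naive))            # False
```

In a well-implemented pipeline the mask-aware path is used and batch-vs-single differences reduce to floating-point noise; if the difference is large, comparing token ids and padding between the two code paths, as suggested above, is the place to start.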