
FAISS embedding count error

Open OMGAmici opened this issue 11 months ago • 0 comments

Describe the bug When adding documents to the FAISS document store using the EmbeddingRetriever with sentence-transformers/msmarco-bert-base-dot-v5, I occasionally end up with more embeddings than documents, which causes a misalignment between the .DB and .FAISS files that get saved. I have traced this to certain characters in my documents: once these offending characters are removed, the error disappears.

Error message Embedding count in .FAISS does not match the document count in .DB file

Expected behavior Documents are indexed to the FAISS document store, and the number of embeddings generated equals the number of documents.

Additional context Below is the remedial code I have had to add before pre-processing and indexing to get my documents to index correctly. The list keeps growing because I am still tracking down all of the problematic characters.

text = text.replace('»', '').replace('\'', '').replace('/', '').replace(':', '-').replace('(', '').replace(')', '').replace('*', '')
text = text.encode('ascii', errors='ignore').decode()

The encode/decode line drops any non-ASCII characters (which includes non-English text), since these also seem to interfere with proper indexing.
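
As the list of characters grows, the chained `replace()` calls could be folded into a single translation table. This is only a sketch of the same workaround, not a fix for the underlying problem; `sanitize` and `_TABLE` are hypothetical names.

```python
# Characters mapped to None are deleted; ':' becomes '-', mirroring the
# replace() chain above. New offenders can be added as extra entries.
_TABLE = str.maketrans({
    "»": None,
    "'": None,
    "/": None,
    "(": None,
    ")": None,
    "*": None,
    ":": "-",
})


def sanitize(text: str) -> str:
    """Apply the character workaround, then drop anything outside ASCII."""
    text = text.translate(_TABLE)
    return text.encode("ascii", errors="ignore").decode()
```

Keeping the characters in one table makes it easier to see at a glance what is being stripped before indexing.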

To Reproduce I am using the Cohere Wikipedia dataset on HuggingFace and trying to index the 'text' values as passages in the document store. The first million passages index without issue, but after that the character issue starts to appear.
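
To help build the full list of problematic characters, the passages could be scanned for anything outside a conservative "safe" set. The sketch below uses only the standard library; the contents of `_SAFE` are an assumption chosen for illustration, and `suspicious_chars` is a hypothetical helper.

```python
import string
from collections import Counter
from typing import Iterable

# Assumption: letters, digits, and a little basic punctuation are harmless.
_SAFE = set(string.ascii_letters + string.digits + " .,;-\n")


def suspicious_chars(passages: Iterable[str]) -> Counter:
    """Count every character that falls outside the assumed safe set."""
    counts: Counter = Counter()
    for passage in passages:
        counts.update(ch for ch in passage if ch not in _SAFE)
    return counts
```

Running this over the passages beyond the first million and calling `.most_common()` on the result would list candidate characters to test one at a time.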

FAQ Check

System:

  • OS: Linux
  • GPU/CPU: GPU
  • Haystack version (commit or version number): 1.17.1
  • DocumentStore: FAISS
  • Reader: NA
  • Retriever: EmbeddingRetriever

OMGAmici, Sep 21 '23 20:09