
Reopen Issue #1019

Open 4ut0m8NT opened this issue 2 years ago • 6 comments

Describe the bug Loading an existing FAISS document store from a saved index/config no longer works in 1.18.1.

It runs once, works, and performs Q/A. Reloading fails.

Error message ValueError: The number of documents in the SQL database (96) doesn't match the number of embeddings in FAISS (0). Make sure your FAISS configuration file points to the same database that you used when you saved the original index.
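The message compares two counts: the rows in the SQL database and the vectors in the FAISS index. Below is a minimal sketch of that comparison using the numbers from this report. The function name `check_counts` is hypothetical, and this is not Haystack's actual implementation.

```python
def check_counts(sql_documents: int, faiss_embeddings: int) -> None:
    """Raise ValueError when the two stores disagree (illustrative only)."""
    if sql_documents != faiss_embeddings:
        raise ValueError(
            f"The number of documents in the SQL database ({sql_documents}) "
            f"doesn't match the number of embeddings in FAISS ({faiss_embeddings})."
        )


# Numbers from the report: 96 documents in SQL, 0 vectors in the FAISS index.
try:
    check_counts(sql_documents=96, faiss_embeddings=0)
except ValueError as err:
    print(err)
```

A FAISS count of 0 suggests the index itself was saved empty, rather than the config pointing at the wrong database.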

Expected behavior The Q/A app loads and works just like on the first run.

Additional context Test Doc = converted PDF.

PreProcessing:

  from haystack.nodes import PDFToTextConverter, PreProcessor

  converter = PDFToTextConverter(remove_numeric_tables=True)
  #doc_pdf = converter.convert(file_path="data/preprocessing_tutorial/bert.pdf", meta=None)
  doc = converter.convert(file_path=filename, meta={'name':str(filename)})

  processor = PreProcessor(
      clean_empty_lines=True,
      clean_whitespace=True,
      clean_header_footer=True,
      split_by="word",
      split_length=200,
      split_respect_sentence_boundary=True,
      split_overlap=0
    )
  docs = processor.process(doc)
  print (docs)
  document_store.write_documents(docs)
  document_store.save(index_path="./faissshift.index", config_path="./faiss.json")  # custom paths
  document_store.save("my_faiss")  # second save, to see if your example worked better... :(

To Reproduce Use farm-haystack 1.18.1

Run an EmbeddingRetriever with 384-dimensional embeddings.

Attempt to reload a 2nd time.

FAQ Check

System:
  OS: Ubuntu
  GPU/CPU: GPU
  Haystack version (commit or version number): 1.18.1
  DocumentStore: FAISSDocumentStore
  Reader: deepset/deberta-v3-base-injection
  Retriever: EmbeddingRetriever - sentence-transformers/all-MiniLM-L6-v2 (requires 384 dim)

my_faiss.json: {"faiss_index_factory_str": "Flat", "embedding_dim": 384, "index": "documents", "similarity": "cosine", "embedding_field": "question_emb", "sql_url": "sqlite:///faiss_document_store.db"}
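Since the error message asks whether the config points at the same database, here is a small standard-library sketch that reads the config above and reports which SQLite file `sql_url` resolves to. The helper names `sqlite_path_from_url` and `describe_config` are hypothetical. It assumes the usual SQLAlchemy convention that everything after `sqlite:///` is the file path, relative to the current working directory unless it starts with `/`.

```python
import json
import os

SQLITE_PREFIX = "sqlite:///"


def sqlite_path_from_url(sql_url: str) -> str:
    """Return the file path part of a SQLAlchemy-style SQLite file URL."""
    if not sql_url.startswith(SQLITE_PREFIX):
        raise ValueError(f"not a SQLite file URL: {sql_url}")
    return sql_url[len(SQLITE_PREFIX):]


def describe_config(config_text: str) -> dict:
    """Summarize the fields that matter when reloading the store."""
    config = json.loads(config_text)
    db_path = sqlite_path_from_url(config["sql_url"])
    return {
        "embedding_dim": config["embedding_dim"],
        "db_path": db_path,
        # A relative path is resolved against the current working directory.
        "db_exists": os.path.exists(db_path),
    }


# The config exactly as posted in this issue.
config_text = (
    '{"faiss_index_factory_str": "Flat", "embedding_dim": 384, '
    '"index": "documents", "similarity": "cosine", '
    '"embedding_field": "question_emb", '
    '"sql_url": "sqlite:///faiss_document_store.db"}'
)
print(describe_config(config_text))
```

Because the path is relative, running the app from a different directory would make it look for a different database file.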

my_faiss (index) (binary): "IxFI�^A^@^@^@^@^@^@^@^@^@^@^@^@^P^@^@^@^@^@^@^@^P^@^@^@^@^@^A^@^@^@^@^@^@^@^@^@^@^@^@"
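The dump above can be read as a small header. The sketch below decodes its first 16 bytes under two loudly stated assumptions: that the serialized index starts with a 4-byte type tag followed by a little-endian int32 dimension and int64 vector count (an internal FAISS detail that may change between versions), and that the unprintable byte shown as � is 0x80. The function name `read_flat_index_header` is hypothetical.

```python
import struct


def read_flat_index_header(data: bytes) -> tuple:
    """Return (type_tag, dimension, vector_count) from the first 16 bytes.

    ASSUMPTION: layout is a 4-byte ASCII tag, then a little-endian int32
    dimension, then a little-endian int64 vector count.
    """
    if len(data) < 16:
        raise ValueError("need at least 16 bytes")
    type_tag = data[:4].decode("ascii")
    dimension, vector_count = struct.unpack_from("<iq", data, 4)
    return type_tag, dimension, vector_count


# Start of the dump: "IxFI", then 0x80 0x01 0x00 0x00, then eight zero bytes.
header = b"IxFI" + bytes([0x80, 0x01, 0x00, 0x00]) + bytes(8)
print(read_flat_index_header(header))  # ('IxFI', 384, 0)
```

Read this way, the file holds a 384-dimensional index with zero vectors, which matches the "(0)" in the error message. With the faiss package installed, `faiss.read_index("my_faiss").ntotal` is the supported way to get the same count.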

Please advise.

Also added to closed ticket #1019.

4ut0m8NT avatar Jul 20 '23 19:07 4ut0m8NT

Hello @4ut0m8NT, we usually don't monitor closed issues.

Does this help? https://github.com/deepset-ai/haystack/issues/3961#issuecomment-1406213631

anakin87 avatar Jul 20 '23 19:07 anakin87

Thanks @anakin87, but this isn't a syntax issue:

document_store = FAISSDocumentStore.load(index_path="my_faiss", config_path="my_faiss.json")

It produces the "ValueError: The number of documents in the SQL database (96) doesn't " error if the DB or index exists...

Please advise.

4ut0m8NT avatar Jul 20 '23 19:07 4ut0m8NT

document_store = FAISSDocumentStore(faiss_config_path="./my_faiss.json", faiss_index_path="./my_faiss")

Also a Fail. Please advise.

4ut0m8NT avatar Jul 20 '23 23:07 4ut0m8NT

Yes, I get this as well. If I blow away the index and config files, the FAISSDocumentStore works just fine. However, the save and load process no longer works.

demongolem-biz2 avatar Nov 02 '23 02:11 demongolem-biz2

OK, so I think the tutorial I was following at https://haystack.deepset.ai/integrations/faiss-document-store, which uses FAISS to perform semantic search, needs to be updated because it does not show the process of saving the DocumentStore. I was calling save(), but I did not call update_embeddings(), which was the crucial part I was missing. And of course you have to call update_embeddings() first and save() second, so that the counts match when you go to save.

The tutorial has two parts: the indexing pipeline followed by the query pipeline. The indexing pipeline sets up the FAISSDocumentStore and indexes the documents. update_embeddings() needs to be performed after this indexing is complete and before the query pipeline runs. I was anticipating it would be done during the indexing pipeline; however, it is only after the EmbeddingRetriever has been created as part of the query pipeline that update_embeddings() is run and save() performed. For normal usage you would want to save rather than rerun this code over and over again, which is why this process should be mentioned in the tutorial.
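The ordering described above can be sketched as follows. `index_and_save` and `run_example` are hypothetical helper names. The first is duck-typed and only assumes the store offers the Haystack 1.x methods write_documents(), update_embeddings(), get_document_count(), get_embedding_count() and save(). The second shows one way it might be wired up with the settings from the config posted earlier in this thread, and needs farm-haystack 1.x with FAISS support installed.

```python
def index_and_save(document_store, retriever, docs, index_path, config_path):
    """Write documents, compute embeddings, then save, in that order."""
    document_store.write_documents(docs)
    # Per the comment above, this is the step that was missing before save().
    document_store.update_embeddings(retriever)

    n_docs = document_store.get_document_count()
    n_embeddings = document_store.get_embedding_count()
    if n_docs != n_embeddings:
        # Refuse to save an index that would fail the count check on load.
        raise RuntimeError(
            f"{n_docs} documents but {n_embeddings} embeddings; not saving"
        )

    document_store.save(index_path=index_path, config_path=config_path)
    return n_docs


def run_example(docs):
    """Not called automatically; requires farm-haystack 1.x with FAISS."""
    from haystack.document_stores import FAISSDocumentStore
    from haystack.nodes import EmbeddingRetriever

    store = FAISSDocumentStore(
        sql_url="sqlite:///faiss_document_store.db",
        faiss_index_factory_str="Flat",
        embedding_dim=384,  # all-MiniLM-L6-v2 produces 384-dim vectors
        similarity="cosine",
    )
    retriever = EmbeddingRetriever(
        document_store=store,
        embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    )
    return index_and_save(store, retriever, docs, "my_faiss", "my_faiss.json")
```

Checking the counts before save() is a design choice in this sketch, so that a mismatch shows up at indexing time instead of at the next reload.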

demongolem-biz2 avatar Nov 02 '23 13:11 demongolem-biz2

Initializing a FAISSDocumentStore can take 'faiss_index' and can also take 'index'. When initializing with 'index', I also got the mismatched count error. I checked the code, and the index param is ignored. So it seems there's an issue with the docs, plus confusing naming in the params.

augchan42 avatar Jan 25 '24 12:01 augchan42