
float32 json serialization error in FAISSDocumentStore with EntityExtractor in indexing pipeline

Open Haradai opened this issue 2 years ago • 2 comments

Describe the bug With a FAISSDocumentStore(faiss_index_factory_str="Flat") document store I had a simple pipeline, crawler -> preprocessor -> store, that was working well. When I introduced an EntityExtractor just after the preprocessor, I get an error when saving the documents. I think it could be because the NER score in the document metadata is a NumPy float32, which is not JSON serializable.

Error message

Exception: Exception while running node 'document_store': (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely)
(builtins.TypeError) Object of type float32 is not JSON serializable
[SQL: INSERT INTO meta_document (id, name, value, document_id, document_index) VALUES (?, ?, ?, ?, ?)]
[parameters: [{'document_id': '661e6ab86b528239f63dd4b4a28fe9b', 'document_index': 'document', 'name': 'entities', 'value': [{'entity_group': 'LOC', 'score': 0.99963534, 'word': 'Norway', 'start': 7, 'end': 13}, {'entity_group': 'LOC', 'score': 0.99975854, 'word': 'Norway', 'start': 136, 'end': 142}]}]]
Enable debug logging to see the data that was passed when the pipeline failed.

Expected behavior The documents, including the entity metadata added by the EntityExtractor, should be written to the FAISSDocumentStore without a serialization error.

Additional context Maybe this could be solved by saving all the output metadata of the NER extraction as strings?
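As a possible user-side workaround until this is fixed upstream, the extracted metadata could be sanitized before indexing. A minimal sketch (the `sanitize_meta` helper name is hypothetical, not part of Haystack):

```python
import json
import numpy as np

def sanitize_meta(meta):
    """Recursively cast NumPy floating-point values to plain Python floats
    so the metadata becomes JSON serializable."""
    def convert(value):
        if isinstance(value, np.floating):
            return float(value)
        if isinstance(value, dict):
            return {k: convert(v) for k, v in value.items()}
        if isinstance(value, list):
            return [convert(v) for v in value]
        return value
    return convert(meta)

# Example metadata shaped like the EntityExtractor output from the error above
meta = {"entities": [{"entity_group": "LOC", "score": np.float32(0.99963534),
                      "word": "Norway", "start": 7, "end": 13}]}
clean = sanitize_meta(meta)
print(json.dumps(clean))  # no longer raises TypeError
```

This could be applied to each document's `meta` between the EntityExtractor and the document store, e.g. in a small custom node.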

To Reproduce

# Note: I am using an M1 GPU, so the torch device is set to mps; remove that argument to reproduce on other hardware.
from haystack.nodes import EntityExtractor
from haystack.pipelines import Pipeline
from haystack.nodes import Crawler, PreProcessor, BM25Retriever, FARMReader

import torch

from haystack.document_stores import FAISSDocumentStore
document_store = FAISSDocumentStore(faiss_index_factory_str="Flat")
#crawling pipeline 
crawler = Crawler(
    urls=["https://www.lonelyplanet.com/norway"],   # Websites to crawl
    filter_urls=["norway"],  # Only follow URLs containing "norway"
    crawler_depth=1,    # How many links to follow
    output_dir=None  # The directory to store the crawled files, not very important, we don't use the files in this example
)


entity_extractor = EntityExtractor(model_name_or_path="dslim/bert-base-NER", devices=[torch.device("mps")])

processor = PreProcessor(
    clean_empty_lines=False,
    clean_whitespace=False,
    clean_header_footer=False,
    split_by="sentence",
    split_length=30,
    split_respect_sentence_boundary=False,
    split_overlap=0  # try changing this in the future :)
)

indexing_pipeline = Pipeline()
indexing_pipeline.add_node(component=crawler, name="crawler", inputs=['File'])
indexing_pipeline.add_node(component=processor, name="processor", inputs=['crawler'])
indexing_pipeline.add_node(component=entity_extractor, name="EntityExtractor", inputs=["processor"])
indexing_pipeline.add_node(component=document_store, name="document_store", inputs=['EntityExtractor'])
indexing_pipeline.run()


System:

  • OS: macOS
  • GPU/CPU: M1
  • Haystack version (commit or version number): 1.17.1
  • DocumentStore: FAISSDocumentStore
  • Reader:
  • Retriever:

Haradai avatar Jun 09 '23 23:06 Haradai

Hey @Haradai, thanks for looking into this! It also looks like switching the type to float64 would avoid this error. Take a look here.

I think this would be the better solution for now since I don't think json serializability is a requirement for every document store. If you'd be willing to open a PR for this that would be greatly appreciated!
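The reason a float64 switch would help is that NumPy's float64 subclasses Python's built-in float, so the standard json module accepts it, while float32 does not. A quick demonstration:

```python
import json
import numpy as np

# np.float64 is a subclass of Python's float, so json.dumps accepts it.
print(json.dumps({"score": np.float64(0.99963534)}))

# np.float32 is not, which triggers the TypeError from the traceback above.
try:
    json.dumps({"score": np.float32(0.99963534)})
    float32_ok = True
except TypeError:
    float32_ok = False
print(f"float32 serializable: {float32_ok}")
```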

sjrl avatar Jun 12 '23 14:06 sjrl

@sjrl Alright! I might give it a try, thanks!

Haradai avatar Jun 13 '23 18:06 Haradai

Created PR #5750 for this issue. Please let me know if anything additional needs to be added.

w1gs avatar Sep 12 '23 16:09 w1gs