float32 JSON serialization error in FAISSDocumentStore with EntityExtractor in indexing pipeline
**Describe the bug**
With a `FAISSDocumentStore(faiss_index_factory_str="Flat")` document store I had a simple indexing pipeline (crawler -> preprocessor -> store) that was working well. When I introduced an `EntityExtractor` right after the preprocessor, saving the documents fails with an error. I think the cause is that the NER score in the document metadata is a NumPy `float32`, which is not JSON serializable.
**Error message**
```
Exception: Exception while running node 'document_store': (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely)
(builtins.TypeError) Object of type float32 is not JSON serializable
[SQL: INSERT INTO meta_document (id, name, value, document_id, document_index) VALUES (?, ?, ?, ?, ?)]
[parameters: [{'document_id': '661e6ab86b528239f63dd4b4a28fe9b', 'document_index': 'document', 'name': 'entities', 'value': [{'entity_group': 'LOC', 'score': 0.99963534, 'word': 'Norway', 'start': 7, 'end': 13}, {'entity_group': 'LOC', 'score': 0.99975854, 'word': 'Norway', 'start': 136, 'end': 142}]}]]
Enable debug logging to see the data that was passed when the pipeline failed.
```
**Expected behavior**
The documents, including the extracted entity metadata, are written to the document store without a serialization error.
**Additional context**
Maybe this could be solved by saving all of the NER output metadata as strings?
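For reference, a minimal sketch (plain Python, not Haystack-specific) of why the dtype matters: the standard `json` module only serializes built-in floats, and `numpy.float64` subclasses `float` while `numpy.float32` does not.

```python
import json

import numpy as np

score32 = np.float32(0.99963534)   # the dtype the NER model returns for `score`
score64 = np.float64(score32)      # np.float64 subclasses the built-in float

print(isinstance(score32, float))  # False -> json.dumps(score32) raises TypeError
print(isinstance(score64, float))  # True  -> json.dumps(score64) works
print(json.dumps(score64))         # prints the widened value
print(json.dumps(float(score32)))  # casting to a built-in float also works
```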
**To Reproduce**
```python
# Note: I am on an Apple M1 GPU, so the torch device is set to "mps";
# remove the `devices` argument to reproduce on other hardware.
import torch

from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import Crawler, EntityExtractor, PreProcessor
from haystack.pipelines import Pipeline

document_store = FAISSDocumentStore(faiss_index_factory_str="Flat")

# Crawling pipeline
crawler = Crawler(
    urls=["https://www.lonelyplanet.com/norway"],  # websites to crawl
    filter_urls=["norway"],
    crawler_depth=1,  # how many links to follow
    output_dir=None,  # where to store the crawled files; unused in this example
)
entity_extractor = EntityExtractor(
    model_name_or_path="dslim/bert-base-NER", devices=[torch.device("mps")]
)
processor = PreProcessor(
    clean_empty_lines=False,
    clean_whitespace=False,
    clean_header_footer=False,
    split_by="sentence",
    split_length=30,
    split_respect_sentence_boundary=False,
    split_overlap=0,  # try changing this in the future :)
)

indexing_pipeline = Pipeline()
indexing_pipeline.add_node(component=crawler, name="crawler", inputs=["File"])
indexing_pipeline.add_node(component=processor, name="processor", inputs=["crawler"])
indexing_pipeline.add_node(component=entity_extractor, name="EntityExtractor", inputs=["processor"])
indexing_pipeline.add_node(component=document_store, name="document_store", inputs=["EntityExtractor"])
indexing_pipeline.run()
```
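Until this is fixed upstream, one possible workaround (my own sketch, not an official fix) is to cast the scores to built-in floats before writing the documents, assuming `docs` holds the `Document` objects coming out of the `EntityExtractor`:

```python
# Hypothetical workaround: cast the float32 scores before indexing.
# `docs` is assumed to be the list of Documents produced by the EntityExtractor.
for doc in docs:
    for entity in doc.meta.get("entities", []):
        entity["score"] = float(entity["score"])  # np.float32 -> built-in float

document_store.write_documents(docs)
```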
**FAQ Check**
- [x] Have you had a look at our new FAQ page?
**System:**
- OS: macOS
- GPU/CPU: M1
- Haystack version (commit or version number): 1.17.1
- DocumentStore: FAISSDocumentStore
- Reader:
- Retriever:
Hey @Haradai, thanks for looking into this! It also looks like if we were to switch the type to float64, this error would be avoided. Take a look here.
I think this would be the better solution for now, since I don't think JSON serializability is a requirement for every document store. If you'd be willing to open a PR for this, that would be greatly appreciated!
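As a sketch of the suggested change (the helper name and where the cast would live in Haystack are my assumptions, not the actual implementation):

```python
import numpy as np

def cast_entity_scores(entities):
    """Hypothetical helper: cast each entity's float32 score to np.float64.

    np.float64 subclasses Python's built-in float, so the metadata stays
    JSON serializable without converting the score to a string.
    """
    for entity in entities:
        entity["score"] = np.float64(entity["score"])
    return entities

# Example with the metadata shape from the error message above:
entities = [
    {"entity_group": "LOC", "score": np.float32(0.99963534),
     "word": "Norway", "start": 7, "end": 13},
]
cast_entity_scores(entities)
```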
@sjrl Alright! I might give it a try, thanks!
Created PR #5750 for this issue. Please let me know if anything additional needs to be added.