
SQL-based DocumentStores fail when document metadata contains a list

Open sridhar opened this issue 2 years ago • 10 comments

This issue is easily reproducible in FAISSDocumentStore and SQLDocumentStore. When the value of a key in doc.meta is set to a list, document_store.write_documents fails. This happens when I'm using TikaConverter to convert a directory of files, some of which have lists in their metadata.

Error message:

```
File "/home/sridhar/devel/doc_intelligence/haystack/haystack/document_stores/faiss.py", line 295, in write_documents
  super(FAISSDocumentStore, self).write_documents(
File "/home/sridhar/devel/doc_intelligence/haystack/haystack/document_stores/sql.py", line 401, in write_documents
  self.session.query(MetaDocumentORM).filter_by(document_id=doc.id).delete()
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/query.py", line 3209, in delete
  result = self.session.execute(
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1660, in execute
  ) = compile_state_cls.orm_pre_session_exec(
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/persistence.py", line 1829, in orm_pre_session_exec
  session._autoflush()
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 2257, in _autoflush
  util.raise_(e, with_traceback=sys.exc_info()[2])
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
  raise exception
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 2246, in _autoflush
  self.flush()
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 3383, in flush
  self._flush(objects)
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 3523, in _flush
  transaction.rollback(capture_exception=True)
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
  compat.raise_(
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
  raise exception
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 3483, in _flush
  flush_context.execute()
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/unitofwork.py", line 456, in execute
  rec.execute(self)
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/unitofwork.py", line 630, in execute
  util.preloaded.orm_persistence.save_obj(
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/persistence.py", line 245, in save_obj
  _emit_insert_statements(
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/persistence.py", line 1238, in _emit_insert_statements
  result = connection._execute_20(
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1631, in _execute_20
  return meth(self, args_10style, kwargs_10style, execution_options)
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/sql/elements.py", line 332, in _execute_on_connection
  return connection._execute_clauseelement(
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1498, in _execute_clauseelement
  ret = self._execute_context(
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1862, in _execute_context
  self._handle_dbapi_exception(
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2043, in _handle_dbapi_exception
  util.raise_(
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
  raise exception
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1819, in _execute_context
  self.dialect.do_execute(
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
  cursor.execute(statement, parameters)
sqlalchemy.exc.InterfaceError: (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely)
(sqlite3.InterfaceError) Error binding parameter 2 - probably unsupported type.
[SQL: INSERT INTO meta_document (id, name, value, document_id, document_index) VALUES (?, ?, ?, ?, ?)]
[parameters: ('60afe526-9500-420d-8170-0aeff009e205', 'xmpMM:History:InstanceID', ['xmp.iid:f12cf1f4-997f-474e-b7e1-4eb15ac63f4f', 'xmp.iid:d8516db1-2277-c840-b5d8-12ba4225175d'], 'feb937070fb26bb3f3cf9f397a3531a5', 'document')]
(Background on this error at: https://sqlalche.me/e/14/rvf5)
```

Expected behavior: The workaround is simply to set the particular meta key to "ignore" in the TikaConverter.convert function. There should be a knob to ignore these metadata fields automatically.

FAQ Check Yes

System:

  • OS: Ubuntu 18.04
  • GPU/CPU: Nvidia Titan XP/Intel(R) Xeon(R) CPU E5-2697
  • Haystack version (commit or version number): ba08fc86f5871555537cc3766889e9cae8c5ad03
  • DocumentStore: FAISS
  • Reader: TransformersReader
  • Retriever: EmbeddingRetriever

sridhar avatar Jul 11 '22 23:07 sridhar

@tstadel @ZanSara I too have encountered this error in the past.

I see two main ways to address this issue (in the document store implementation):

  • Skip the metadata variables which are lists and alert the user with a warning
  • Translate these specific list variables into some string-based representation and alert the user with a warning

What do you think? Any other ideas or advice?
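For concreteness, both options could look something like the sketch below (`sanitize_meta` is a hypothetical helper for illustration, not part of Haystack's API):

```python
import json
import logging

logger = logging.getLogger(__name__)

# Value types that SQL-backed document stores can bind as parameters.
PRIMITIVE_TYPES = (str, int, float, bool)


def sanitize_meta(meta: dict, stringify: bool = False) -> dict:
    """Drop or JSON-encode metadata values that cannot be stored as-is."""
    clean = {}
    for key, value in meta.items():
        if value is None or isinstance(value, PRIMITIVE_TYPES):
            clean[key] = value
        elif stringify:
            # Option 2: translate the value into a string-based representation.
            clean[key] = json.dumps(value)
            logger.warning("Metadata field '%s' was converted to a JSON string", key)
        else:
            # Option 1: skip the value entirely.
            logger.warning(
                "Skipping metadata field '%s' of unsupported type %s",
                key, type(value).__name__,
            )
    return clean
```

With `stringify=False` the list-valued fields are dropped; with `stringify=True` they survive as strings, at the cost of losing their native type for filtering.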

anakin87 avatar Jul 17 '22 11:07 anakin87

After reading the SQL document store implementation (following @anakin87's post), I think it's possible to address this by changing the way document meta is stored:

  1. Store meta as a JSON data type (SQLAlchemy already supports this for Postgres and MySQL, for example).
  2. If the underlying database doesn't support native JSON, there would be two options:
     2.1. Throw a warning (fast to implement, but neither the best UX nor a properly designed solution).
     2.2. Store the value as a string (a JSON dump) and, after SQLAlchemy reads it back, convert it to a list again for further manipulation.

It could be handled inside the SQL document store implementation without any impact (code or behavior changes) outside of it.
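Option 2.2 could be sketched with SQLAlchemy's TypeDecorator (`JSONEncodedValue` is a hypothetical name, not an existing Haystack or SQLAlchemy class):

```python
import json

from sqlalchemy.types import Text, TypeDecorator


class JSONEncodedValue(TypeDecorator):
    """Store arbitrary metadata values as JSON text; decode them on read."""

    impl = Text
    cache_ok = True

    def process_bind_param(self, value, dialect):
        # Serialize on the way into the database.
        return json.dumps(value) if value is not None else None

    def process_result_value(self, value, dialect):
        # Deserialize on the way out, so callers see lists/dicts again.
        return json.loads(value) if value is not None else None
```

Using this type for the `value` column would keep the backing storage a plain string while round-tripping lists transparently.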

danielbichuetti avatar Jul 17 '22 12:07 danielbichuetti

As we can see here, https://github.com/deepset-ai/haystack/blob/baf5ef81f75518272e325c18964f21f4c08a082d/haystack/document_stores/sql.py#L93-L97

currently meta is stored in a table, where both name and value are strings.
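A self-contained sketch of that schema reproduces the failure (the `MetaDocument` model below is illustrative, not the exact Haystack ORM definition):

```python
from sqlalchemy import Column, String, create_engine, exc
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


# Illustrative stand-in for the meta table linked above.
class MetaDocument(Base):
    __tablename__ = "meta_document"
    id = Column(String(100), primary_key=True)
    name = Column(String(100))    # metadata key: always a string
    value = Column(String(1000))  # metadata value: also a string

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    # A plain string value binds fine...
    session.add(MetaDocument(id="1", name="author", value="jane"))
    session.commit()

    # ...but a list cannot be bound as a String parameter: the sqlite3
    # driver rejects it at flush time (the error seen in the traceback).
    session.add(MetaDocument(id="2", name="history", value=["a", "b"]))
    try:
        session.commit()
    except exc.StatementError:
        session.rollback()
        print("list-valued metadata cannot be stored")
```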

As @danielbichuetti suggests, storing the whole meta as JSON would provide more flexibility. On the other hand, this would radically change the current DB structure, probably also impacting the various filtering mechanisms.

Given these reflections, my previous proposals could be a good compromise:

  • Skip the metadata variables which are not strings and alert the user with a warning
  • Translate these specific variables into some string-based representation and alert the user with a warning

@tstadel @ZanSara what do you think? How is it best to deal with this issue?

anakin87 avatar Jul 19 '22 12:07 anakin87

@anakin87 Indeed, there will need to be some change to the DB structure, but the filtering mechanisms will keep working correctly (after some changes). Postgres, MySQL, MariaDB and SQL Server allow "filtering" (complex queries) inside the JSON data type. The main issue is compatibility with already existing databases, so this would count as a breaking change. Usually, when the DB structure changes, a Python migration script can be provided, so users would just allow Haystack to fix the DB for them. At some point such a script will need to be written anyway; it's almost impossible to avoid improvements forever out of a "don't touch the DB structure" culture.
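As a sketch of that JSON filtering (the `Doc` model and column names are hypothetical): SQLAlchemy's generic JSON type round-trips Python lists and dicts, and on Postgres/MySQL the indexing operator compiles to a native JSON expression, so the filter runs inside the database.

```python
from sqlalchemy import JSON, Column, Integer, create_engine, select
from sqlalchemy.dialects import postgresql
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


# Hypothetical document model with JSON metadata (not Haystack's schema).
class Doc(Base):
    __tablename__ = "doc"
    id = Column(Integer, primary_key=True)
    meta = Column(JSON)

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    # Lists and nested dicts round-trip transparently through the JSON column.
    session.add(Doc(id=1, meta={"author": "jane", "tags": ["nlp", "qa"]}))
    session.commit()

# On Postgres the indexing operator compiles to a native JSON path
# expression, so complex filters can be pushed down to the database:
stmt = select(Doc).where(Doc.meta["author"].as_string() == "jane")
print(stmt.compile(dialect=postgresql.dialect()))
```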

So, Haystack won't have to discard the metadata or fall back on a simple string data type, where filtering would be broken without custom code harder to write than the JSON change. Despite being less time-consuming to implement, I think that kind of behavior would lower user expectations of the framework. Of course, it would still be an improvement over the current implementation.

Furthermore, SQLite requires custom string-handling code for JSON filtering. Since the major SQL databases can have their implementation improved while SQLite won't fit, how about splitting this DocumentStore in the future?

  1. SQLiteDocumentStore - limited in functionality (e.g. the list issue).
  2. SQLDocumentStore - the improved current class.

danielbichuetti avatar Jul 19 '22 12:07 danielbichuetti

Just my 2c here: some of the metadata generated by these parsers is quite expansive, and people will not be interested in using SQLite for this. In a separate project, I use the likes of BigQuery and Spark for this kind of query. For now, I think a simple isinstance() check is good enough, and metadata with a lot of these list elements can simply be ignored.

sridhar avatar Jul 19 '22 19:07 sridhar

@sridhar Indeed, my thinking was that people who use SQLite are commonly doing small tests, and more complex metadata processing wouldn't be needed. Given its performance problems, SQLite is strongly discouraged for mainstream NLP scenarios (anything beyond tests or small scale) where the workload exceeds even some low profiles.

Some of the metadata generated by these parsers is quite expansive

Yes. Filtering metadata is one point where the main idea is already in the code. With JSON, it would still be easy to exclude fields in the hierarchy that aren't needed and keep the ones of interest for NLP.

#2809 is somewhat related to this subject. Implementing JSON data type support in the document store would allow storing lots of other information that could be of interest; the user would decide when building their pipeline.

I think that for now, doing a type check as a safety measure would be a good starting point. The error is simply bypassed, at the cost of losing some data.

Maybe a progressive change: first, current users could keep using this document store in such scenarios, knowing they will lose the metadata. Then, the split of the document stores (SQLite's capabilities are much more limited than the current mainstream databases), which would technically be a breaking change, since the document store name would change and users would have to update their code. And finally, improving the class to handle any kind of object in metadata, keeping the current features while adding support for complex objects.

danielbichuetti avatar Jul 19 '22 20:07 danielbichuetti

Hi @anakin87, thanks again for opening a PR for this issue! Is it safe to close this issue now?

sjrl avatar Jul 26 '22 08:07 sjrl

Hey @sjrl, thanks! For the moment, this error is resolved: if the metadata contains types that are not accepted by the SQLDocumentStore, the specific values are skipped and we raise a non-blocking error message. In any case, this issue contains interesting considerations about metadata handling, so it's up to you whether to close it or keep it open.

anakin87 avatar Jul 26 '22 09:07 anakin87

Hey, @anakin87 I see what you are saying. It sounds like there is still more to be done on this topic in the future. I think the best course of action here would be to open a new issue with the details of what should be done next and then close this issue. Otherwise this issue could get stale or confusing to new people reading it. What do you think?

sjrl avatar Jul 26 '22 10:07 sjrl

I want to add to this issue to keep track: a community member ran into this problem while using FAISSDocumentStore with TransformersDocumentClassifier. The problem is that the classification results are added to the metadata as a dictionary within a dictionary, but these SQL-based document stores don't accept them, which means you can't write the classifier's results back to your document store. @ZanSara @tstadel wdyt, is it best to add support for this in TransformersDocumentClassifier or to fix the issue in the DocumentStores?

TuanaCelik avatar Aug 08 '22 13:08 TuanaCelik

@sjrl I think that this issue can be closed as the problem has actually been solved. In any case, the interesting reflections we have made will remain here. What @TuanaCelik reported is addressed in another issue.

anakin87 avatar Sep 07 '22 21:09 anakin87