haystack Add support for Object storage

Currently, Haystack supports storing data in Elasticsearch, InMemory, and RDBMS. It would be nice to add support for object storage like S3, which is very cheap and less of a hassle to maintain.

As a first step, AWS S3 could be supported, since it recently added the S3 Select option, which can retrieve only a subset of the data in an object (it currently supports CSV file objects in compressed or uncompressed format).
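To make the S3 Select idea concrete, here is a minimal sketch of how a subset of rows could be pulled from a CSV object. It is only an illustration, not a proposed Haystack API: the function names, the `id`/`text` columns, and the bucket and key are hypothetical, and `client` is assumed to be a boto3 S3 client (for example `boto3.client("s3")`), whose `select_object_content` call returns an event stream under `Payload`.

```python
import json
from typing import Any, Dict, Iterable, List


def build_select_request(bucket: str, key: str, sql: str, gzip: bool = False) -> Dict[str, Any]:
    """Build the keyword arguments for the S3 client's select_object_content call."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": sql,
        "InputSerialization": {
            # Treat the first CSV row as a header so columns can be used by name.
            "CSV": {"FileHeaderInfo": "USE"},
            "CompressionType": "GZIP" if gzip else "NONE",
        },
        # Ask for one JSON object per line in the response.
        "OutputSerialization": {"JSON": {}},
    }


def read_records(event_stream: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Collect the Records events of a select response into a list of dicts."""
    chunks = []
    for event in event_stream:
        # Other event types (Stats, Progress, End) carry no row data.
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    # A row may be split across two events, so join the bytes before decoding.
    text = b"".join(chunks).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def select_documents(client: Any, bucket: str, key: str, sql: str, gzip: bool = False) -> List[Dict[str, Any]]:
    """Run an S3 Select query and return the matching rows (hypothetical helper)."""
    response = client.select_object_content(**build_select_request(bucket, key, sql, gzip))
    return read_records(response["Payload"])
```

A call could look like `select_documents(client, "my-bucket", "docs.csv.gz", "SELECT s.id, s.text FROM S3Object s WHERE s.id = '42'", gzip=True)`. Only the matching rows travel over the network, which is the main appeal compared to downloading the whole object.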

Ideally, we could also add a metadata service, which might help in using Haystack alongside data lakes.

Jan 20 '22 12:01 lalitpagaria

Hello @lalitpagaria! I'm not entirely sure, but I think we already had this discussion previously (I'll link the issue if I find it again) and we came to the conclusion that object stores like S3 simply did not have the features required to implement a Haystack document store. The addition of S3 Select might change things, but I imagine that's not the only feature that was missing, unfortunately. However, if it turns out we have no issue open on this topic yet, let's keep this one to keep track of the status of these object stores in the future :slightly_smiling_face:

Jan 20 '22 13:01 ZanSara

@ZanSara and @lalitpagaria I think it was this discussion here. It seems you could use SQL syntax to query table-like files such as CSV, so this could be a feature of SQLDocumentStore. However, I doubt that the performance of the underlying system can compete with relational databases like Postgres. For retrieval you would need something like FAISS or Milvus on top anyway. I admit it would be interesting to see how fast S3 Select is compared to RDBMSs (I expect it to be fastest with Parquet files underneath instead of CSVs, due to Parquet's columnar format).

Jan 20 '22 16:01 tstadel

@tstadel I agree with you on the performance aspect. But there are a few use cases where S3 can be a good alternative:

Very low infrastructure cost
Very little operational overhead
Good candidate for searching in archives (cold documents that are rarely searched)
It is now strongly consistent
Cross-region replication is very smooth compared to an RDBMS, which makes it easy to reduce network latency and improve availability
S3 data can also be kept encrypted, so there is less chance of data leaks
It can be a good candidate for storing multimodal data such as images, audio, and video

There are a few articles about using S3 as a database:

https://petewarden.com/2010/10/01/how-i-ended-up-using-s3-as-my-database/

https://dev.to/aws-builders/using-aws-s3-as-a-database-17l0

https://www.percona.com/blog/querying-archived-rds-data-directly-from-an-s3-bucket/

Jan 20 '22 16:01 lalitpagaria
