Add support for Object storage

Open lalitpagaria opened this issue 3 years ago • 3 comments

Currently, Haystack supports storing data in Elasticsearch, InMemory, and RDBMS. It would be nice to add support for object storage like S3, which is very cheap and less of a hassle to maintain.

As a first step, AWS S3 could be supported, since it recently added the S3 Select option, which can retrieve only a subset of the data from an object (it currently supports CSV objects in compressed or uncompressed format).
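To make the S3 Select idea concrete, here is a minimal sketch of querying a CSV object with boto3's `select_object_content`. The bucket name, object key, column names, and the helper names `build_select_request` and `select_rows` are hypothetical and only for illustration; this is not an existing Haystack API, and it assumes boto3 is installed and AWS credentials are configured.

```python
import json


def build_select_request(bucket, key, sql, gzip_compressed=False):
    """Build keyword arguments for the S3 client's select_object_content call.

    Hypothetical helper: it only assembles the request, so it can be
    inspected without touching AWS.
    """
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": sql,
        "InputSerialization": {
            # Treat the first CSV row as a header so columns can be named in SQL.
            "CSV": {"FileHeaderInfo": "USE"},
            "CompressionType": "GZIP" if gzip_compressed else "NONE",
        },
        # Ask S3 to return matching rows as newline-delimited JSON.
        "OutputSerialization": {"JSON": {}},
    }


def select_rows(bucket, key, sql, gzip_compressed=False):
    """Run the query and return the matching rows as a list of dicts."""
    import boto3  # imported lazily; assumes boto3 and AWS credentials are available

    client = boto3.client("s3")
    response = client.select_object_content(
        **build_select_request(bucket, key, sql, gzip_compressed)
    )
    buffer = b""
    # The response payload is an event stream; only "Records" events carry data.
    for event in response["Payload"]:
        if "Records" in event:
            buffer += event["Records"]["Payload"]
    return [json.loads(line) for line in buffer.decode("utf-8").splitlines() if line]
```

A call might look like `select_rows("my-bucket", "docs.csv.gz", "SELECT s.id, s.content FROM S3Object s WHERE s.language = 'en'", gzip_compressed=True)`, where the bucket, key, and columns are placeholders. Only the selected columns of the matching rows are transferred, rather than the whole object.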

Ideally, we could also add a metadata service, which may help to use Haystack alongside data lakes.

lalitpagaria avatar Jan 20 '22 12:01 lalitpagaria

Hello @lalitpagaria! I'm not entirely sure, but I think we already had this discussion previously (I'll link the issue if I find it again) and we came to the conclusion that object stores like S3 simply did not have the features required to implement a Haystack document store. The addition of S3 Select might change things, but I imagine that's not the only feature that was missing, unfortunately. However, if I find we have no issue open on this topic yet, let's keep this one to track the status of these object stores in the future :slightly_smiling_face:

ZanSara avatar Jan 20 '22 13:01 ZanSara

@ZanSara and @lalitpagaria I think it was this discussion here. It seems you could use SQL syntax to query table-like files such as CSV, so this could be a feature of SQLDocumentStore. However, I doubt that the performance of the underlying system can compete with relational databases like Postgres. For retrieval you would need something like FAISS or Milvus on top anyway. I admit it would be interesting to see how fast S3 Select is compared to RDBMSs (I expect it to be fastest with Parquet files underneath instead of CSVs, due to Parquet's columnar format).

tstadel avatar Jan 20 '22 16:01 tstadel

@tstadel I agree with you on the performance aspect. But there are a few use cases where S3 can be a good alternative:

  • Very low infra cost
  • Very low operational overhead
  • Good candidate for searching in archives (cold documents which are rarely searched)
  • S3 is now strongly consistent
  • Cross-region replication is much smoother than with an RDBMS, which makes it easier to reduce network latency and improve availability
  • S3 data can also be kept encrypted, so there is less chance of data leaks
  • It can be a good candidate for storing multimodal data like images, audio, video, etc.

There are a few articles about using S3 as a database:

https://petewarden.com/2010/10/01/how-i-ended-up-using-s3-as-my-database/

https://dev.to/aws-builders/using-aws-s3-as-a-database-17l0

https://www.percona.com/blog/querying-archived-rds-data-directly-from-an-s3-bucket/

lalitpagaria avatar Jan 20 '22 16:01 lalitpagaria