Proposal: DDB storage layer

Open edholland opened this issue 5 years ago • 6 comments

Hi,

It would be nice to have DynamoDB as a storage layer. I have implemented a basic version of this within storage.py. Could someone please take a look to see whether this is something the project would be willing to incorporate, and whether my implementation approach is sensible? I don't believe this is "production ready" quality code yet, but I think early feedback is useful!

edholland avatar Jun 06 '19 16:06 edholland

This is interesting. Why DynamoDB?

ekzhu avatar Jun 06 '19 22:06 ekzhu

It's an awful lot cheaper than the Redis instances we're using at the moment, and there's practically zero management overhead to boot!

edholland avatar Jun 07 '19 10:06 edholland

A few things to consider:

  1. LSH's query method probes multiple hash tables. Is this going to be efficient for DynamoDB (think network latency) compared to an in-memory Redis instance on the same node as the LSH?
  2. What is the memory footprint of your index, and why not use the pickle package to save the index to disk periodically?
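
To make the first point concrete, here is a rough sketch of why probing several hash tables gets expensive once each lookup becomes a network round trip. Everything in it is hypothetical and for illustration only: `CountingStorage`, `query_sequential`, and `query_batched` are made-up names, not part of datasketch or of the proposed storage.py, and the "remote store" is a plain in-memory dict that only counts calls.

```python
from collections import defaultdict


class CountingStorage:
    """Hypothetical stand-in for a remote key-value store.

    It keeps everything in memory and only counts how many
    round trips a real networked store would have needed.
    """

    def __init__(self):
        self._tables = defaultdict(dict)  # table index -> {bucket: set of keys}
        self.round_trips = 0

    def insert(self, table, bucket, key):
        self._tables[table].setdefault(bucket, set()).add(key)

    def get(self, table, bucket):
        # Models one network round trip per lookup.
        self.round_trips += 1
        return set(self._tables[table].get(bucket, ()))

    def batch_get(self, lookups):
        # Models a single round trip for a batch of (table, bucket) pairs.
        self.round_trips += 1
        result = set()
        for table, bucket in lookups:
            result |= self._tables[table].get(bucket, set())
        return result


def query_sequential(storage, band_hashes):
    """Probe one hash table at a time: one round trip per band."""
    candidates = set()
    for table, bucket in enumerate(band_hashes):
        candidates |= storage.get(table, bucket)
    return candidates


def query_batched(storage, band_hashes):
    """Probe all hash tables in one batched request."""
    return storage.batch_get(list(enumerate(band_hashes)))


if __name__ == "__main__":
    storage = CountingStorage()
    for table, bucket in enumerate(["a", "b", "c", "d"]):
        storage.insert(table, bucket, "doc1")
    for table, bucket in enumerate(["a", "x", "y", "z"]):
        storage.insert(table, bucket, "doc2")

    query = ["a", "b", "q", "r"]
    print(sorted(query_sequential(storage, query)), storage.round_trips)
    storage.round_trips = 0
    print(sorted(query_batched(storage, query)), storage.round_trips)
```

With four bands, the sequential query costs four round trips and the batched one costs one, while both return the same candidates. DynamoDB does offer a `BatchGetItem` operation, but whether the index layout in storage.py can use it is an assumption here, not something established in this thread.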

ekzhu avatar Jun 07 '19 12:06 ekzhu

  1. Yeah, it's not overly performant: overall, requests take about 500 ms. However, this is "good enough" for our use case.
  2. Concurrency, mostly. We're going to be running this within AWS Lambda functions, so there is no shared filesystem. It seems easiest to move the state out of the application layer and into the database layer.

edholland avatar Jun 07 '19 12:06 edholland

I see. This is quite interesting. Please give us some time.

ekzhu avatar Jun 13 '19 01:06 ekzhu

Very interesting. @edholland, can you recommend a DynamoDB emulator to help with the dev setup and CI?

amirouche avatar Sep 29 '23 16:09 amirouche