dynamodb-replicator
Plain-text keys as S3 filenames, and grouping of S3 files by a multitenancy column
Some features I needed for my project. I'd love to hear your thoughts and whether you're interested in these features.
Also updated to be compatible with the Node 4 runtime in Lambda.
Hi! Could you explain a little more what the use-case is for what you're calling a MultiTenancy column? I'm seeing that you're sending data to slightly different S3 locations?
As for the clear-text S3 keys, I would advise against this -- we implemented the hashed filenames as a way to add randomness to the S3 keys. Without this randomness, S3 can run into some very hard throughput limitations that can cripple the incremental backup if write loads on your dynamo table are above ~400 per second. See http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html for more information.
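Roughly what the hashed naming buys you, as a minimal sketch (the table prefix and key serialization here are assumptions, not this repo's exact code):

```js
var crypto = require('crypto');

// `dynamoKeyString` is the record's key serialized to a stable string,
// e.g. '{"id":{"S":"client-42"}}' -- an assumed format for this sketch.
function hashedS3Key(tablePrefix, dynamoKeyString) {
    // md5 yields a uniformly distributed hex digest, so object keys spread
    // evenly across S3 partitions no matter what the dynamo keys look like
    var hash = crypto.createHash('md5').update(dynamoKeyString).digest('hex');
    return tablePrefix + '/' + hash;
}

// hashedS3Key('backups/my-table', '{"id":{"S":"client-42"}}')
// => 'backups/my-table/<32-char hex digest>'
```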
Sorry, I should have provided some context. So an explanation:
- The first feature was the MultiTenancy column. This allows us to "group" incremental backups under dynamic prefixes within S3. In a multitenancy scenario, this lets us separate client data and easily migrate one client's data between environments, i.e. move all of a client's data from UAT to Production or vice versa.
- The second feature, clear-text S3 keys, was to let us easily identify the row we are looking for, as opposed to generating an MD5 of the key and correlating it to an S3 key. The S3 throughput limitation wasn't something I was aware of, but we use GUIDs (v4 in most cases), so they are fairly randomised, which I expect to have a similar effect to MD5. In cases where we don't have GUIDs, I'm assuming we can fall back on MD5. Both options are sketched below.
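To make the proposal concrete, here is a rough sketch of how the two options could shape an S3 key; the option names and record shape are invented for illustration and are not the PR's actual config:

```js
var crypto = require('crypto');

// Hypothetical option names, for illustration only.
var options = {
    plainTextKeys: true,            // store the serialized dynamo key verbatim
    multiTenancyColumn: 'tenantId'  // group objects under a per-tenant prefix
};

function s3Key(tablePrefix, record, dynamoKeyString) {
    var parts = [tablePrefix];
    if (options.multiTenancyColumn) {
        // e.g. 'acme-corp' -- all of one client's records share a prefix,
        // which makes per-client export/migration a simple prefix copy
        parts.push(record[options.multiTenancyColumn].S);
    }
    parts.push(options.plainTextKeys
        ? encodeURIComponent(dynamoKeyString) // human-readable, greppable
        : crypto.createHash('md5').update(dynamoKeyString).digest('hex'));
    return parts.join('/');
}
```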
Both these features could be worked around, but this just makes life a little easier. I was interested to find out whether you are keen to have these merged in (this PR hasn't been reviewed; just worth starting the conversation).
The last update I made was to leverage the Node v4 runtime on Lambda.
I'm hesitant about both of these scenarios because of the potential to cause S3 throttling.
- S3 throughput is controlled by partitioning -- each partition can support only so much throughput. The mapping from keys --> partitions is entirely dependent on the characters in the object keys. By grouping database records under particular prefixes (the MultiTenancy column here), you're opening the door for a particularly "hot" client to lead to a particularly hot S3 partition. This can lead to S3 throttling writes across your bucket while the hot partition is split into more partitions in order to handle the load.
- Clear-text keys are nice, but if you don't use GUIDs or randomly distributed IDs in dynamo, then you again run the same risk of hot S3 partitions. Have you looked at the CLI tools to check an incremental record on S3 vs. its state in DynamoDB, or to look up the history of S3 incremental record versions for a particular DynamoDB key? I wonder if either of these tools can help your use-case? (A conceptual sketch of that kind of check follows this list.)
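For reference, a conceptual sketch of the kind of check those tools perform (placeholder bucket and table names, and a naive body comparison -- not the actual CLI code):

```js
var AWS = require('aws-sdk');
var crypto = require('crypto');
var s3 = new AWS.S3();
var dynamo = new AWS.DynamoDB();

// Compare an incremental record on S3 against the live DynamoDB item.
// 'my-backups' and 'my-table' are placeholders; the stored body is assumed
// to be the DynamoDB JSON of the item.
function checkRecord(key, callback) {
    var hash = crypto.createHash('md5').update(JSON.stringify(key)).digest('hex');
    s3.getObject({ Bucket: 'my-backups', Key: 'my-table/' + hash }, function (err, obj) {
        if (err) return callback(err);
        dynamo.getItem({ TableName: 'my-table', Key: key }, function (err, data) {
            if (err) return callback(err);
            // naive string comparison; real tooling would diff more carefully
            callback(null, JSON.stringify(data.Item) === obj.Body.toString());
        });
    });
}
```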
Hey Ryan,
Absolutely understand your concerns regarding the throttling; that is the reason I've left these as options that can be opted into, as opposed to on by default.
The idea behind the existing prefix option is similar -- it can also cause throttling issues. MultiTenancyColumn (a bad name) can be viewed as a dynamic version of that prefix, derived from the data.
MD5 is a great solution to the throttling problem, but only if you don't use the prefix option. The problem it creates is correlating dynamo keys to their S3 keys once records have been deleted -- it could be impossible if you don't know the entire key itself.
I have had a look at the tools you linked, and they can be useful for generating an MD5 of a specified key (though I'm not sure whether they would work for deleted records?).
We are aiming to use this as a DR solution, which enables us to solve problems where a developer (or a security breach) accidentally deletes/updates records (or tables). We would need to roll back to a point in time, as opposed to knowing the specific key(s) we need to restore.
Out of interest, how reliable has the replicator tool been for you in terms of incremental backups to S3?
Abhaya
We implement versioning on the S3 bucket where incremental backups land. With this in hand, the CLI tool is capable of finding the complete history of any dynamodb record, including deleted ones.
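For example, with versioning enabled, the full history of one record's hashed key can be pulled back along these lines (placeholder names; a sketch, not the CLI tool itself):

```js
var AWS = require('aws-sdk');
var s3 = new AWS.S3();

// recordHash would be the md5 of the dynamo key; bucket/prefix are placeholders.
var recordHash = '0123456789abcdef0123456789abcdef';
s3.listObjectVersions({
    Bucket: 'my-backups',
    Prefix: 'my-table/' + recordHash
}, function (err, data) {
    if (err) throw err;
    // data.Versions: one entry per write to the record, newest first
    // data.DeleteMarkers: one entry per delete, so deletions are recoverable too
    data.Versions.forEach(function (v) {
        console.log(v.VersionId, v.LastModified, v.IsLatest);
    });
});
```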
Further, we run a separate process that routinely scans the S3 incremental backup and rolls results into a single file. We call it a "snapshot" because it roughly represents the state of the entire table at some point in time. See https://github.com/mapbox/dynamodb-replicator/blob/master/s3-snapshot.js.
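The rollup boils down to something like the following heavily simplified sketch -- no streaming, gzip, or bounded concurrency, all of which the real s3-snapshot.js handles, and the names here are placeholders:

```js
var AWS = require('aws-sdk');
var s3 = new AWS.S3();

// Walk every key under the incremental prefix and concatenate the record
// bodies into one snapshot object.
function snapshot(bucket, prefix, done) {
    var lines = [];
    (function page(token) {
        s3.listObjectsV2({ Bucket: bucket, Prefix: prefix, ContinuationToken: token }, function (err, res) {
            if (err) return done(err);
            var pending = res.Contents.length;
            if (!pending) return finish();
            res.Contents.forEach(function (item) {
                s3.getObject({ Bucket: bucket, Key: item.Key }, function (err, obj) {
                    if (err) return done(err);
                    lines.push(obj.Body.toString());
                    if (!--pending) finish();
                });
            });
            function finish() {
                if (res.IsTruncated) return page(res.NextContinuationToken);
                // write outside the scanned prefix so later runs don't pick it up
                s3.putObject({ Bucket: bucket, Key: 'snapshots/latest', Body: lines.join('\n') }, done);
            }
        });
    })();
}
```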
These files give us the ability to roll back the entire table to a previous state, though we are more inclined to roll back individual records if need be, using S3 versioning and history.
Out of interest, how reliable has the replicator tool been for you in terms of incremental backups to S3?
👌 we love it. We've yet to encounter any evidence of data that was dropped from the dynamodb stream --> lambda --> s3 pipeline.