dynamodump
Implement read rate limiting
I implemented this feature because we needed to limit the read capacity used when performing a backup, to avoid throttling, errors, and degraded performance of the running application. I am submitting this pull request because I think other people might be interested in it.
I added a new option, --readCapacityRatio, which specifies the fraction of the table's total read capacity that will be used for the backup. Values in the interval (0.0, 1.0] should generally be used.
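For example, to use at most half of a table's provisioned read capacity (the table name and the 0.5 ratio below are just placeholder values):

python dynamodump.py -m backup -r eu-central-1 -s MyTable --dumpPath dump --readCapacityRatio 0.5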
This feature is implemented using two independent mechanisms:
- Calculate an optimal scan limit based on table size. Using this alone I could generally sleep for around a second between requests and be relatively satisfied, except that there will be an unnecessary wait when capacity is very high. There could also be problems when table items vary greatly in size. This change results in smaller .json file chunks than the standard 1 MB, which can lead to more requests and slightly higher read capacity consumption than the absolute minimum.
- Limit the request rate by sleeping for a calculated time based on the actually consumed capacity. The implementation is based on Guava's RateLimiter class (a sketch of both mechanisms follows below).
The implementation is inspired by / loosely based on this article: https://aws.amazon.com/fr/blogs/developer/rate-limited-scans-in-amazon-dynamodb/. I tried to structure the code logically into separate commits, and I also added unit tests.
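To illustrate the idea, here is a minimal Python sketch of both mechanisms. The helper names are mine, not the PR's, and it assumes an eventually consistent scan in which one read capacity unit covers roughly 8 KB of data:

```python
import time

def calculate_scan_limit(table_read_capacity, ratio, avg_item_size_bytes):
    # Mechanism 1: derive a scan Limit from the read capacity we are
    # allowed to spend per request. One eventually consistent read
    # unit covers roughly 8 KB, so estimate items per unit from the
    # average item size.
    allowed_units = table_read_capacity * ratio
    items_per_unit = 8192.0 / max(avg_item_size_bytes, 1.0)
    return max(1, int(allowed_units * items_per_unit))

def throttle(consumed_units, allowed_units_per_second):
    # Mechanism 2: after each scan page, sleep long enough that the
    # actually consumed capacity averages out to the allowed rate,
    # similar in spirit to Guava's RateLimiter.
    if allowed_units_per_second > 0:
        time.sleep(consumed_units / allowed_units_per_second)
```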
Please let me know what you think.
Sorry, I didn't notice there's a .travis.yml. I am working on a test fix.
Done.
@loomchild thanks for the PR, could you resolve the conflicts and I'll look at merging it, cheers!
Hi, sorry for the late reply, I will do it this week. I noticed that a main method was added to dynamodump.py - great, I will be able to drop my first commit.
Done.
I think there's a regression introduced in some of the recently merged commits regarding wildcards. When I specify the source table as Dev*:
python dynamodump.py -m backup -r eu-central-1 --prefixSeparator . -s Dev* --dumpPath dump
then I get the following reply, but nothing happens:
INFO:root:Found 6 table(s) in DynamoDB host to backup: Dev.Company, Dev.MemberOrganization, Dev.Prefix, Dev.Product, Dev.RequestLog, Dev.User
It seems like the AttributeError is never thrown, since srcTable is equal to Dev*. Do you want me to raise a separate issue for it?
@loomchild the regression you've mentioned has now been fixed after the merge of this PR: https://github.com/bchew/dynamodump/pull/36; tests have also been added as of https://github.com/bchew/dynamodump/commit/5e2de43d805af784edc938ffca4ef4ccbabf0adf. Unfortunately, your PR has conflicts again, would you be able to resolve them?
I've been testing this out, currently running with my own version that fixes the above conflict with the latest commits here.
It seems to work well: without providing any read capacity ratio, it automatically sets the limit to the table's maximum, whereas the previous default would go over the read capacity of my tables.
Setting a read capacity ratio mostly works, although I noticed one issue: if you provide a ratio that takes the capacity below 1, it errors. For example, I have a table with 3 read capacity and set the ratio to 0.25, so it tried to use a read capacity of 0.8, which caused an error:
INFO:root:Dumping table items for ENCODE_TEST, read capacity: 3.0, will use 0.8 for backup
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "dynamodump.py", line 630, in do_backup
    return_consumed_capacity="TOTAL")
  File "/home/ubuntu/.local/lib/python2.7/site-packages/boto/dynamodb2/layer1.py", line 2246, in scan
    body=json.dumps(params))
  File "/home/ubuntu/.local/lib/python2.7/site-packages/boto/dynamodb2/layer1.py", line 2842, in make_request
    retry_handler=self._retry_handler)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/boto/connection.py", line 954, in _mexe
    status = retry_handler(response, i, next_sleep)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/boto/dynamodb2/layer1.py", line 2882, in _retry_handler
    response.status, response.reason, data)
ValidationException: ValidationException: 400 Bad Request
{u'message': u"1 validation error detected: Value '0' at 'limit' failed to satisfy constraint: Member must have value greater than or equal to 1", u'__type': u'com.amazon.coral.validate#ValidationException'}
Not sure what the best solution is to this. In my case I'd want to set it to 1 rather than fail, but perhaps that would be a bad solution in other cases.
Hi guys, thanks for the info and testing, and sorry for the delay. I will look into this / work on it over the weekend.
Initially I wasn't able to reproduce the issue you described, @wjoe, because calculating the limit is a bit more complicated: it also depends on the average item size, not just the read capacity. In the end I was able to reproduce it, and I fixed it by setting the minimum scan limit to 1 element (see the sketch below).
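Paraphrasing the fix (this is not the commit's exact code): the computed per-request limit is clamped so it can never fall below DynamoDB's minimum Limit of 1.

```python
def clamp_scan_limit(computed_limit):
    # DynamoDB rejects Limit values below 1 (the ValidationException
    # above), so a fractional or zero limit is raised to 1 item.
    return max(1, int(computed_limit))
```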
@bchew I resolved the conflicts and retested the code.
@loomchild thanks for that. I've just gone through it again and I don't think the new tests you've added are hooked up to the Travis CI build (or am I missing something?). Would you be able to add them as well, or let me know how you're running them so I can add it in, thanks!
I finally managed to do it. In fact the change was far from trivial, because:
- The Travis Python path differs from the local one; the only way I found to run the unit tests and still be able to import dynamodump was to copy the tests to the root directory during the Travis build.
- There is an issue with boto - see my 7th commit and https://github.com/travis-ci/travis-ci/issues/7940 for more details.
- There was an issue with dynalite not being able to run from node_modules. I am not sure why it worked before this change. See my 8th commit for details.
In fact, the last two commits might be needed independently of my PR, since I believe the Travis build will fail on master if you run it now; you might want to apply them to master separately.
Please let me know what you think.
@loomchild sincere apologies that I've taken this long to respond 😞 - would you still be keen on getting your changes merged? I fully understand if you're not and would not like to proceed.
Hi @bchew,
OK, I will give it a try, although I don't work with DynamoDB anymore.
Hi @bchew. I am afraid I won't be able to work on this PR. A bit too much time has passed, I don't work with DynamoDB anymore (don't even have a working AWS account), I haven't developed in Python over the last few years, etc. It will be very difficult for me to re-start working on this PR at this point. Please don't hesitate to close it, or perhaps someone else from the community can help to adapt it to the current code version. I will try to answer questions about the code if needed.