vault icon indicating copy to clipboard operation
vault copied to clipboard

Speed up dynamodb List() by only getting keys

Open thomashargrove opened this issue 2 years ago • 1 comments

List() in physical.Backend only returns keys, so there is no need to retrieve values from DynamoDB. ProjectionExpression tells the query what fields we want, but since Key and Path are reserved words in DynamoDB we need to alias them using ExpressionAttributeNames. This change reduces the amount of traffic from DynamoDB to Vault.

Tested on my local instance, and I verified in a debugger that values are no longer on the wire. I don't see any unit tests covering dynamodb.List(), and making a full mock of dynamodb would be quite a bit of work.

thomashargrove avatar Jun 12 '23 21:06 thomashargrove

CLA assistant check
All committers have signed the CLA.

hashicorp-cla avatar Jun 12 '23 21:06 hashicorp-cla

Hi there @thomashargrove !

Sorry that this has lingered so long. I did some research and it looks like this approach makes sense. I'm not the biggest dynamodb expert but everything seems good and I'd love to get this in.

If you're still interested, we'd need a changelog associated with this PR. It would need to be named 21159.txt in the changelog directory, and something like storage/dynamodb: improved speed of list operation, or whatever you think is valuable to note (I'm not sure where hasChildren is used). You can look at other recent entries for inspiration.

It would also be great if you could show a before/after with the output of a command that uses this method, just for my own sanity. I'll also kick off CI just to make sure everything's green.

Thanks in advance!

VioletHynes avatar Jun 07 '24 14:06 VioletHynes

Thanks for taking a look at this. I made two build of vault, on with my changes and one without. I ran one at a time, both pointing to the same backend DynamoDB table. I pre-populated 500 secrets in one path with random data in them to give them some size.

On the build with my change I did a directory listing twice:

$ time vault kv list -tls-skip-verify kv/listtest/
...
real	0m2.255s
user	0m0.053s
sys	0m0.032s
$ time vault kv list -tls-skip-verify kv/listtest/
...
real	0m0.327s
user	0m0.047s
sys	0m0.023s

And using nettop I captured the bytes_in from the vault server before and after each list call:

$ nettop -P -L 1 -p 91974
time,,interface,state,bytes_in,bytes_out,rx_dupe,rx_ooo,re-tx,rtt_avg,rcvsize,tx_win,tc_class,tc_mgt,cc_algo,P,C,R,W,arch,
10:35:59.072582,vault.91974,,,503347,357841,1318,0,2136,,,,,,,,,,,,

$ nettop -P -L 1 -p 91974
time,,interface,state,bytes_in,bytes_out,rx_dupe,rx_ooo,re-tx,rtt_avg,rcvsize,tx_win,tc_class,tc_mgt,cc_algo,P,C,R,W,arch,
10:36:03.095878,vault.91974,,,643276,406383,1318,0,4772,,,,,,,,,,,,

$ nettop -P -L 1 -p 91974
time,,interface,state,bytes_in,bytes_out,rx_dupe,rx_ooo,re-tx,rtt_avg,rcvsize,tx_win,tc_class,tc_mgt,cc_algo,P,C,R,W,arch,
10:36:20.229671,vault.91974,,,645604,421238,1318,0,7408,,,,,,,,,,,,
 
$ nettop -P -L 1 -p 91974
time,,interface,state,bytes_in,bytes_out,rx_dupe,rx_ooo,re-tx,rtt_avg,rcvsize,tx_win,tc_class,tc_mgt,cc_algo,P,C,R,W,arch,
10:36:23.338528,vault.91974,,,773622,427347,1318,0,7408,,,,,,,,,,,,

The average bytes_in was 133,974.

The same test against Vault without my change:

$ time vault kv list -tls-skip-verify kv/listtest
real	0m2.329s
user	0m0.050s
sys	0m0.030s

$ time vault kv list -tls-skip-verify kv/listtest
real	0m0.384s
user	0m0.051s
sys	0m0.026s

and

$ nettop -P -L 1 -p 91525
time,,interface,state,bytes_in,bytes_out,rx_dupe,rx_ooo,re-tx,rtt_avg,rcvsize,tx_win,tc_class,tc_mgt,cc_algo,P,C,R,W,arch,
10:33:34.887110,vault.91525,,,828415,669808,1318,0,4835,,,,,,,,,,,,
$ nettop -P -L 1 -p 91525
time,,interface,state,bytes_in,bytes_out,rx_dupe,rx_ooo,re-tx,rtt_avg,rcvsize,tx_win,tc_class,tc_mgt,cc_algo,P,C,R,W,arch,
10:33:51.521955,vault.91525,,,1090191,810069,1318,0,6482,,,,,,,,,,,,

$ nettop -P -L 1 -p 91525
time,,interface,state,bytes_in,bytes_out,rx_dupe,rx_ooo,re-tx,rtt_avg,rcvsize,tx_win,tc_class,tc_mgt,cc_algo,P,C,R,W,arch,
10:34:32.048074,vault.91525,,,1067321,857685,1318,0,11580,,,,,,,,,,,,
 $ nettop -P -L 1 -p 91525
time,,interface,state,bytes_in,bytes_out,rx_dupe,rx_ooo,re-tx,rtt_avg,rcvsize,tx_win,tc_class,tc_mgt,cc_algo,P,C,R,W,arch,
10:34:35.509749,vault.91525,,,1288377,863724,1561,0,11580,,,,,,,,,,,,

Average bytes_in is 241,416. Overall almost a 50% reduction in bytes on the wire, but almost no change in latency. If I had a larger directory with bigger secrets I expect more speedup. (100KiB is pretty quick to move)

As for hasChildren() this is used for the recursive delete to clean up placeholder entries in DynamoDB. I verified it worked by creating a key with nested directories then removing it:

$ vault kv put -tls-skip-verify kv/deletetest/a/b/c/d k=v
======= Secret Path =======
kv/data/deletetest/a/b/c/d

======= Metadata =======
Key                Value
---                -----
created_time       2024-06-07T17:41:12.179886Z
custom_metadata    <nil>
deletion_time      n/a
destroyed          false
version            1

$ vault kv list -tls-skip-verify kv/deletetest/a/b/
Keys
----
c/

$ vault kv metadata delete -tls-skip-verify kv/deletetest/a/b/c/d
Success! Data deleted (if it existed) at: kv/metadata/deletetest/a/b/c/d

$ vault kv list -tls-skip-verify kv/deletetest/a/b/
No value found at kv/metadata/deletetest/a/b

thomashargrove avatar Jun 07 '24 17:06 thomashargrove

Thanks again, and thanks for your patience in awaiting a review for this :)

VioletHynes avatar Jun 07 '24 19:06 VioletHynes