Concurrent DynamoDB Client Creating Excessive KMS API Calls
Describe the bug
When the DynamoDB client is used concurrently against a table encrypted with a customer-managed key, one KMS Decrypt API call is made per goroutine / thread instead of a single call for the client. More detail is described here: https://github.com/aws/aws-sdk-go-v2/discussions/2128.
Expected Behavior
A single DynamoDB client, used in any number of goroutines, results in at most 1 Decrypt API call every 5 minutes (based on the documentation here).
Current Behavior
A single DynamoDB client results in N Decrypt API calls every 5 minutes, where N grows with the number of goroutines. For example, if the client is used in 25 goroutines, then close to 25 Decrypt calls are made.
Reproduction Steps
https://github.com/aws/aws-sdk-go-v2/discussions/2128#discussioncomment-5899762
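For reference, a minimal sketch of the kind of concurrent usage described (the table name and key mirror the GetItem request shown later in this thread; the full reproduction is in the linked discussion):

```go
package main

import (
	"context"
	"sync"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

func main() {
	ctx := context.TODO()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		panic(err)
	}

	// One client shared by every goroutine.
	svc := dynamodb.NewFromConfig(cfg)

	var wg sync.WaitGroup
	for i := 0; i < 25; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each goroutine issues a GetItem against the CMK-encrypted table.
			_, err := svc.GetItem(ctx, &dynamodb.GetItemInput{
				TableName: aws.String("aws_sdk_go_v2_2128_concurrency"),
				Key: map[string]types.AttributeValue{
					"pk": &types.AttributeValueMemberS{Value: "b"},
				},
			})
			if err != nil {
				panic(err)
			}
		}()
	}
	wg.Wait()
}
```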
Possible Solution
No response
Additional Information/Context
No response
AWS Go SDK V2 Module Versions Used
github.com/aws/aws-sdk-go-v2 v1.18.0
github.com/aws/aws-sdk-go-v2/config v1.18.25
github.com/aws/aws-sdk-go-v2/credentials v1.13.24
github.com/aws/aws-sdk-go-v2/feature/dynamodb/attributevalue v1.10.25
github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.13.3
github.com/aws/aws-sdk-go-v2/internal/configsources v1.1.33
github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.4.27
github.com/aws/aws-sdk-go-v2/internal/ini v1.3.34
github.com/aws/aws-sdk-go-v2/service/dynamodb v1.19.7
github.com/aws/aws-sdk-go-v2/service/dynamodbstreams v1.14.11
github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.9.11
github.com/aws/aws-sdk-go-v2/service/internal/endpoint-discovery v1.7.27
github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.9.27
github.com/aws/aws-sdk-go-v2/service/sso v1.12.10
github.com/aws/aws-sdk-go-v2/service/ssooidc v1.14.10
github.com/aws/aws-sdk-go-v2/service/sts v1.19.0
github.com/aws/aws-sdk-go-v2/service/xray v1.16.11
Compiler and Version used
go1.19.6 linux/arm64
Operating System and version
Debian 11
Hi @jshlbrd ,
I see you opened a discussion and also an issue. I will close the discussion in favor of tracking it here.
I'm a bit confused by the problem you are describing. When creating a table in DynamoDB, you can specify which key you want to use for encryption at rest (whether it's AWS managed or a CMK), but the encryption happens on the service side, as described in this doc:
When you access an encrypted table, DynamoDB decrypts the table data transparently. You don't have to change any code or applications to use or manage encrypted tables. DynamoDB continues to deliver the same single-digit millisecond latency that you have come to expect, and all DynamoDB queries work seamlessly on your encrypted data.
In other words, the SDK does not know or care about encryption at rest for DynamoDB.
To make sure I'm not wrong, I provisioned a DynamoDB table with a CMK per your description:
$ aws dynamodb describe-table --table-name aws_sdk_go_v2_2128
{
    "Table": {
        "AttributeDefinitions": [
            {
                "AttributeName": "pk",
                "AttributeType": "S"
            }
        ],
        "TableName": "aws_sdk_go_v2_2128",
        "KeySchema": [
            {
                "AttributeName": "pk",
                "KeyType": "HASH"
            }
        ],
        "TableStatus": "ACTIVE",
        "CreationDateTime": "2023-05-15T15:46:02.519000-07:00",
        "ProvisionedThroughput": {
            "NumberOfDecreasesToday": 0,
            "ReadCapacityUnits": 1,
            "WriteCapacityUnits": 1
        },
        "TableSizeBytes": 0,
        "ItemCount": 0,
        "TableArn": "arn:aws:dynamodb:us-east-1:REDACTED:table/aws_sdk_go_v2_2128",
        "TableId": "REDACTED",
        "SSEDescription": {
            "Status": "ENABLED",
            "SSEType": "KMS",
            "KMSMasterKeyArn": "arn:aws:kms:us-east-1:REDACTED"
        },
        "TableClassSummary": {
            "TableClass": "STANDARD"
        },
        "DeletionProtectionEnabled": false
    }
}
Using the code you provided, I'm not seeing any network calls to KMS. You can enable client logs on your config like so:
ctx := context.TODO()
cfg, err := config.LoadDefaultConfig(ctx, config.WithClientLogMode(aws.LogResponseWithBody|aws.LogRequestWithBody))
if err != nil {
	panic(err)
}
You will see 25 network calls to the DynamoDB service, but no calls made to KMS.
Can you provide some more information about how you noticed the difference? From playing around with CloudTrail I wasn't able to identify any Decrypt calls made to KMS. It would be helpful if you could point me to where you are seeing the Decrypt calls being made so I can get a better understanding of the issue at hand.
Thanks, Ran~
Hey @RanVaknin, thanks for taking a look at this! The API calls are not made by the client; they're made by the DynamoDB service. However, the number of KMS calls made by DynamoDB appears to be a direct result of using concurrency with the Go client.
Referencing this documentation may help clarify (this is from the section about the GetItem API call):
DynamoDB caches the plaintext table keys for each principal in memory. If DynamoDB gets a request for the cached table key after five minutes of inactivity, it sends a new request to AWS KMS to decrypt the table key.
Here's an example CloudTrail event that shows the behavior I'm describing (private information removed or redacted):
{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "AssumedRole",
        "invokedBy": "dynamodb.amazonaws.com"
    },
    "eventTime": "2023-05-16T00:19:11Z",
    "eventSource": "kms.amazonaws.com",
    "eventName": "Decrypt",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "dynamodb.amazonaws.com",
    "userAgent": "dynamodb.amazonaws.com",
    "requestParameters": {
        "encryptionContext": {
            "aws:dynamodb:tableName": "aws_sdk_go_v2_2128_concurrency",
            "aws:dynamodb:subscriberId": `AWS ACCOUNT ID`
        },
        "encryptionAlgorithm": "SYMMETRIC_DEFAULT"
    },
    "responseElements": null,
    "requestID": `REQUEST ID`,
    "eventID": `EVENT ID`,
    "readOnly": true,
    "resources": [
        {
            "accountId": `AWS ACCOUNT ID`,
            "type": "AWS::KMS::Key",
            "ARN": `KMS Key ARN`
        }
    ],
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": `AWS ACCOUNT ID`,
    "eventCategory": "Management"
}
I see approximately 25 events like the above for the concurrent client. One thing worth noting: I'm using role assumption, and each Decrypt call has a unique access key ID. This is a guess, but is it possible that DynamoDB is treating each goroutine as its own client because of how authentication is handled when the client is used concurrently?
Hi @jshlbrd ,
Thanks for additional info.
One thing worth noting, I'm using role assumption and each Decrypt call has a unique access key ID
When you say "unique access key ID" are you referring to accountId in the cloud trail event?
How are you assuming the role? Are you reading it from env variables? I don't have a lot of confidence in this theory, but maybe using an explicit assume role call would solve this issue:
ctx := context.TODO()
cfg, err := config.LoadDefaultConfig(ctx, config.WithClientLogMode(aws.LogResponseWithBody|aws.LogRequestWithBody))
if err != nil {
	panic(err)
}

client := sts.NewFromConfig(cfg)
appCreds := stscreds.NewAssumeRoleProvider(client, "arn:aws:iam::REDACTED:role/role_name")
cfgCopy := cfg.Copy()
cfgCopy.Credentials = aws.NewCredentialsCache(appCreds)
svc := dynamodb.NewFromConfig(cfgCopy)
// ... more code here
When testing this code with request logging enabled, I can examine the request sent by each goroutine, and I notice in the Authorization header that all access keys are the same.
It should look something like this:
SDK 2023/05/16 11:40:33 DEBUG Request
POST / HTTP/1.1
Host: dynamodb.us-east-1.amazonaws.com
User-Agent: aws-sdk-go-v2/1.18.0 os/macos lang/go/1.19.1 md/GOOS/darwin md/GOARCH/arm64 api/dynamodb/1.19.7
Content-Length: 69
Accept-Encoding: identity
Amz-Sdk-Invocation-Id: REDACTED
Amz-Sdk-Request: attempt=1; max=3
Authorization: AWS4-HMAC-SHA256 Credential={{ACCESS KEY ID}}/20230516/us-east-1/dynamodb/aws4_request, SignedHeaders=accept-encoding;amz-sdk-invocation-id;amz-sdk-request;content-length;content-type;host;x-amz-date;x-amz-security-token;x-amz-target, Signature=REDACTED
Content-Type: application/x-amz-json-1.0
X-Amz-Date: 20230516T184033Z
X-Amz-Security-Token: REDACTED
X-Amz-Target: DynamoDB_20120810.GetItem
{"Key":{"pk":{"S":"b"}},"TableName":"aws_sdk_go_v2_2128_concurrency"}
This should behave the same as the implicit assume role you are probably doing by injecting environment variables, but I thought it was worth trying anyway. You should be able to see that all 25 requests contain the same "Credential=" value, pointing to the access key ID produced by the assume-role call.
Another theory I can think of is that, because you are sending all of these requests at once, the DynamoDB server is not able to cache effectively, resulting in multiple Decrypt calls. One way I would try to work around it is by sending a single GetItem request synchronously, and then sending the remaining calls via goroutines.
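A sketch of what that could look like, reusing the svc client and ctx from the snippet above (the table name and key come from the request log above; whether this actually changes the service-side caching is only a guess):

```go
// Hypothetical workaround sketch: warm DynamoDB's table-key cache with one
// synchronous GetItem before fanning out to goroutines.
input := &dynamodb.GetItemInput{
	TableName: aws.String("aws_sdk_go_v2_2128_concurrency"),
	Key: map[string]types.AttributeValue{
		"pk": &types.AttributeValueMemberS{Value: "b"},
	},
}
if _, err := svc.GetItem(ctx, input); err != nil {
	panic(err)
}

// Then send the remaining requests concurrently with the same client.
var wg sync.WaitGroup
for i := 0; i < 24; i++ {
	wg.Add(1)
	go func() {
		defer wg.Done()
		if _, err := svc.GetItem(ctx, input); err != nil {
			panic(err)
		}
	}()
}
wg.Wait()
```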
At any rate, this seems like a service-side limitation. If none of these methods work for you, we will have to escalate this to the DynamoDB team internally. I can do that on your behalf, or if you file a support ticket you'll be assigned a dedicated person who can help route the request to the service team; they usually have a few more tools than we do to expedite things. If you choose the latter, just make sure to have them CC me on the ticket using my alias rvaknin.
Thanks again, Ran~
Thanks @RanVaknin!
When you say "unique access key ID" are you referring to accountId in the cloud trail event?
Not quite; it's the accessKeyId seen in the CloudTrail events, which is unique for each call that originates from the Go client. For example:
{
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "REDACTED",
        "arn": "REDACTED",
        "accountId": "REDACTED",
        "accessKeyId": "REDACTED",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "REDACTED",
                "arn": "REDACTED",
                "accountId": "REDACTED",
                "userName": "REDACTED"
            }
        },
        "invokedBy": "dynamodb.amazonaws.com"
    }
}
Based on your suggestions I confirmed the following:
- Using environment variables for role assumption vs. explicitly assuming the role doesn't change the behavior
- Executing a synchronous GetItem call before starting the goroutines doesn't change the behavior
- Adding a 10-second sleep after the synchronous call to let the DynamoDB service cache "catch up" made no difference
If you can file a request with the DynamoDB team then that would help. My feeling is that, for some reason, the DynamoDB service does not recognize that the Go client is a single requester when it is used concurrently; it may be treating each goroutine as a unique client, which causes the key caching feature to fail.
Hi @jshlbrd,
Thanks for trying to implement the suggestions.
Can you please provide me with 2 separate x-amzn-requestid header values you get back from 2 of your concurrent requests? (you need to enable response logging to see that)
The service team will need that to identify your requests and investigate.
Thanks! Ran~
@RanVaknin Here are the 25 requests from my latest run with the concurrent requests:
X-Amzn-Requestid: 1S8R9EJTU8JUT0TL1L00EFD6TBVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: HONNCUGOU343ONRA4RNU0B1J8VVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: 0TUF3HN7UBEK3KNT9I26QLI6I3VV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: HAB7EABTT0VLBQ3V6QKPBO47NBVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: 6V1NKOOU0K5NCPLJLR90E3AO1RVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: 1472LU9G44L1Q8I45LUOVLTUQBVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: 2U23SATL73L7LD4MUESDK75L77VV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: F5FNLEE9TKGIN889KOH00F172RVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: 17M54JRIVU56V2G8THG290L6LBVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: OF04FB8S2BRN2UDB6H78IH4D07VV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: ITM8CCEB3KNN0D7N2OBGRSMVPVVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: NP4HUF0ARUMDL2I3HQVDTDJEHFVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: 2JJUGCIIAUQ61B43FNCCPEQGPRVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: 4CH4KM9SVQON1SMDS65C0UE4RNVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: EUP73Q4RP7ISVC0E318T3DE8KJVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: LBFVN786V0SLO8BF1DAKMCGM27VV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: AIM29AF8V7FMC3RUFEHOAJ6E4RVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: 316JO58K4BAKQQNE6FC8QREU2NVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: 8I0AOB0GMD0AETAJ1TLQLCCHFJVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: 6JKJ0I5HN1CVCKVPNRMP384B0RVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: 8N4S80T1SCD4KTRA69IL6Q1Q73VV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: 5Q89B5NMN7SOABCK5NCUK27VUFVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: R86UE8VVV8J4FKCUGK629IHHK3VV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: II3N7D8FKBQH9K52UNF34VG2VVVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: O1PNFDO6GN186A3FCTNLURRPHBVV4KQNSO5AEMVJF66Q9ASUAAJG
Hi @jshlbrd ,
Thanks for that. Just to triple check: when you enable request logging, do you see the same principal / access key ID sent on every request? The doc you linked specifies that the principal is what influences the caching mechanism.
The principal should look something like this: ASIA6ZK6KVPCPHI5DVVD
In the log I provided earlier:
SDK 2023/05/16 11:40:33 DEBUG Request
POST / HTTP/1.1
...
...
...
Authorization: AWS4-HMAC-SHA256 Credential={{ PRINCIPAL }}/20230516/us-east-1/dynamodb/aws4_request,
...
...
...
Is the principal sent with every request the same? (In my repro code it is.) If it is not, then that is what is throwing DynamoDB off. If it is the same, then the problem is most likely on the service side, as I previously mentioned.
Thanks again for your prompt responses.
All the best, Ran~
@RanVaknin That's correct, the principal is the same in every request. Thank you!
Hi @jshlbrd,
I created an internal ticket; now we have to wait for a response. My suggestion would be to use the sequential approach for now so you can save on billing costs.
Additionally, if you feel this is taking too long, feel free to reach out to AWS Support; they might be able to expedite this faster than I can.
I will keep you updated when I learn anything new. Thanks, Ran~
P88754518
Hi @jshlbrd ,
I finally heard back from the dynamo team:
I looked up the 25 request IDs, and those 25 requests were handled by 17 unique hosts, some of them handling more than 1 request.
When you kick off many concurrent requests to DDB, you're using HTTP/1.1, which means that each TCP connection can have 1 ongoing request at a time (usually true for HTTP/1.1 with some exceptions).
In this case the client opens many connections to different hosts, each of which may not have the appropriate KMS material in its cache. This in turn causes each unique host to make its own Decrypt request to KMS, which leads to the increased charges.
This is further explained by this little nuance (in bold) from the doc you shared:
Depending on the data being encrypted, DynamoDB identifies the right data encryption key to encrypt the data. To avoid having DynamoDB call KMS for every DynamoDB operation, the table key is cached for each principal in memory.
Since each goroutine may establish a new connection to a different host, in-memory caching happens per host.
To avoid the excess Decrypt calls, I'd stick with invoking those operations sequentially rather than concurrently.
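A rough sketch of that sequential pattern, assuming the same shared client, context, table name, and key as the earlier snippets:

```go
// Sequential GetItem calls typically reuse the client's idle keep-alive
// connection, so the requests land on far fewer DynamoDB hosts.
for i := 0; i < 25; i++ {
	_, err := svc.GetItem(ctx, &dynamodb.GetItemInput{
		TableName: aws.String("aws_sdk_go_v2_2128_concurrency"),
		Key: map[string]types.AttributeValue{
			"pk": &types.AttributeValueMemberS{Value: "b"},
		},
	})
	if err != nil {
		panic(err)
	}
}
```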
Thanks again for your patience.
All the best, Ran~
This issue is now closed. Comments on closed issues are hard for our team to see. If you need more assistance, please open a new issue that references this one.