aws-sdk-cpp
[aws-cpp-sdk-core]: increase STS reliability and retries
This fixes issues we have repeatedly experienced when using STS for authentication in a large Kubernetes cluster with STS under heavy load:
- The default connect timeout of 1s is too low. Connections occasionally slow down, for example when kube DNS is under very high load. A value of 30 seconds has proven to be robust.
- The default retry settings give up too quickly: authentication would frequently fail whenever STS was under higher load. The new retry settings have worked in production for about 2 years.
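
To illustrate why short retry parameters fail under load, here is a minimal sketch of how a generic exponential backoff schedule adds up. It is not the SDK's code and does not use the values from this PR: `RetrySettings`, `delayBeforeRetry`, `totalRetryWindow`, and all numbers are hypothetical and chosen only for illustration.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>

// Hypothetical retry settings (names and values are assumptions for
// illustration; they are not taken from the SDK or from this PR).
struct RetrySettings {
    int maxRetries;                       // number of retries after the first attempt
    std::chrono::milliseconds baseDelay;  // delay before the first retry
    std::chrono::milliseconds maxDelay;   // upper bound for any single delay
};

// Delay before retry number `attempt` (0-based): baseDelay * 2^attempt,
// capped at maxDelay. The shift is clamped to avoid overflow.
std::chrono::milliseconds delayBeforeRetry(const RetrySettings& s, int attempt) {
    const int shift = std::min(std::max(attempt, 0), 20);
    const std::chrono::milliseconds delay = s.baseDelay * (std::int64_t{1} << shift);
    return std::min(delay, s.maxDelay);
}

// Total time spent waiting across all retries. If the service needs longer
// than this to recover, the caller gives up and authentication fails.
std::chrono::milliseconds totalRetryWindow(const RetrySettings& s) {
    std::chrono::milliseconds total{0};
    for (int attempt = 0; attempt < s.maxRetries; ++attempt) {
        total += delayBeforeRetry(s, attempt);
    }
    return total;
}
```

With these illustrative numbers, 3 retries starting at 25 ms cover only 175 ms of waiting, while 6 retries starting at 500 ms cover 31.5 s, which leaves far more room for a briefly overloaded service to recover.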
Check all that apply:
- [X] Did a review by yourself.
- [X] Added proper tests to cover this PR. (If tests are not applicable, explain.) We have before/after experience: without these settings, calls to STS would frequently time out whenever STS was under higher load. In the nearly 2 years since making this change, we have no longer experienced authentication failures when STS was under load.
- [X] Checked if this PR is a breaking (APIs have been changed) change.
- [X] Checked if this PR will not introduce cross-platform inconsistent behavior.
- [X] Checked if this PR would require a ReadMe/Wiki update.
Check which platforms you have built SDK on to verify the correctness of this PR.
- [X] Linux
- [ ] Windows
- [ ] Android
- [ ] MacOS
- [ ] IOS
- [ ] Other Platforms
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.