Temporal server crashes once the AWS session token expires
Expected Behavior
Using the AWS RDS IAM Auth plugin and assuming an IAM role to generate the db password, I would expect that Temporal reconnects to the db.
Actual Behavior
With an IAM role whose session duration is 1h (the default value), Temporal crashes once the session expires and can no longer reach the DB.
From the feature implementation https://github.com/temporalio/temporal/pull/2830, I cannot see any mechanism used by Temporal to re-hydrate the token.
Steps to Reproduce the Problem
Run Temporal 1.17 with the process configured to assume a role through the rds-iam-auth auth plugin.
Specifications
- Version: 1.17
- Platform: k8s
Once an RDS connection is established it remains valid, even after the underlying session tokens have expired. On every new session the AWS SDK is invoked to get credentials, so no token is stored or cached and credentials are resolved at the time of session creation.
My hunch is that you're not using a rotating token mechanism that can be resolved by the AWS SDK. It would be helpful to show which credential mechanism you're using and the error logs you're seeing.
@gnz00 could you point to any resource that claims the connection remains valid even after the AWS session has expired? It's not mentioned anywhere on the AWS side.
I'm using the official EKS AWS OIDC provider https://aws.amazon.com/blogs/containers/introducing-oidc-identity-provider-authentication-amazon-eks/ which supports secret rotation automatically, so I doubt that is the issue.
The error I see is the following:
{"level":"error","ts":"2022-07-08T18:15:32.681Z","msg":"Operation failed with internal error.","error":"GetMetadata operation failed. Error: pq: PAM authentication failed for user \"temporal\"","metric-scope":55,"logging-call-at":"persistenceMetricClients.go:1424","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/common/persistence.(*metricEmitter).updateErrorMetric\n\t/home/builder/temporal/common/persistence/persistenceMetricClients.go:1424\ngo.temporal.io/server/common/persistence.(*metadataPersistenceClient).GetMetadata\n\t/home/builder/temporal/common/persistence/persistenceMetricClients.go:861\ngo.temporal.io/server/common/namespace.(*registry).refreshNamespaces\n\t/home/builder/temporal/common/namespace/registry.go:422\ngo.temporal.io/server/common/namespace.(*registry).refreshLoop\n\t/home/builder/temporal/common/namespace/registry.go:399\ngo.temporal.io/server/internal/goro.Go.func1\n\t/home/builder/temporal/internal/goro/goro.go:56"}
To add more context from official documents https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts-technical-overview.html
Here is the explanation from AWS of how the refresh works:
The kubelet requests and stores the token on behalf of the pod. By default, the kubelet refreshes the token if it is older than 80 percent of its total TTL, or if the token is older than 24 hours. You can modify the expiration duration for any account, except the default service account, with settings in your pod spec. For more information, see Service Account Token Volume Projection in the Kubernetes documentation
I think you're right - I'd switched to password auth as I couldn't wait for the 17.0 release. It seems sqlx does cache the DSN and reuses it on connection failures. I don't think either the Postgres or MySQL drivers support any form of auth plugin. I'll probably revert the original PR in favor of using a sidecar proxy.
To add to the issue, I get the same result even when using an AWS IAM user with an access key ID and secret access key.
@mazzy89 - is this issue resolved with 1.17.0 ? Are you able to rotate the password?
No. The author simply reverted the feature in 1.17. The feature can't land as it was implemented, for the reasons already explained.
Closing this issue, as AWS RDS IAM Auth is not currently supported.
@mazzy89 thanks for pulling on this thread. I was about to make a horrible mistake
@yiminc do you know of any plans to reintroduce this feature?
@yiminc Any plan to reintroduce this feature?
@CurryFishBalls9527 @sialm You would need to extend each driver to override the Connect method to always fetch a new token, here is an example for PG: https://github.com/aws/aws-sdk-go/issues/3043#issuecomment-581931580.
Alternatively, you could force maxConns to 1 and maybe recreate the Session for each store on a connection failure.
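A minimal sketch of the Connect-override idea, assuming the token can be fetched through a callback. `TokenFunc`, `refreshingConnector`, `buildDSN`, and `recordingDriver` are hypothetical names used for illustration, not Temporal or AWS SDK APIs. The stub driver stands in for a real Postgres or MySQL driver so the example runs without a database; in a real setup the callback would presumably ask the AWS SDK for a new RDS IAM auth token.

```go
package main

import (
	"context"
	"database/sql"
	"database/sql/driver"
	"errors"
	"fmt"
)

// TokenFunc returns a fresh, short-lived credential. Hypothetical hook: a real
// implementation would call the AWS SDK here instead of returning a fixed value.
type TokenFunc func(ctx context.Context) (string, error)

// refreshingConnector rebuilds the DSN for every physical connection, so an
// expired token is never replayed from a cached DSN.
type refreshingConnector struct {
	drv      driver.Driver
	token    TokenFunc
	buildDSN func(token string) string
}

// Connect is called by database/sql each time it opens a new connection.
func (c *refreshingConnector) Connect(ctx context.Context) (driver.Conn, error) {
	tok, err := c.token(ctx)
	if err != nil {
		return nil, fmt.Errorf("fetch auth token: %w", err)
	}
	return c.drv.Open(c.buildDSN(tok))
}

func (c *refreshingConnector) Driver() driver.Driver { return c.drv }

// recordingDriver is a stand-in for a real driver. It only records the DSNs
// it is asked to open, so the sketch runs without a database.
type recordingDriver struct{ dsns []string }

func (d *recordingDriver) Open(dsn string) (driver.Conn, error) {
	d.dsns = append(d.dsns, dsn)
	return stubConn{}, nil
}

// stubConn satisfies driver.Conn but cannot execute queries.
type stubConn struct{}

func (stubConn) Prepare(string) (driver.Stmt, error) { return nil, errors.New("not implemented") }
func (stubConn) Close() error                        { return nil }
func (stubConn) Begin() (driver.Tx, error)           { return nil, errors.New("not implemented") }

// demo opens two physical connections and returns the DSN used for each.
func demo() ([]string, error) {
	drv := &recordingDriver{}
	n := 0
	db := sql.OpenDB(&refreshingConnector{
		drv: drv,
		token: func(context.Context) (string, error) {
			n++ // pretend every call yields a newly issued token
			return fmt.Sprintf("token-%d", n), nil
		},
		buildDSN: func(tok string) string { return "user=temporal password=" + tok },
	})
	defer db.Close()

	ctx := context.Background()
	// Holding two connections at once forces two separate Connect calls.
	c1, err := db.Conn(ctx)
	if err != nil {
		return nil, err
	}
	defer c1.Close()
	c2, err := db.Conn(ctx)
	if err != nil {
		return nil, err
	}
	defer c2.Close()
	return drv.dsns, nil
}

func main() {
	dsns, err := demo()
	if err != nil {
		panic(err)
	}
	for _, dsn := range dsns {
		fmt.Println(dsn)
	}
}
```

Because `database/sql` calls `Connect` for every new physical connection, each one is opened with a freshly built DSN, while connections that are already established are left untouched.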
I would also like to know if there are plans to re-introduce this feature