Temporal server crashes once the AWS session token expires

Open mazzy89 opened this issue 2 years ago • 6 comments

Expected Behavior

Using the AWS RDS IAM Auth plugin and assuming an IAM role to generate the DB password, I would expect Temporal to reconnect to the DB.

Actual Behavior

With an IAM role whose session duration is 1h (the default value), Temporal crashes and can't reach the DB once the session has expired.

From the feature implementation https://github.com/temporalio/temporal/pull/2830, I cannot see any mechanism used by Temporal to re-hydrate the token.

Steps to Reproduce the Problem

Run Temporal 1.17 with the rds-iam-auth auth plugin configured to assume a role.

Specifications

  • Version: 1.17
  • Platform: k8s

mazzy89 avatar Jul 08 '22 15:07 mazzy89

Once an RDS connection is established, it remains valid even after the underlying session tokens have expired. On every new session the AWS SDK is invoked to get credentials, so no token is stored or cached; credentials are resolved at the time of session creation.

My hunch is that you're not using a rotating token mechanism that can be resolved by the AWS SDK. It would be helpful to show which credential mechanism you're using and the error logs you're seeing.

gnz00 avatar Jul 08 '22 17:07 gnz00

@gnz00 could you point to any resource that claims the connection remains valid even after the AWS session has expired? It's not mentioned anywhere on the AWS side.

I'm using the official EKS AWS OIDC provider https://aws.amazon.com/blogs/containers/introducing-oidc-identity-provider-authentication-amazon-eks/ which supports secret rotation automatically, so I doubt that is the issue.

mazzy89 avatar Jul 08 '22 18:07 mazzy89

The error I see is the following:

{"level":"error","ts":"2022-07-08T18:15:32.681Z","msg":"Operation failed with internal error.","error":"GetMetadata operation failed. Error: pq: PAM authentication failed for user \"temporal\"","metric-scope":55,"logging-call-at":"persistenceMetricClients.go:1424","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/common/persistence.(*metricEmitter).updateErrorMetric\n\t/home/builder/temporal/common/persistence/persistenceMetricClients.go:1424\ngo.temporal.io/server/common/persistence.(*metadataPersistenceClient).GetMetadata\n\t/home/builder/temporal/common/persistence/persistenceMetricClients.go:861\ngo.temporal.io/server/common/namespace.(*registry).refreshNamespaces\n\t/home/builder/temporal/common/namespace/registry.go:422\ngo.temporal.io/server/common/namespace.(*registry).refreshLoop\n\t/home/builder/temporal/common/namespace/registry.go:399\ngo.temporal.io/server/internal/goro.Go.func1\n\t/home/builder/temporal/internal/goro/goro.go:56"}

mazzy89 avatar Jul 08 '22 18:07 mazzy89

To add more context from the official documentation: https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts-technical-overview.html

Here is the explanation from AWS of how the refresh works:

The kubelet requests and stores the token on behalf of the pod. By default, the kubelet refreshes the token if it is older than 80 percent of its total TTL, or if the token is older than 24 hours. You can modify the expiration duration for any account, except the default service account, with settings in your pod spec. For more information, see Service Account Token Volume Projection in the Kubernetes documentation

mazzy89 avatar Jul 08 '22 19:07 mazzy89

I think you're right. I had switched to password auth because I couldn't wait for the 1.17.0 release. It seems sqlx caches the DSN and reuses it on connection failures. I don't think either the Postgres or MySQL driver supports any form of auth plugin. I'll probably revert the original PR in favor of using a sidecar proxy.

gnz00 avatar Jul 09 '22 01:07 gnz00

To add to the issue: I get the same result even when using an AWS IAM user with an access key ID and secret access key.

mazzy89 avatar Jul 09 '22 05:07 mazzy89

@mazzy89 is this issue resolved in 1.17.0? Are you able to rotate the password?

jaffarsadikk avatar Sep 09 '22 12:09 jaffarsadikk

No. The author simply reverted the feature in 1.17. It can't land as implemented, for the reasons already explained.

mazzy89 avatar Sep 09 '22 12:09 mazzy89

Closing this issue, as AWS RDS IAM Auth is not currently supported.

yiminc avatar Sep 10 '22 00:09 yiminc

@mazzy89 thanks for pulling on this thread. I was about to make a horrible mistake

@yiminc do you know of any plans to reintroduce this feature?

sialm avatar Oct 06 '22 21:10 sialm

@yiminc Any plan to reintroduce this feature?

CurryFishBalls9527 avatar May 23 '23 18:05 CurryFishBalls9527

@CurryFishBalls9527 @sialm You would need to extend each driver to override the Connect method so that it always fetches a new token. Here is an example for PG: https://github.com/aws/aws-sdk-go/issues/3043#issuecomment-581931580.

Alternatively, you could force maxConns to 1 and maybe recreate the Session for each store on a connection failure.

gnz00 avatar May 24 '23 21:05 gnz00

I would also like to know if there are plans to re-introduce this feature

dkravetz avatar Jun 28 '23 11:06 dkravetz