terraform-provider-kafka

AWS MSK: first apply against a fresh cluster fails because the SASL/SCRAM secret association is delayed


Hello,

Everything described below is implemented within a single TF module that uses multiple providers and external modules, and the module is applied in one run. It is also important to note that this happens only on the first run (i.e. a fresh apply/install); if the user re-applies the stack, it succeeds. The re-apply can therefore be considered a workaround, but it is less than ideal.

The approximate TF apply sequence/scenario is (a simplified sketch of this wiring follows the list):

  1. create the MSK cluster (SASL/SCRAM auth enabled)
  2. create the AWS SM secret for the "kafka-provider" user
  3. associate the secret with the MSK cluster
  4. apply the Kafka provider resources (topics, ACLs, ...)
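For reference, the wiring looks roughly like this. This is a trimmed-down sketch with placeholder names and variables, not the exact module code:

```hcl
# 1. MSK cluster with SASL/SCRAM client authentication (other arguments trimmed;
#    var.* values are placeholders for this sketch)
resource "aws_msk_cluster" "this" {
  cluster_name           = "example"
  kafka_version          = "3.5.1"
  number_of_broker_nodes = 3

  client_authentication {
    sasl {
      scram = true
    }
  }

  broker_node_group_info {
    instance_type   = "kafka.m5.large"
    client_subnets  = var.subnet_ids
    security_groups = [var.security_group_id]
  }
}

# 2. Secrets Manager secret for the "kafka-provider" user
#    (MSK requires the name to start with AmazonMSK_ and a customer-managed KMS key)
resource "aws_secretsmanager_secret" "kafka_provider" {
  name       = "AmazonMSK_kafka-provider"
  kms_key_id = var.kms_key_id
}

resource "aws_secretsmanager_secret_version" "kafka_provider" {
  secret_id = aws_secretsmanager_secret.kafka_provider.id
  secret_string = jsonencode({
    username = "kafka-provider"
    password = var.kafka_provider_password
  })
}

# 3. Associate the secret with the cluster
resource "aws_msk_scram_secret_association" "this" {
  cluster_arn     = aws_msk_cluster.this.arn
  secret_arn_list = [aws_secretsmanager_secret.kafka_provider.arn]
}

# 4. Kafka provider resources (topics, ACLs, ...)
resource "kafka_topic" "a" {
  name               = "topic-a"
  partitions         = 3
  replication_factor = 3
}
```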

What happens is that even though the resource dependencies are clearly defined and properly detected by TF (see the provider-configuration sketch below), so each of the steps above executes one after another, there is simply not enough of a delay between steps 3 and 4 for MSK to register the secret association and allow the freshly associated SASL/SCRAM user to authenticate with the cluster.
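The implicit dependency comes from the Kafka provider configuration itself, which (roughly, with the same assumed names as in the sketch above) is fed from the cluster's SASL/SCRAM bootstrap brokers and the credentials stored in the secret:

```hcl
# The provider configuration pulls values from the MSK cluster and the secret
# password, so every kafka_* resource implicitly depends on steps 1-3 above.
provider "kafka" {
  bootstrap_servers = split(",", aws_msk_cluster.this.bootstrap_brokers_sasl_scram)
  tls_enabled       = true
  sasl_mechanism    = "scram-sha512"
  sasl_username     = "kafka-provider"
  sasl_password     = var.kafka_provider_password
}
```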

Let me add a few logs with timestamps to clarify the problem a bit more:

Terraform:

[2023-12-12T13:42:59.268200Z] module.xxx.msk-association: Creation complete after 1s
[2023-12-12T13:42:59.336920Z] module.xxx.msk-topic-A: Creating...
[2023-12-12T13:43:00.648721Z] module.xxx.msk-topic-A: Creation errored after 2s
[2023-12-12T13:43:01.053532Z] Error: kafka: client has run out of available brokers to talk to: 3 errors occurred:\n\t* EOF\n\t* EOF\n\t* EOF\n

MSK:

[2023-12-12 13:42:59,538] INFO Retrieving credential for user: kafka-provider [INTERNAL]
[2023-12-12 13:42:59,538] WARN Retrieved null credential for user: kafka-provider. [INTERNAL]
[2023-12-12 13:42:59,960] INFO Retrieving credential for user: kafka-provider [INTERNAL]
[2023-12-12 13:42:59,960] WARN Retrieved null credential for user: kafka-provider. [INTERNAL]
[2023-12-12 13:43:00,294] INFO Retrieving credential for user: kafka-provider [INTERNAL]
[2023-12-12 13:43:00,294] WARN Retrieved null credential for user: kafka-provider. [INTERNAL]
[2023-12-12 13:43:00,629] INFO Retrieving credential for user: kafka-provider [INTERNAL]
[2023-12-12 13:43:00,629] WARN Retrieved null credential for user: kafka-provider. [INTERNAL]

After a re-run approximately 10 minutes later, we can see that MSK got the kafka-provider secret and the apply from the Kafka provider went through:

[2023-12-12 13:53:06,635] INFO Retrieving credential for user: kafka-provider [INTERNAL]
[2023-12-12 13:53:06,683] INFO Retrieving credential for user: kafka-provider [INTERNAL]
[2023-12-12 13:53:07,615] INFO Processing Acl change notification for ResourcePattern(resourceType=TOPIC, name=*, patternType=LITERAL), versionedAcls : Set(User:kafka-provider has ALLOW permission for operations: ALL from hosts: *), zkVersion : 0 (kafka.security.authorizer.AclAuthorizer)

With #251 we got the ability to run plans when the brokers aren't available yet, which was a great improvement. But as of now, are there any suggested patterns/solutions to the problem described above? Is it "expected" to run the Kafka provider in a separate run/pipeline? Maybe somehow delay the execution?

Also worth noting that AWS only guarantees that the secret is associated within 10 minutes, so the "required" delay might be as much as 10 minutes long.
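The only mitigation I can think of so far is an artificial delay between the secret association and the first Kafka provider resource, e.g. via the hashicorp/time provider's time_sleep resource. Again just a sketch, reusing the placeholder resource names from the example above, with the 10m duration matching that worst-case guarantee:

```hcl
# Hypothetical workaround: wait after the secret association before touching
# any kafka_* resource. 10m matches the documented worst-case association time.
resource "time_sleep" "wait_for_scram_association" {
  depends_on      = [aws_msk_scram_secret_association.this]
  create_duration = "10m"
}

resource "kafka_topic" "a" {
  depends_on = [time_sleep.wait_for_scram_association]

  name               = "topic-a"
  partitions         = 3
  replication_factor = 3
}
```

But that unconditionally adds 10 minutes to every fresh install, so I'd be happy to hear about a better pattern.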

eroznik · Dec 12 '23 14:12