
MSK: operation error Kafka: CreateCluster, https response error StatusCode: 403

Open kappa8219 opened this issue 11 months ago • 7 comments

Describe the bug

arn:aws:iam::aws:policy/AmazonMSKFullAccess attached with Pod Identity results in:

{
  "level": "error",
  "ts": "2025-05-09T05:31:01.021Z",
  "msg": "Reconciler error",
  "controller": "cluster",
  "controllerGroup": "kafka.services.k8s.aws",
  "controllerKind": "Cluster",
  "Cluster": {
    "name": "cluster-name",
    "namespace": "ack-system"
  },
  "namespace": "ack-system",
  "name": "x",
  "reconcileID": "7680a7be-2523-4689-9268-0c04a18db412",
  "error": "operation error Kafka: CreateCluster, https response error StatusCode: 403, RequestID: 3bba50f8-f56f-4d73-a50f-23eef5249e01, api error AccessDeniedException: User: xxx is not authorized to perform: kafka:CreateCluster on resource: *",
  "stacktrace": "sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:347\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:294\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:255"
}

Steps to reproduce

Expected outcome: the cluster is created.

Environment

  • Kubernetes version 1.31
  • Using EKS - yes
  • AWS service - MSK

kappa8219 avatar May 09 '25 05:05 kappa8219

Hello @kappa8219 👋 Thank you for opening an issue in ACK! A maintainer will triage this issue soon.

We encourage community contributions, so if you're interested in tackling this yourself or suggesting a solution, please check out our Contribution and Code of Conduct guidelines.

You can find more information about ACK on our website.

github-actions[bot] avatar May 09 '25 05:05 github-actions[bot]

Same as the closed issue #2074.

kappa8219 avatar May 09 '25 05:05 kappa8219

Hi! Please kindly check with full admin perms.

gecube avatar May 12 '25 07:05 gecube

> Hi! Please kindly check with full admin perms.

It is strange, but the result is still the same with the policy arn:aws:iam::aws:policy/AdministratorAccess:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "*",
            "Resource": "*"
        }
    ]
}
{
  "level": "error",
  "ts": "2025-05-13T07:42:48.221Z",
  "msg": "Reconciler error",
  "controller": "cluster",
  "controllerGroup": "kafka.services.k8s.aws",
  "controllerKind": "Cluster",
  "Cluster": {
    "name": "xxxx",
    "namespace": "ack-system"
  },
  "namespace": "ack-system",
  "name": "xxxxx",
  "reconcileID": "337fb40f-400e-4d38-92c7-e7e35b418320",
  "error": "operation error Kafka: CreateCluster, https response error StatusCode: 403, RequestID: 3a37877e-b0d6-428f-8d9e-63b5e1d42f17, api error AccessDeniedException: User: arn:aws:sts::xxxx:assumed-role/ack-controllers-kafka/eks-terra-clus-ack-contro-9cd673e7-892d-4e7d-8dac-ee1119aa67b0 is not authorized to perform: kafka:CreateCluster on resource: *",
  "stacktrace": "sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:347\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:294\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:255"
}

kappa8219 avatar May 13 '25 07:05 kappa8219

@kappa8219 did you restart the controller after changing the permissions? Very strange, as it worked for me two weeks ago.

gecube avatar May 13 '25 09:05 gecube

@gecube Sure, restarted. I use Pod Identity as the role-attachment mechanism. All other controllers assume roles the same way, and they all work fine.

kappa8219 avatar May 13 '25 09:05 kappa8219

Ah... I used IRSA... I still have not switched to Pod Identity.

gecube avatar May 13 '25 09:05 gecube

Any updates on this investigation? I'm facing similar issues on my stack too

vinicius91 avatar Oct 22 '25 12:10 vinicius91

@vinicius91 I dropped MSK and switched to Strimzi (since we are on EKS). I recommend everybody consider this option.

But if we are talking about MSK and using it is a mandatory requirement, then unfortunately I did not make progress with the investigation.

gecube avatar Oct 22 '25 12:10 gecube

Strimzi was our first candidate for provisioning Kafka since we are also on EKS, but we decided to give MSK a try, assuming it would be a smoother experience since we already use other ACK controllers. So far it has been the opposite.

Would you say that Strimzi is on the same complexity level as the Provisioned Standard when it comes to storage management?

vinicius91 avatar Oct 22 '25 13:10 vinicius91

@vinicius91 I think Strimzi gives a better experience, at least from the maintenance and cost perspective. But I don't mean this as an advertisement against managed Amazon services and MSK in particular - no way :-)

gecube avatar Oct 22 '25 13:10 gecube

Yes, but it would be nice if they put more effort into the product. This issue sitting unaddressed since 2024 is bad advertisement in itself :(

vinicius91 avatar Oct 22 '25 13:10 vinicius91

Hello @kappa8219 @vinicius91, I was unable to replicate this issue. I'm also using Pod Identity with the AmazonMSKFullAccess permission, and I was able to create the MSK Cluster successfully. I'm using v1.2.1.

Not sure what could be the issue here..

michaelhtm avatar Oct 23 '25 17:10 michaelhtm

> Hello @kappa8219 @vinicius91 I was unable to replicate this issue. I'm also using PodIdentity with AmazonMSKFullAccess permission, and was able to create the msk Cluster successfully.. I'm using v1.2.1.
>
> Not sure what could be the issue here..

Hm, I will retry with the current version. Also, 4.1 is out; it will be interesting to see queues in Kafka.

kappa8219 avatar Oct 23 '25 17:10 kappa8219

Still no success :(

{
  "level": "error",
  "ts": "2025-10-24T05:19:33.504Z",
  "msg": "Reconciler error",
  "controller": "cluster",
  "controllerGroup": "kafka.services.k8s.aws",
  "controllerKind": "Cluster",
  "Cluster": {
    "name": "xxx-dev-eks-app",
    "namespace": "ack-system"
  },
  "namespace": "ack-system",
  "name": "xxx-dev-eks-app",
  "reconcileID": "8d29b64d-3355-4cb5-9fe0-ad72ec16cfd3",
  "error": "operation error Kafka: CreateCluster, https response error StatusCode: 403, RequestID: ID, api error AccessDeniedException: User: arn:aws:sts::xxx:assumed-role/ack-controllers-kafka/eks-terra-clus-ack-contro-xxx is not authorized to perform: kafka:CreateCluster on resource: *",
  "stacktrace": "sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:347\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:294\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:255"
}

The Pod Identity association seems to be fine (at least the same config works for many other controllers). The role also looks correctly attached, with both AdministratorAccess and AmazonMSKFullAccess.

Here is my cluster config:

apiVersion: kafka.services.k8s.aws/v1alpha1
kind: Cluster
metadata:
  name: dev-eks-app
  namespace: ack-system
spec:
  name: dev-eks-app
  kafkaVersion: "4.1.x.kraft"
  numberOfBrokerNodes: 2
  brokerNodeGroupInfo:
    instanceType: "kafka.m7g.large"
    clientSubnets:
      - subnet-xxx
      - subnet-yyy
    securityGroups:
      - sg-zzz
    storageInfo:
      ebsStorageInfo:
        volumeSize: 200
        provisionedThroughput:
          enabled: false
  encryptionInfo:
    encryptionInTransit:
      clientBroker: "TLS_PLAINTEXT"
      inCluster: true
  enhancedMonitoring: "DEFAULT"
  loggingInfo:
    brokerLogs:
      cloudWatchLogs:
        enabled: true
        logGroup: dev-mks-logs-app
  configurationInfo:
    arn: "aws_msk_configuration.dev-cluster-configuration-eks-2node-kraft.arn"
    revision: 1
  openMonitoring:
    prometheus:
      jmxExporter:
        enabledInBroker: true
      nodeExporter:
        enabledInBroker: true
---
apiVersion: kafka.services.k8s.aws/v1alpha1
kind: Configuration
metadata:
  name: dev-cluster-configuration-eks-2node-kraft2
  namespace: ack-system
spec:
  name: "dev-cluster-configuration-eks-2node-kraft2"
  kafkaVersions:
    - "4.1.x.kraft"
  serverProperties: MY_HASH_CONFIG
---

Controller: public.ecr.aws/aws-controllers-k8s/kafka-controller:1.2.1

Helm chart values for the eks and kafka controllers, which create the Pod Identity association:

    eks:
      enabled: true
      aws:
        region: us-east-1
      deployment:
        tolerations:
        - effect: NoSchedule
          key: ng
          operator: Equal
          value: ccc
        nodeSelector:
          NodeType: ccc
      serviceAccount:
        annotations:
          eks.amazonaws.com/role-arn: arn:aws:iam::xxx:role/ack-eks-controller
        name: ack-eks-controller
    kafka:
      enabled: true
      aws:
        region: us-east-1
      deployment:
        tolerations:
        - effect: NoSchedule
          key: ng
          operator: Equal
          value: ctrls
        nodeSelector:
          NodeType: ctrls
      serviceAccount:
        annotations:
          eks.amazonaws.com/role-arn: arn:aws:iam::xxx:role/ack-controllers-kafka
        name: ack-controllers-kafka
apiVersion: eks.services.k8s.aws/v1alpha1
kind: PodIdentityAssociation
metadata:
  name: pod-identity-association-controllers-kafka
  namespace: ack-system
spec:
  clusterName: mycluster
  namespace: ack-system
  roleARN: arn:aws:iam::xxx:role/ack-controllers-kafka
  serviceAccount: ack-controllers-kafka

Finally the role:

apiVersion: iam.services.k8s.aws/v1alpha1
kind: Role
metadata:
  name: ack-controllers-kafka
  namespace: ack-system
spec:
  name: ack-controllers-kafka
  assumeRolePolicyDocument: |
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
                  "Effect": "Allow",
                  "Principal": {
                      "Service": "pods.eks.amazonaws.com"
                  },
                  "Action": [
                      "sts:AssumeRole",
                      "sts:TagSession"
                  ]
              }
          ]
      }
  policies:
  - arn:aws:iam::aws:policy/AdministratorAccess
  - arn:aws:iam::aws:policy/AmazonMSKFullAccess

kappa8219 avatar Oct 24 '25 05:10 kappa8219

Update: I tried on EKS 1.34, same thing. What confuses me is that Configuration creation works, so the role-assumption mechanism is definitely fine. But what causes CreateCluster to fail is not clear.

Works:

apiVersion: kafka.services.k8s.aws/v1alpha1
kind: Configuration

Does not:

apiVersion: kafka.services.k8s.aws/v1alpha1
kind: Cluster

kappa8219 avatar Oct 24 '25 08:10 kappa8219

@michaelhtm, @vinicius91 any ideas what to try? Maybe add some debug logging to the controller?

kappa8219 avatar Oct 24 '25 08:10 kappa8219

@kappa8219 can you try assuming the PodIdentity role in your terminal and creating the cluster using the AWS CLI? I just saw this https://repost.aws/questions/QUszJm_J6pR32y7qdpO9oAng/is-not-authorized-to-perform-kafka-createcluster where someone is running into the same issue when using the CLI.
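A sketch of the suggestion above: assume the controller's role locally and retry the failing call under those credentials, so any permission problem reproduces outside the controller. The role name matches the one in this thread; the account ID is a placeholder, and note the role's trust policy here only allows `pods.eks.amazonaws.com`, so you may need to temporarily add your own principal to it before you can assume the role.

```shell
# Assume the controller's role (placeholders: account ID, subnets).
creds=$(aws sts assume-role \
  --role-arn "arn:aws:iam::<account-id>:role/ack-controllers-kafka" \
  --role-session-name msk-debug \
  --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
  --output text)
export AWS_ACCESS_KEY_ID=$(printf '%s' "$creds" | awk '{print $1}')
export AWS_SECRET_ACCESS_KEY=$(printf '%s' "$creds" | awk '{print $2}')
export AWS_SESSION_TOKEN=$(printf '%s' "$creds" | awk '{print $3}')

# If the role itself lacks kafka:CreateCluster, the same
# AccessDeniedException shows up here, outside the controller:
aws kafka create-cluster \
  --cluster-name perm-test \
  --kafka-version "4.1.x.kraft" \
  --number-of-broker-nodes 2 \
  --broker-node-group-info '{"InstanceType":"kafka.m5.large","ClientSubnets":["subnet-a","subnet-b"]}'
```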

michaelhtm avatar Oct 24 '25 17:10 michaelhtm

@michaelhtm Interesting case in that repost, but it's not mine.

I can create a cluster with both the AWS web console and the CLI:

aws kafka create-cluster \
    --cluster-name test-msk2 \
    --kafka-version "4.1.x.kraft" \
    --number-of-broker-nodes 2 \
    --broker-node-group-info '{
        "InstanceType": "kafka.m5.large",
        "ClientSubnets": ["subnet-x", "subnet-y"]
    }'
{
    "ClusterArn": "arn:aws:kafka:us-east-1:xxx:cluster/test-msk2/xxx",
    "ClusterName": "test-msk2",
    "State": "CREATING"
}

From the ACK controller it still fails.

One more thing: I remember that Pod Identity only works with the AWS SDK v2. But this controller is based on a quite recent ACK runtime, so the libraries should be up to date.

kappa8219 avatar Oct 27 '25 08:10 kappa8219

@michaelhtm, @vinicius91 I finally discovered what was wrong. It was this:

configurationInfo:
    arn: "FULL_CONFIG_ARN"

I had an incorrect value there, sorry, my bad. In my defense, the error message is misleading for this case. Maybe linking the Configuration by name instead of by ARN would help avoid such mistakes.
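For anyone hitting the same thing: the value must be a real MSK configuration ARN, not a reference string from another tool (the manifest earlier in this thread passed a Terraform-style reference). A quick pre-flight sanity check, assuming MSK configuration ARNs follow the usual `arn:aws:kafka:<region>:<account>:configuration/<name>/<uuid>` shape:

```shell
# Hypothetical pre-flight check: does the value even look like an MSK
# configuration ARN? The Terraform-style reference from this thread fails it.
arn='aws_msk_configuration.dev-cluster-configuration-eks-2node-kraft.arn'
if printf '%s' "$arn" | grep -Eq '^arn:aws:kafka:[a-z0-9-]+:[0-9]{12}:configuration/.+/.+$'; then
  echo "looks like an MSK configuration ARN"
else
  echo "not an ARN; this kind of value reproduced the 403 in this issue"
fi
```

Once ACK has created the Configuration, the actual ARN should be readable from the resource's status (the standard ACK `status.ackResourceMetadata.arn` field), e.g. via `kubectl get configuration <name> -o jsonpath='{.status.ackResourceMetadata.arn}'`.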

kappa8219 avatar Nov 06 '25 12:11 kappa8219

Thanks for the update, folks. I'll give it a try on my end.

vinicius91 avatar Dec 01 '25 15:12 vinicius91