kiam icon indicating copy to clipboard operation
kiam copied to clipboard

kiam-server broken in v4.1

Open mestuddtc opened this issue 4 years ago • 5 comments

I had a working configuration using v4.0, but v4.1 is broken. I am running outside of AWS, so autodetecting roles is not an option.

These are the server parameters, working in v4.0.

      containers:
      - args:
        - server
        - --json-log
        - --level=debug
        - --bind=0.0.0.0:443
        - --cert=/etc/kiam/tls/server.pem
        - --key=/etc/kiam/tls/server-key.pem
        - --ca=/etc/kiam/tls/ca.pem
        - --role-base-arn=arn:aws:iam::682359534587:role/
        - --sync=1m
        - --prometheus-listen-addr=0.0.0.0:9620
        - --prometheus-sync-interval=5s

With v4.1 or master, the server logs:

{"level":"info","msg":"started prometheus metric listener 0.0.0.0:9620","time":"2021-07-21T19:39:51Z"}
{"level":"info","msg":"starting server","time":"2021-07-21T19:39:51Z"}
{"level":"fatal","msg":"error using AWS STS Gateway: role can't be empty","time":"2021-07-21T19:39:51Z"}

mestuddtc avatar Jul 21 '21 19:07 mestuddtc

Took a look at this one. This change is responsible: https://github.com/uswitch/kiam/commit/8103483280953838e3fa629f04d9e8d44b22df3c

I unsure if this is a bug. Before the change above it was assumed server pods were running on a node with the appropriate IAM permissions; the STS gateway config would just not provide any credentials. Now it appears --assume-role-arn must be provided and set to a role the node can assume. That role must also have the necessary permissions for everything to work.

If this is the intended behavior then the docs should be updated to reflect that and the helm chart should be updated to require a value for assumeRoleArn because just providing a role base ARN (either via --role-base-arn or --role-base-arn-autodetect) is not enough now.

dmorgan81 avatar Nov 05 '21 15:11 dmorgan81

I would consider it a bug for two reasons:

  1. it is an incompatible, breaking change in a minor update. If it was intended, the version should have been 5.0, should it not?
  2. the credentials are being provided to kiam-server via IRSA annotations (earlier I think access key and secret), which explicitly give the permissions to do its work. It doesn't make sense to me that I would have to add another layer of IAM roles and permissions.

mestuddtc avatar Nov 05 '21 16:11 mestuddtc

I agree with you on both points. I don't like to think that both of us were leveraging undefined behavior that just happened to work in v4.0.

Hopefully a KIAM dev can chime in on whether https://github.com/uswitch/kiam/commit/8103483280953838e3fa629f04d9e8d44b22df3c unintentionally broke our use case. Fixing it wouldn't be difficult; just check if b.config.AssumeRoleArn is empty and skip trying to resolve the role if so.

dmorgan81 avatar Nov 05 '21 19:11 dmorgan81

Same issue after upgrading to v4.2, but the server is running inside AWS:

{"level":"info","msg":"starting server","time":"2022-02-23T19:15:04Z"}
{"level":"info","msg":"started prometheus metric listener 0.0.0.0:9620","time":"2022-02-23T19:15:04Z"}
{"level":"info","msg":"detecting arn prefix","time":"2022-02-23T19:15:04Z"}
{"level":"info","msg":"using detected prefix: arn:aws:iam::xxxxxxxxxxxx:role/","time":"2022-02-23T19:15:04Z"}
{"level":"fatal","msg":"error using AWS STS Gateway: role can't be empty","time":"2022-02-23T19:15:04Z"}

Config:

    spec:
      containers:
      - args:
        - --json-log
        - --level=info
        - --bind=0.0.0.0:443
        - --cert=/etc/kiam/tls/tls.crt
        - --key=/etc/kiam/tls/tls.key
        - --ca=/etc/kiam/tls/ca.crt
        - --role-base-arn-autodetect
        - --session-duration=15m
        - --sync=1m
        - --prometheus-listen-addr=0.0.0.0:9620
        - --prometheus-sync-interval=5s
        command:
        - /kiam
        - server

Downgrading to v4.0 resolves the error:

{"level":"info","msg":"starting server","time":"2022-02-23T19:17:20Z"}
{"level":"info","msg":"started prometheus metric listener 0.0.0.0:9620","time":"2022-02-23T19:17:20Z"}
{"level":"info","msg":"detecting arn prefix","time":"2022-02-23T19:17:20Z"}
{"level":"info","msg":"using detected prefix: arn:aws:iam::xxxxxxxxxxxx:role/","time":"2022-02-23T19:17:20Z"}
{"level":"info","msg":"detecting arn prefix","time":"2022-02-23T19:17:20Z"}
{"level":"info","msg":"using detected prefix: arn:aws:iam::xxxxxxxxxxxx:role/","time":"2022-02-23T19:17:20Z"}
{"level":"info","msg":"will serve on 0.0.0.0:443","time":"2022-02-23T19:17:20Z"}

stephan2012 avatar Feb 23 '22 19:02 stephan2012

We "fixed" this by setting server.assumeRoleArn in the helm chart to the IAM role our nodes already as. Silly but it does restore behavior to what it was in 4.0.

dmorgan81 avatar Feb 23 '22 19:02 dmorgan81