
SSO RBAC in 3.4 with managed namespace works differently

Open simox-83 opened this issue 3 years ago • 22 comments

Pre-requisites

  • [X] I have double-checked my configuration
  • [X] I can confirm the issue exists when I tested with :latest
  • [ ] I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

I upgraded Argo Workflows from 3.1.13 to 3.4.3. SSO authentication worked fine with 3.1.13, but it does not work with 3.4.3. The SSO configuration (Okta) has not changed.

When I open the UI and click the SSO login button, a red banner appears in the bottom-right corner saying Failed to load version/info Error: Unauthorized. After that the page keeps trying to load, and after some time the browser reports: test-ce-argo-server-integration.k8s.cnqr.tech didn't send any data. ERR_EMPTY_RESPONSE

Version

3.4.3

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

N/A - The issue happens at login time, so I can't run any workflow.

Logs from the workflow ~controller~ server

I am attaching the logs of the workflow server, because the error happens during authentication:

time="2022-11-07T13:37:32.115Z" level=info duration=2.752873ms method=GET path=/main.2430295409b8b54e52ad.js size=1471060 status=0
time="2022-11-07T13:37:32.117Z" level=info duration="23.156µs" method=GET path=index.html size=0 status=304
time="2022-11-07T13:37:33.935Z" level=info msg="finished unary call with code Unauthenticated" error="rpc error: code = Unauthenticated desc = token not valid for running mode" grpc.code=Unauthenticated grpc.method=GetUserInfo grpc.service=info.InfoService grpc.start_time="2022-11-07T13:37:33Z" grpc.time_ms=0.047 span.kind=server system=grpc
time="2022-11-07T13:37:33.935Z" level=info msg="finished unary call with code Unauthenticated" error="rpc error: code = Unauthenticated desc = token not valid for running mode" grpc.code=Unauthenticated grpc.method=GetInfo grpc.service=info.InfoService grpc.start_time="2022-11-07T13:37:33Z" grpc.time_ms=0.028 span.kind=server system=grpc
time="2022-11-07T13:37:33.935Z" level=info duration=1.628208ms method=GET path=/api/v1/userinfo size=56 status=401
time="2022-11-07T13:37:33.935Z" level=info duration=2.284157ms method=GET path=/api/v1/info size=56 status=401
time="2022-11-07T13:37:34.116Z" level=info duration="202.575µs" method=GET path=/assets/fonts/fa-solid-900.woff2 size=150472 status=0
time="2022-11-07T13:37:34.116Z" level=info duration="111.014µs" method=GET path=/assets/images/logo.png size=41464 status=0
time="2022-11-07T13:37:34.389Z" level=info msg="finished unary call with code Unauthenticated" error="rpc error: code = Unauthenticated desc = token not valid for running mode" grpc.code=Unauthenticated grpc.method=CollectEvent grpc.service=info.InfoService grpc.start_time="2022-11-07T13:37:34Z" grpc.time_ms=0.03 span.kind=server system=grpc
time="2022-11-07T13:37:34.389Z" level=info duration="438.543µs" method=POST path=/api/v1/tracking/event size=56 status=401
time="2022-11-07T13:37:36.819Z" level=info duration="67.668µs" method=GET path=index.html size=473 status=0
time="2022-11-07T13:37:56.819Z" level=info duration="68.792µs" method=GET path=index.html size=473 status=0
time="2022-11-07T13:38:16.819Z" level=info duration="74.393µs" method=GET path=index.html size=473 status=0
time="2022-11-07T13:38:36.819Z" level=info duration="83.064µs" method=GET path=index.html size=473 status=0
time="2022-11-07T13:38:56.819Z" level=info duration="68.599µs" method=GET path=index.html size=473 status=0
time="2022-11-07T13:39:16.819Z" level=info duration="81.42µs" method=GET path=index.html size=473 status=0
time="2022-11-07T13:39:36.819Z" level=info duration="72.698µs" method=GET path=index.html size=473 status=0

Logs from your workflow's wait container

N/A

This is the service account configured for RBAC:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: admin-user
  annotations:
    # The rule is an expression used to determine if this service account
    # should be used.
    # * `groups` - an array of the OIDC groups
    # * `iss` - the issuer ("argo-server")
    # * `sub` - the subject (typically the username)
    # Must evaluate to a boolean.
    # If you want an account to be the default to use, this rule can be "true".
    # Details of the expression language are available in
    # https://github.com/antonmedv/expr/blob/master/docs/Language-Definition.md.
    workflows.argoproj.io/rbac-rule: "true"
    # The precedence is used to determine which service account to use whe
    # Precedence is an integer. It may be negative. If omitted, it defaults to "0".
    # Numerically higher values have higher precedence (not lower, which may be
    # counter-intuitive to you).
    # If two rules match and have the same precedence, then which one is used will
    # be arbitrary.
    workflows.argoproj.io/rbac-rule-precedence: "0"

simox-83 avatar Nov 07 '22 13:11 simox-83

Please note that this bug is also present in version 3.4.2. I rolled back to 3.1.13 and it's working again.

If I compare the logs, it looks like the issue is the 401 returned when calling the /api endpoints.

simox-83 avatar Nov 08 '22 16:11 simox-83

@sarabala1979 thanks for looking at it. Please let me know if you want to discuss it with a live demo. We can book some time and I can share with you what I see. Thank you.

simox-83 avatar Nov 09 '22 13:11 simox-83

It still works for me in 3.4.3. I use Dex, not Okta.

tooptoop4 avatar Nov 12 '22 22:11 tooptoop4

I thought that I was also affected by this issue or something similar. But for me the problem was running Kubernetes 1.25. Starting with Kubernetes 1.24 service account tokens are no longer generated automatically and I had to create an empty secret with appropriate annotation to get the token that Argo Workflows tries to read. See https://github.com/argoproj/argo-workflows/blob/master/docs/manually-create-secrets.md.

I was getting this error message in the server's logfile:

time="2022-11-14T17:12:21.485Z" level=error msg="failed to perform RBAC authorization" error="failed to get service account secret: secrets \"argo-workflows-server.service-account-token\" not found"

Leaving this note as it might help someone else who's searching through the issues.
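
For readers on Kubernetes 1.24 or later, a minimal sketch of such a Secret follows. It assumes the service account is named argo-workflows-server, which is inferred from the secret name in the error message above, and that the Secret is created in the same namespace as that service account. Adjust both to your install.

```yaml
# Sketch only: the service account name is inferred from the error message.
# Kubernetes populates the token once this Secret exists.
apiVersion: v1
kind: Secret
metadata:
  # Name the server looked for in the log line: <service account name>.service-account-token
  name: argo-workflows-server.service-account-token
  annotations:
    # Tells Kubernetes which service account the token belongs to
    kubernetes.io/service-account.name: argo-workflows-server
type: kubernetes.io/service-account-token
```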

elemental-lf avatar Nov 14 '22 18:11 elemental-lf

Thanks for your update @elemental-lf - in my case I am using Kubernetes 1.19 so I shouldn't be affected. But thanks for pointing this out, I'd missed this info in my initial post.

simox-83 avatar Nov 15 '22 09:11 simox-83

The red banner appears before you’re logged in. Can you try deleting cookies and logging back in? Ignore the banner.

alexec avatar Nov 16 '22 15:11 alexec

I actually did this by using Chrome in Incognito mode, and I get the Okta page back. But once I try to log in, it just spins and then returns: test-ce-argo-server-integration.k8s.cnqr.tech didn't send any data. ERR_EMPTY_RESPONSE

We haven't changed anything on the Okta side, so I guess we are missing something in the request?

simox-83 avatar Nov 16 '22 16:11 simox-83

I think this might be fixed by #10046

simox-83 avatar Nov 18 '22 09:11 simox-83

My k8s version is: v1.23.10. And using argo server latest image with digest sha256:744501b36420f42eb33628206449bce4654604046baf19b193cbae4b25621291. I am still stuck on this issue.

My SSO server is the Argo CD dex.

The browser reports a 401 for /api/v1/userinfo

LinuxSuRen avatar Dec 06 '22 02:12 LinuxSuRen

I thought that I was also affected by this issue or something similar. But for me the problem was running Kubernetes 1.25. Starting with Kubernetes 1.24 service account tokens are no longer generated automatically and I had to create an empty secret with appropriate annotation to get the token that Argo Workflows tries to read. See https://github.com/argoproj/argo-workflows/blob/master/docs/manually-create-secrets.md.

I was getting this error message in the server's logfile:

time="2022-11-14T17:12:21.485Z" level=error msg="failed to perform RBAC authorization" error="failed to get service account secret: secrets \"argo-workflows-server.service-account-token\" not found"

Leaving this note as it might help someone else who's searching through the issues.

@LinuxSuRen could this be your case?

vitalyrychkov avatar Dec 23 '22 15:12 vitalyrychkov

I don't know how, but it works now. Thanks @vitalyrychkov

LinuxSuRen avatar Jan 04 '23 01:01 LinuxSuRen

FYI: SSO seems to work in v3.4.4 in single namespace mode but not in managed namespace mode. This makes me skeptical about the HTTP proxy fix. I have not checked a cluster install. The "latest" images for workflows do not seem to fix this issue yet. Has the HTTP proxy fix been included in the latest image?

It worked in v3.3.5

apiwoni avatar Jan 14 '23 00:01 apiwoni

I think this might be fixed by #10046

@simox-83 Were you able to verify that this fix resolved your issue? Which install mode do you have: namespace, cluster or managed namespace?

apiwoni avatar Jan 14 '23 01:01 apiwoni

OK. Here's the issue, I think, which has nothing to do with proxy.

In v3.3.5 I was able to configure SSO RBAC by defining a Role and RoleBinding in the target namespace, bound to an annotated service account in the server namespace, and it worked. This no longer works in v3.4.4.

In v3.4.4 I have to configure SSO RBAC by defining the Role and RoleBinding in the target namespace, bound to an annotated service account that is ALSO in the target namespace instead of the server namespace. This SSO RBAC configuration does not work in v3.3.5.

Whether or not I defined SSO_DELEGATE_RBAC_TO_NAMESPACE=true had no bearing in either case.
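
A sketch of the layout that is described as working in v3.4.4 follows. All names are hypothetical: team-a stands for the target (managed) namespace, and the Role rules are only an illustrative set of permissions. The annotations mirror the ones from the original post.

```yaml
# Hypothetical names; "team-a" is the managed namespace.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: admin-user
  namespace: team-a            # v3.4.4 layout: the annotated SA lives in the target namespace
  annotations:
    workflows.argoproj.io/rbac-rule: "true"
    workflows.argoproj.io/rbac-rule-precedence: "0"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: workflow-admin
  namespace: team-a
rules:
  # Illustrative permissions only; grant what your users actually need
  - apiGroups: ["argoproj.io"]
    resources: ["workflows", "workflowtemplates", "cronworkflows"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: workflow-admin
  namespace: team-a
subjects:
  - kind: ServiceAccount
    name: admin-user
    namespace: team-a          # in the v3.3.5 layout this was the server namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: workflow-admin
```

In the v3.3.5 layout, the ServiceAccount and the subjects.namespace field would point at the server namespace instead, while the Role and RoleBinding stay in team-a.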

apiwoni avatar Jan 14 '23 03:01 apiwoni

@simox-83 Can you confirm that this has been resolved in the latest versions? We might be able to patch 3.3, but it's unlikely. We want to make sure it was fixed by #10046 and is working in 3.4

JPZ13 avatar Feb 23 '23 18:02 JPZ13

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

stale[bot] avatar Mar 25 '23 06:03 stale[bot]

OK. Here's the issue, I think, which has nothing to do with proxy.

In v3.3.5 I was able to configure SSO RBAC by defining a Role and RoleBinding in the target namespace, bound to an annotated service account in the server namespace, and it worked. This no longer works in v3.4.4.

In v3.4.4 I have to configure SSO RBAC by defining the Role and RoleBinding in the target namespace, bound to an annotated service account that is ALSO in the target namespace instead of the server namespace. This SSO RBAC configuration does not work in v3.3.5.

Whether or not I defined SSO_DELEGATE_RBAC_TO_NAMESPACE=true had no bearing in either case.

I faced the same issue and we are not using a proxy. We use a managed namespace, and when upgrading from 3.3 to 3.4 we had to move the service account and the bindings from the Argo Server namespace to the managed namespace to make it work.

kiddo3 avatar Apr 12 '23 14:04 kiddo3

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

stale[bot] avatar Jun 18 '23 04:06 stale[bot]

Not stale. Needs fixing

JPZ13 avatar Jun 21 '23 11:06 JPZ13

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

stale[bot] avatar Sep 17 '23 11:09 stale[bot]

This issue seems to have become a hodgepodge collection of different SSO configuration issues, which makes it hard to act on, and reproduction details are often missing. As such I'm inclined to close this out. If you have a specific SSO issue, please file a new bug report with a reproducible configuration showing the bug.

  1. I think this might be fixed by #10046

    OP's issue might have been fixed by this proxy change. OP never responded. But if they weren't sure, then it's hard to say what the root cause was to begin with.

  2. But for me the problem was running Kubernetes 1.25. Starting with Kubernetes 1.24 service account tokens are no longer generated automatically and I had to create an empty secret with appropriate annotation to get the token that Argo Workflows tries to read. See https://github.com/argoproj/argo-workflows/blob/master/docs/manually-create-secrets.md.

    I was getting this error message in the server's logfile:

    time="2022-11-14T17:12:21.485Z" level=error msg="failed to perform RBAC authorization" error="failed to get service account secret: secrets \"argo-workflows-server.service-account-token\" not found"
    

    This appears to have been the problem for several people in this thread as well, and is unrelated to OP. The SA Secrets docs are now here (permalink): https://argo-workflows.readthedocs.io/en/release-3.5/service-account-secrets/

  3. The red banner appears before you’re logged in. Can you try deleting cookies and logging back in? Ignore the banner.

    This is also a common issue. The banner and error message are not indicative of the root cause.

    More logging was added in #11370, so if you get this and think it may be due to a misconfiguration and not just an invalid or expired token, check your Server logs preceding this error.

    We may remove this banner message because it is too generic and sometimes counter-productive, per #12070 and #12168. I need to investigate further whether we can disambiguate the error better at that phase or prior (most SSO errors happen during the callback which precedes the login, hence the preceding logs mentioned above).

  4. In v3.4.4 I have to configure SSO RBAC by defining the Role and RoleBinding in the target namespace, bound to an annotated service account that is ALSO in the target namespace instead of the server namespace. This SSO RBAC configuration does not work in v3.3.5.

    Whether or not I defined SSO_DELEGATE_RBAC_TO_NAMESPACE=true had no bearing in either case.

    I faced the same issue and we are not using a proxy. We use a managed namespace, and when upgrading from 3.3 to 3.4 we had to move the service account and the bindings from the Argo Server namespace to the managed namespace to make it work.

    This managed namespace change -- without delegation -- sounds like a potential regression. I couldn't find where that might have happened, either in the 3.4 changelog or by looking through the code. From the blame, the only thing I can think of off the top of my head is that #8555 maybe had a bug?

    Problematically, that appears to have also been a breaking change, one that has persisted to 3.5 too 😕. Fixing that would result in another breaking change 😕

agilgur5 avatar Apr 25 '24 18:04 agilgur5

From the blame, the only thing I can think of off the top of my head is that #8555 maybe had a bug?

Problematically, that appears to have also been a breaking change, one that has persisted to 3.5 too 😕. Fixing that may result in another breaking change 😕

Yep, that PR appears to have caused a completely undocumented breaking change regression 😕 See my comments in https://github.com/argoproj/argo-workflows/pull/8555#discussion_r1579963621

That is pretty confusing behavior for managed namespaces, so I'm inclined to change it back... but two breaking changes are not great either...

We could patch both 3.4.x and 3.5.x, but it'd be a breaking patch then... 😕

agilgur5 avatar Apr 25 '24 18:04 agilgur5

Discussed in today's Contributor Meeting, and the consensus was that we would add a note to the 3.4 upgrading guide about this unintentional bug / breaking change to SSO RBAC with managed namespaces, and then fix it in 3.6 with another note. Since this bug has existed for a while now (the entirety of 3.4 and 3.5), we don't want to break folks again in a patch release, so doing it in a minor release will make things clearer.

agilgur5 avatar Jun 19 '24 02:06 agilgur5

I'm a bit confused as to the current state of this feature. The issue is marked with "solution/workaround", but I don't think I understand. I can't seem to get namespace delegation to work when I put the service accounts into the rbac managed namespaces. It only works when my service accounts are in the workflow server namespace.

sstaley-hioscar avatar Aug 04 '24 01:08 sstaley-hioscar

I can't seem to get namespace delegation

Yes, namespace delegation specifically still works correctly. But if you turn it off and have a managed namespace in 3.4 or 3.5, your SAs will still have to be in the managed namespace. (and moving them there is a workaround). See also my PR comment as the most clear, isolated comment: https://github.com/argoproj/argo-workflows/pull/8555#discussion_r1579963621

Or also, from an earlier comment above:

Whether or not I defined SSO_DELEGATE_RBAC_TO_NAMESPACE=true had no bearing in either case.

^That should not be the case and is a bug.

(also I deleted the comment I made a few min before this as I misread)
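
For reference, a sketch of how the delegation flag is typically set: as an environment variable on the Argo Server container. The container name argo-server is an assumption based on the default manifests, and only the relevant fragment of the Deployment is shown.

```yaml
# Fragment of the Argo Server Deployment; container name is an assumption.
spec:
  template:
    spec:
      containers:
        - name: argo-server
          env:
            # Enables SSO RBAC namespace delegation
            - name: SSO_DELEGATE_RBAC_TO_NAMESPACE
              value: "true"
```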

agilgur5 avatar Aug 04 '24 03:08 agilgur5

You also might be able to work around it by removing the managed namespace flag from the Server and making it cluster-level, while keeping its RBAC limited to the managed namespace. I haven't tried that though
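
An untested sketch of what that could look like on the Argo Server Deployment, assuming the server was started with the --namespaced and --managed-namespace flags. The container name and the team-a namespace are hypothetical.

```yaml
# Fragment of the Argo Server Deployment; untested, names are hypothetical.
containers:
  - name: argo-server
    args:
      - server
      # Removed so that the server runs cluster-level:
      # - --namespaced
      # - --managed-namespace=team-a
```

The Role and RoleBinding for the server's own service account would then stay in team-a only, so that its permissions remain limited to that namespace.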

agilgur5 avatar Aug 04 '24 03:08 agilgur5

Hey @agilgur5 thank you for the quick response. I was able to get it working with quite a bit of troubleshooting of some issues that were mostly due to my helm charts. I'll share my issues here in case it helps anyone else.

  1. I didn't have the annotation workflows.argoproj.io/service-account-token.name present on the service accounts
  2. The namespace rolebinding needs to be placed in the managed namespace via the namespace metadata flag and must reference the server service account via its namespace in the "subjects" reference (Potentially obvious, but I missed a field)
  3. This was a tricky one. My helm templates were turning the workflows.argoproj.io/rbac-rule into double single quotes, i.e. '''engineering_infra_platform'' in groups' instead of "engineering_infra_platform" in groups. Make sure this is configured exactly. Oddly, this had previously been working in the non-namespaced mode.
  4. Another helm templating mistake. The workflows.argoproj.io/rbac-rule-precedence annotation was being rendered as "100" vs 100. This was due to a refactor I made where I was placing the value in a dictionary and referencing it. What was particularly tricky was that it did not show up any differently in k9s via the default view. It only showed up when I described the object.
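
As a rough illustration of items 1, 3 and 4, this is how the rendered ServiceAccount might look. The resource name, namespace, and Secret name are hypothetical; only the annotation keys and the group name come from the list above. The reading of item 4 is an assumption: annotation values must be YAML strings, so the quotes stay in the manifest, but the string content should be a bare integer with no extra quote characters embedded by the template.

```yaml
# Hypothetical names; annotation keys are the ones mentioned above.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: infra-platform-user
  namespace: team-a
  annotations:
    # Item 3: the expression as Argo should see it: "engineering_infra_platform" in groups
    workflows.argoproj.io/rbac-rule: '"engineering_infra_platform" in groups'
    # Item 4 (assumed reading): a YAML string whose content is just the integer
    workflows.argoproj.io/rbac-rule-precedence: "100"
    # Item 1: name of the Secret that holds this account's token
    workflows.argoproj.io/service-account-token.name: infra-platform-user.service-account-token
```

Rendering the chart with helm template and reading the raw output, rather than a summarized view, is one way to catch doubled or embedded quotes before they reach the cluster.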

sstaley-hioscar avatar Aug 04 '24 17:08 sstaley-hioscar