argo-workflows icon indicating copy to clipboard operation
argo-workflows copied to clipboard

fix: remove UID from securitycontext, redundant to UID in Dockerfile Fixes #14279

Open antoinetran opened this issue 10 months ago • 10 comments

Fixes #14279

Motivation

Argo should work easily in OpenShift and non-OpenShift environment. Currently, the UID 8737 is hard-coded in code, both in Dockerfile and in YAML resource to apply. This is redundant, and makes it fail in OpenShift, where a pod policy such as "UID must be in range 10000000-10010000" . This pod policy can also be dynamic, per OpenShift project, so another one might says "UID must be in range 20000000-20010000". So it is a nightmare to deploy ARGO and specifies each UID, while NOT settting the runAsUser UID in YAML resource, allows Kubernetes to use the default one in Dockerfile in non-OpenShift environment, and use the default high UID in the range in OpenShift environment.

Modifications

Just removed the UID in runAsUser by setting it to nil.

Verification

Documentation

This changes nothing in use, and override of this UID for artgc pod has not yet been documented I believe. But from the other issue I found, the override with PodSpecPatch (described in my issue #14279 workaround) works before and after this pullrequest.

antoinetran avatar Mar 10 '25 09:03 antoinetran

FYI, this is where Dockerfile hard-code 8737 https://github.com/argoproj/argo-workflows/blob/v3.6.4/Dockerfile#L108, and this is ok, as long as this is not specified again in securityContext.

antoinetran avatar Mar 10 '25 09:03 antoinetran

Hi @Garett-MacGowan , I have an issue with one test with my change. I can see you wrote that test. I have no idea how to debug, I mean there is an issue with TestGlobalArtifactPassing and I can see exit code 2 of one container, but I cannot see the logs. Do you have a hint of how to progress?

antoinetran avatar Mar 11 '25 09:03 antoinetran

Hi @Garett-MacGowan , I have an issue with one test with my change. I can see you wrote that test. I have no idea how to debug, I mean there is an issue with TestGlobalArtifactPassing and I can see exit code 2 of one container, but I cannot see the logs. Do you have a hint of how to progress?

Could it be that the workflow container itself no longer has the privileges it needs to write to /tmp/artifact.txt? e.g. argoproj/argosay:v2 doesn't specify USER 8737 and therefore won't have privilege anymore.

Garett-MacGowan avatar Mar 13 '25 23:03 Garett-MacGowan

Could it be that the workflow container itself no longer has the privileges it needs to write to /tmp/artifact.txt? e.g. argoproj/argosay:v2 doesn't specify USER 8737 and therefore won't have privilege anymore.

argoproj/argosay:v2 is running as root by default. In the workflow in failure (https://github.com/argoproj/argo-workflows/blob/main/test/e2e/testdata/global-artifact-passing.yaml#L64), I don't see any user override in the workflow, so unless the CI/CD runs with a default non root user, this container should be run as root. So it shouldn't has issue with writing to /tmp/artifact.txt.

Normally, my modification only impacts these file who uses the MinimalCtrSC function:

  • https://github.com/argoproj/argo-workflows/blob/release-3.6/workflow/controller/agent.go#L173
  • https://github.com/argoproj/argo-workflows/blob/release-3.6/workflow/controller/artifact_gc.go#L448
  • https://github.com/argoproj/argo-workflows/blob/main/workflow/controller/workflowpod.go#L693 which all revolves around the executor image (argoexec), which is precisely why I added UID 8737 by default in its Dockerfile.

How can I be sure that the image used in CI/CD is the one built from this pull request? Can I access the CI/CD environment, do exec inside the container somehow?

antoinetran avatar Mar 14 '25 14:03 antoinetran

How can I be sure that the image used in CI/CD is the one built from this pull request? Can I access the CI/CD environment, do exec inside the container somehow?

The CICD builds the image first then imports to k3d cluster for e2e test (all from current branch)

tczhao avatar Apr 01 '25 15:04 tczhao

How can I be sure that the image used in CI/CD is the one built from this pull request? Can I access the CI/CD environment, do exec inside the container somehow?

The CICD builds the image first then imports to k3d cluster for e2e test (all from current branch)

Thanks for the doc of how to run e2e test locally. I need time to setup the test environment. Probably there won't be movement here until some time, hope I get time!

antoinetran avatar Apr 03 '25 07:04 antoinetran

Ok I could test with github codespace, what an amazing tool!! Could spawn a k3s cluster and reproduce the issue of test TestGlobalArtifactPassing:

/bin/sh: 1: cannot create /tmp/artifact.txt: Permission denied

The main container is run as root. The permission is:

+ ls -al /tmp
total 8
drwxrwxrwt 1 root root 4096 Apr  7 17:35 .
drwxr-xr-x 1 root root 4096 Apr  7 17:35 ..
-rw------- 1 8737 root    0 Apr  7 17:34 artifact.txt

So that means, before the PR, the UID of Argo retrieving the artifact is root. Then, because the owner of artifact.txt is root and permission is 600, then root can write to it. After this PR, the UID of Argo retrieving the artifact is 8737, which is, I believe, good that it is not root. However now the root user cannot modify the file because the group permission is 0!

antoinetran avatar Apr 07 '25 15:04 antoinetran

I tried doing chmod in code for group, so that we have:

-rw-rw-r-- 1 8737 root    0 Apr  7 17:34 artifact.txt

I still get permission denied. After losing a few hairs, I found out this is because of sysctl protected_regular (see here). In /tmp in particular, noone can modify any file, even root, if the owner is not the user. Even if we have 777.

I changed the test workflow so that we get the input in /data/artifact.txt instead of /tmp/artifact.txt and now it works.

The fundamental question for me is: is-it ok if we change the default behavior from

-rw------- 1 root root    0 Apr  7 17:34 artifact.txt

to

-rw-rw-r-- 1 8737 root    0 Apr  7 17:34 artifact.txt

? In most workflows, this will work unless they are using /tmp.

antoinetran avatar Apr 08 '25 15:04 antoinetran

I keep having failure or pending phase for the step onExit at TestStopBehavior test. In github codespace, I ran multiple time the test. Sometime it works, sometime it fails with "failed" phase, sometime with "pending". When I look at the pod logs, nothing is wrong. Exit code 0. I tried changing timeout of the test, no change. I start to believe this test failure has nothing to do with my modification.

@tczhao any idea?

antoinetran avatar Apr 09 '25 11:04 antoinetran

I start to believe this test failure has nothing to do with my modification. Yea, it was due to a flaky test.

tczhao avatar Apr 13 '25 04:04 tczhao