
Git Resolver - git binary not reaping zombie processes

Open · aThorp96 opened this issue 6 months ago · 7 comments

When the git resolver switched to using the git binary, it introduced an issue where every git-based ResolutionRequest leaves an orphaned zombie process on the pod. The cause is that git remote-https forks to git-remote-https and orphans the fork before it completes. git clone depends on this forking behavior to clone a repo, and the resolvers binary/image has no init process or zombie reaper, so as these zombies accumulate, the resolver container eventually runs out of PIDs and is unable to resolve git resolution requests.

The only workaround to get the resolver working again is to restart the pod/container.

There are a couple of ways this could be solved, and I think it's worth discussing:

  • Option 1: Revert the switch from go-git to the git binary and accept the memory leak.
    • If Option 1 is not chosen, then unless this can be fixed quite quickly, I believe we should at least put the git-binary git-resolver implementation behind a feature flag in the next patch release.
  • Option 2: Use an init process such as tini in the resolvers image to reap the zombies. This does not appear to be possible using ko.
  • Option 3: Modify the resolvers cmd so that it spawns, or doubles as, a zombie reaper.
    • go-reaper's README has an example of how to reap zombies without interfering with subprocesses.
  • Option 4: Include a check for this in the resolver's healthcheck: if 4-5 child processes cannot be created simultaneously, the pod is unhealthy. (Since git resolution spawns 4-5 processes and only one of the grandchildren becomes a zombie, there will always be at least 3-4 PIDs available, so the check has to spawn half a dozen or so to detect exhaustion.)

Expected Behavior

When a git-resolver ResolutionRequest is resolved, it should have no persistent side effects on the resolver container.

Actual Behavior

When a git-resolver ResolutionRequest is resolved, one orphaned zombie process is created. After a large number of these requests are made, the git resolver is unable to resolve any ResolutionRequests.

Steps to Reproduce the Problem

  1. Have access to the nodes of a k8s cluster with Tekton running and the git-resolver enabled (a local kind cluster works)
  2. On the node running the resolvers container/pod, running ps afux (or ps o user,pgid,ppid,pid,command f U <user-id> if the user-id of the container runtime is known) should show the resolvers process with no children. E.g.:
65532     798458  0.1  0.3 2451296 126632 ?      Sl   Jun13   4:52              /ko-app/resolvers
  3. Use kubectl create to create a ResolutionRequest like this:
apiVersion: resolution.tekton.dev/v1beta1
kind: ResolutionRequest
metadata:
  labels:
    resolution.tekton.dev/type: git
  generateName: git-test-zombie-
  namespace: default
spec:
  params:
  - name: url
    value: https://github.com/tektoncd/catalog.git
  - name: revision
    value: main
  - name: pathInRepo
    value: task/git-clone/0.9/git-clone.yaml
  4. Use the ps command again to observe the resolvers app. Depending on the timing, you may see the child git processes while they're in use:
$  ps fu U 65532
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
65532      59727  0.5  0.4 2245120 137360 ?      Ssl  15:54   0:07 /ko-app/resolvers

65532      73989  2.7  0.0  21000  5692 ?        Sl   16:17   0:00  \_ git -C /tmp/catalog.git-3627028645 clone https://github.com/tektoncd/catalog.git /tmp/catalog.git-3627028645 --depth=1 --no-checkout
65532      73992  0.0  0.0  12804  4836 ?        S    16:17   0:00      \_ /usr/libexec/git-core/git remote-https origin https://github.com/tektoncd/catalog.git
65532      73994 11.1  0.0  88988 10676 ?        S    16:17   0:00      |   \_ /usr/libexec/git-core/git-remote-https origin https://github.com/tektoncd/catalog.git
65532      74047 16.4  0.0  14308  5908 ?        R    16:17   0:00      \_ /usr/libexec/git-core/git --shallow-file /tmp/catalog.git-3627028645/.git/shallow.lock index-pack --stdin --fix-thin --keep=fetch-pack 482 on tekton-pipelines-remote-resolvers-546c458b47-jqbch

However, once the resolution request is complete, you will see the zombie process:

$ ps fu U 65532
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
65532      59727  0.5  0.4 2245120 137360 ?      Ssl  15:54   0:07 /ko-app/resolvers
65532      73989  2.6  0.0  21000  5820 ?        S    16:17   0:00  \_ git -C /tmp/catalog.git-3627028645 clone https://github.com/tektoncd/catalog.git /tmp/catalog.git-3627028645 --depth=1 --no-checkout
65532      73992  0.0  0.0      0     0 ?        Z    16:17   0:00      \_ [git] <defunct>
65532      74047 20.2  0.0 440308  6676 ?        D    16:17   0:00      \_ /usr/libexec/git-core/git --shallow-file /tmp/catalog.git-3627028645/.git/shallow.lock index-pack --stdin --fix-thin --keep=fetch-pack 482 on tekton-pipelines-remote-resolvers-546c458b47-jqbch
  5. Note that a short time later the defunct process is adopted by the /ko-app/resolvers process, since it has PID 1 in the container, and will remain there indefinitely

Additional Info

  • Kubernetes version:

    Output of kubectl version:

$ kubectl version
Client Version: v1.32.0
Kustomize Version: v5.5.0
Server Version: v1.32.0
  • Tekton Pipeline version:

    Output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'

$ tkn version
Client version: 0.41.0
Pipeline version: v1.0.0
Dashboard version: v0.55.0

aThorp96 avatar Jun 16 '25 14:06 aThorp96

CC @vdemeester @waveywaves

aThorp96 avatar Jun 16 '25 15:06 aThorp96

Hi @aThorp96! Thanks for the great explanation.

I was thinking about Option 2. You can change the base image in ko builds, which you did in your PR to add the git binary.

One could then use a multi-stage C/C++ build for tini, combined with a multi-stage Go build for the resolvers, and start both (tini and the resolver) in a final stage, which could be the Chainguard image you're already using.

So the upside of Option 2 is not having to change the code, but the downside is having to start using a Dockerfile just because of the git-resolver bug.

Option 3 is probably better, then. The upside is no changes to the base image, but the downside is the change to the resolver main.

I am for Opt. 2 or 3.

What do the others think?

twoGiants avatar Jun 17 '25 09:06 twoGiants

Thanks @aThorp96 for the issue 👼🏼 I would vote for one of the following:

  • Option 2, without a Dockerfile or multi-stage build. For this, we could build an "apko tini base image" in tektoncd/plumbing, for example, and use it for the resolver(s) image
  • Option 3, as is

I think I would choose Option 3 personally, but I do like Option 2 as well.
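For illustration, an "apko tini base image" along these lines might look something like the config below. This is a hypothetical sketch, not a tested config: the Wolfi repository URL, keyring, and package names (tini, git, wolfi-baselayout) are assumptions to verify against the actual Wolfi package index.

```yaml
# Hypothetical apko config for a tini-based resolver base image.
# Repository/package names are assumptions; verify against Wolfi.
contents:
  repositories:
    - https://packages.wolfi.dev/os
  keyring:
    - https://packages.wolfi.dev/os/wolfi-signing.rsa.pub
  packages:
    - wolfi-baselayout
    - tini
    - git
entrypoint:
  command: /sbin/tini --
```

ko would then layer the resolvers binary on top of this base, and tini as PID 1 would reap the orphaned git-remote-https zombies.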

vdemeester avatar Jun 17 '25 10:06 vdemeester

  • option 2. ... "apko tini base image" in tektoncd/plumbing

👍

twoGiants avatar Jun 17 '25 13:06 twoGiants

Using a base image with something like tini seems like the most pragmatic solution. I can take this on. Is there need to discuss in the working group call tomorrow?

aThorp96 avatar Jun 17 '25 14:06 aThorp96

@aThorp96 yes, we can discuss that tomorrow, but essentially there is almost everything still to do 😛 We can either use dogfooding or GitHub workflows for building the base image (but we already have the mechanisms to build images in dogfooding). We could rely on a Dockerfile or "innovate" with apko; I don't have a strong opinion on either 😛

vdemeester avatar Jun 17 '25 16:06 vdemeester

"innovate" with apko

I am for "innovate" with apko :)

twoGiants avatar Jun 17 '25 18:06 twoGiants

It seems that TektonHub also experiences many zombie processes for similar reasons.

l-qing avatar Jun 29 '25 13:06 l-qing

See https://github.com/tektoncd/plumbing/pull/2690, I went for the simplest fix.

vdemeester avatar Jul 16 '25 10:07 vdemeester