Git Resolver - git binary not reaping zombie processes
When the git resolver switched to using the git binary, it introduced an issue where every git-based ResolutionRequest results in an orphaned zombie process on the pod. This is caused by `git remote-https` forking to `git-remote-https` and orphaning the fork before it completes. Since `git clone` depends on this forking behavior to clone a repo, and the resolvers binary/image does not have any init process or zombie reaper, these zombies accumulate until the resolver container runs out of PIDs and can no longer resolve git resolution requests.
The only workaround to get the resolver working again is to restart the pod/container.
There are a couple ways this can be solved and I think it's worth discussing.
- Option 1: Revert the switch from `go-git` to the `git` binary and accept the memory leak
  - If Option 1 is not chosen, unless this can be fixed quite quickly I believe we should at least put the `git` binary git-resolver implementation behind a feature flag in the next patch release.
- Option 2: Use an init process such as `tini` in the resolvers image to reap the processes. This does not appear to be possible using ko.
- Option 3: Modify the resolvers cmd so that it spawns or doubles as a zombie reaper (a minimal sketch follows this list).
  - Go-reaper has one example in its README of how to have the command reap zombies without interfering with the subprocesses
- Option 4: Include a check for this in the resolver's healthcheck: if 4-5 child processes cannot be created simultaneously, then the pod is unhealthy. (Since git resolution spawns 4-5 processes and only one of the grandchildren becomes a zombie, there will always be at least 3-4 PIDs available, so you have to spawn half a dozen or so to check for exhaustion.)
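For Option 3, here is a minimal sketch of what a reaper inside the resolvers `main` could look like. This is only illustrative, not the project's actual code: it assumes the resolver runs as PID 1 in the container (so orphaned `git-remote-https` forks get re-parented to it), and a naive wait loop like this can race with `os/exec`'s own `Wait` calls for the resolver's direct children, which is exactly the concern go-reaper's README addresses.

```go
package main

import (
	"os"
	"os/signal"
	"syscall"
)

// reapZombies handles SIGCHLD and repeatedly calls wait4 until no exited
// children remain, so defunct git-remote-https processes adopted by PID 1
// do not accumulate and exhaust the container's PIDs.
func reapZombies() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGCHLD)
	for range sigs {
		for {
			var status syscall.WaitStatus
			// -1 means "any child"; WNOHANG returns immediately when
			// nothing else has exited yet.
			pid, err := syscall.Wait4(-1, &status, syscall.WNOHANG, nil)
			if pid <= 0 || err != nil {
				break
			}
		}
	}
}

func main() {
	// Only reap when we are actually PID 1; otherwise the container's
	// init process is already responsible for this.
	if os.Getpid() == 1 {
		go reapZombies()
	}
	// ... start the resolver controller as usual.
}
```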
Expected Behavior
When a git-resolver ResolutionRequest is resolved, it should have no persistent side effects on the resolver container.
Actual Behavior
When a git-resolver ResolutionRequest is resolved, one orphaned zombie process is created. After a large number of these requests are made, the git resolver is unable to resolve any ResolutionRequests.
Steps to Reproduce the Problem
- Have access to the nodes for a k8s cluster with Tekton running and the git-resolver enabled (a local kind cluster works)
- On the node which is running the resolvers container/pod, running `ps afux` (or `ps o user,pgid,ppid,pid,command f U <user-id>` if the user-id of the container runtime is known) should show the resolvers process with no children. E.g.:
65532 798458 0.1 0.3 2451296 126632 ? Sl Jun13 4:52 /ko-app/resolvers
- Use `kubectl create` to create a ResolutionRequest like this:
apiVersion: resolution.tekton.dev/v1beta1
kind: ResolutionRequest
metadata:
  labels:
    resolution.tekton.dev/type: git
  generateName: git-test-zombie-
  namespace: default
spec:
  params:
    - name: url
      value: https://github.com/tektoncd/catalog.git
    - name: revision
      value: main
    - name: pathInRepo
      value: task/git-clone/0.9/git-clone.yaml
- Use the `ps` command again to observe the resolvers app. Depending on the timing, you may see the child `git` processes while they're in use:
$ ps fu U 65532
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
65532 59727 0.5 0.4 2245120 137360 ? Ssl 15:54 0:07 /ko-app/resolvers
65532 73989 2.7 0.0 21000 5692 ? Sl 16:17 0:00 \_ git -C /tmp/catalog.git-3627028645 clone https://github.com/tektoncd/catalog.git /tmp/catalog.git-3627028645 --depth=1 --no-checkout
65532 73992 0.0 0.0 12804 4836 ? S 16:17 0:00 \_ /usr/libexec/git-core/git remote-https origin https://github.com/tektoncd/catalog.git
65532 73994 11.1 0.0 88988 10676 ? S 16:17 0:00 | \_ /usr/libexec/git-core/git-remote-https origin https://github.com/tektoncd/catalog.git
65532 74047 16.4 0.0 14308 5908 ? R 16:17 0:00 \_ /usr/libexec/git-core/git --shallow-file /tmp/catalog.git-3627028645/.git/shallow.lock index-pack --stdin --fix-thin --keep=fetch-pack 482 on tekton-pipelines-remote-resolvers-546c458b47-jqbch
However, once the resolution request is complete, you will see the zombie process created:
$ ps fu U 65532
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
65532 59727 0.5 0.4 2245120 137360 ? Ssl 15:54 0:07 /ko-app/resolvers
65532 73989 2.6 0.0 21000 5820 ? S 16:17 0:00 \_ git -C /tmp/catalog.git-3627028645 clone https://github.com/tektoncd/catalog.git /tmp/catalog.git-3627028645 --depth=1 --no-checkout
65532 73992 0.0 0.0 0 0 ? Z 16:17 0:00 \_ [git] <defunct>
65532 74047 20.2 0.0 440308 6676 ? D 16:17 0:00 \_ /usr/libexec/git-core/git --shallow-file /tmp/catalog.git-3627028645/.git/shallow.lock index-pack --stdin --fix-thin --keep=fetch-pack 482 on tekton-pipelines-remote-resolvers-546c458b47-jqbch
- Note that a short time later the defunct process will be adopted by the `/ko-app/resolvers` process, since it has PID 1 in the container, and will remain there indefinitely
Additional Info
- Kubernetes version:
  Output of `kubectl version`:
$ kubectl version
Client Version: v1.32.0
Kustomize Version: v5.5.0
Server Version: v1.32.0
- Tekton Pipeline version:
  Output of `tkn version` or `kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'`:
$ tkn version
Client version: 0.41.0
Pipeline version: v1.0.0
Dashboard version: v0.55.0
CC @vdemeester @waveywaves
Hi @aThorp96 ! Thanks for the great explanation.
I was thinking about Option 2. You can change the base image in ko builds, which you did in your PR to add the git binary.
Then one could use a multi-stage build with C/C++ for building tini, combined with a multi-stage build for Go for the resolvers, and start them both (tini and resolver) in a last stage, which could be the Chainguard image you're already using.
So the upside to Option 2 is not having to change the code, but the downside is having to start using a Dockerfile just because of the git-resolver bug.
Then Option 3 is probably better. The upside is no changes to the base image, but the downside is the change to the resolver main.
I am for Opt. 2 or 3.
What do the others think?
Thanks @aThorp96 for the issue! I would vote for one of the following:
- option 2. without Dockerfile or multi-stage build. For this, we could build an "apko tini base image" in `tektoncd/plumbing` for example, and use it for the resolver(s) image
- option 3. as is

I think I would choose option 3 personally, but I do like option 2 as well.
> - option 2. ... "apko tini base image" in `tektoncd/plumbing`
Using a base image with something like tini seems like the most pragmatic solution. I can take this on. Is there a need to discuss this in the working group call tomorrow?
@aThorp96 yes we can discuss that tomorrow, but essentially there is almost everything to do. We can either use dogfooding or github workflows for building the base image (but we already have the mechanisms to build images in dogfooding). We could rely on a Dockerfile or "innovate" with apko, I don't have a strong opinion on either.
"innovate" with apko
I am for "innovate" with apko :)
It seems that TektonHub also experiences many zombie processes for similar reasons.
See https://github.com/tektoncd/plumbing/pull/2690, I went for the simplest fix.