fix: add containerd support

sbaier1 opened this issue 3 years ago • 4 comments

This is a very simple fix/workaround for allowing profiling on clusters that use containerd as the container runtime.

The user has to pass the containerd runtime path as the "docker-path".

There are of course more direct approaches to this, but this is the one that requires the least change in the current codebase (though the name docker-path becomes a bit meaningless as a result).
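
For context, a minimal sketch of the idea, assuming containerd's v2 task layout; the function name and path layout here are illustrative assumptions, not the actual kubectl-flame code:

package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// containerdRootFS derives a target container's root filesystem from the
// runtime path passed via --docker-path. The io.containerd.runtime.v2.task
// layout below is an assumption for illustration.
func containerdRootFS(runtimePath, containerID string) string {
	id := strings.TrimPrefix(containerID, "containerd://")
	return filepath.Join(runtimePath, "io.containerd.runtime.v2.task", "k8s.io", id, "rootfs")
}

func main() {
	fmt.Println(containerdRootFS("/run/containerd", "containerd://<container-id>"))
	// /run/containerd/io.containerd.runtime.v2.task/k8s.io/<container-id>/rootfs
}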

Tested the new agent image on both dockerd and containerd clusters; it works on both.

Let me know what you think. If necessary, I can also adapt the implementation a bit.

closes #69

I confirm that this contribution is made under the terms of the license found in the root directory of this repository's source tree and that I have the authority necessary to make this contribution on behalf of its copyright owner.

sbaier1 avatar Jul 14 '22 19:07 sbaier1

Great work so far! (I'm not a maintainer of this project, so this might not help with getting the PR merged.)

I'm not really into Go, so this might be totally unrelated, but I'm getting an InvalidImageName error when I try your changes.

Here's what I see on a JVM-based pod.

Events:
  Type     Reason         Age               From     Message
  ----     ------         ----              ----     -------
  Warning  InspectFailed  4s (x7 over 75s)  kubelet  Failed to apply default image tag "verizondigital/kubectl-flame:-jvm": couldn't parse image reference "verizondigital/kubectl-flame:-jvm": invalid reference format
  Warning  Failed         4s (x7 over 75s)  kubelet  Error: InvalidImageName

The problem seems to be related to the leading - in the image tag.
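
For what it's worth, an empty version string would produce exactly this tag if the image name is assembled by simple concatenation (an assumption; I haven't checked the actual code):

package main

import "fmt"

func main() {
	version := "" // empty, e.g. when the plugin binary was built without version info
	image := fmt.Sprintf("verizondigital/kubectl-flame:%s-jvm", version)
	fmt.Println(image) // verizondigital/kubectl-flame:-jvm -> invalid reference format
}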

@sbaier1 Do you have an idea what's causing this problem?

pvorb avatar Sep 01 '22 18:09 pvorb

I could work around this problem by providing the image explicitly:

kubectl flame mypod -t 1m --lang java --image verizondigital/kubectl-flame:v0.2.4-jvm --docker-path /run/containerd

I guess that this problem is not related to this PR.

pvorb avatar Sep 01 '22 19:09 pvorb

Now, the logs of the pod tell me the following:

{"type":"progress","data":{"time":"2022-09-01T19:10:00.633813759Z","stage":"started"}}
{"type":"error","data":{"reason":"open /var/lib/docker/image/overlay2/layerdb/mounts/containerd://60f8811d44987c163e0392b3bef870b2652b63d3874c5d0f7b3e0f75779d012d/mount-id: no such file or directory"}}

And the pod immediately fails.
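
Judging from the error path, the agent seems to build docker's layerdb lookup from the raw container ID; under containerd the ID still carries its scheme prefix and the docker directory doesn't exist at all. A hypothetical reconstruction of the lookup:

package main

import (
	"fmt"
	"io/ioutil"
)

func main() {
	// under containerd the ID keeps its "containerd://" prefix
	containerID := "containerd://<container-id>"
	fileName := fmt.Sprintf("/var/lib/docker/image/overlay2/layerdb/mounts/%s/mount-id", containerID)
	if _, err := ioutil.ReadFile(fileName); err != nil {
		fmt.Println(err) // open ...: no such file or directory
	}
}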

@sbaier1 Any ideas?

pvorb avatar Sep 01 '22 19:09 pvorb

The mount-id error above is fixed by my proposed change to filesystem.go.
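
For readers following along, a sketch of the kind of change described, branching on the runtime prefix instead of always reading docker's layerdb (the shape and paths are assumptions; see the actual patch for the real code):

package main

import (
	"fmt"
	"io/ioutil"
	"path/filepath"
	"strings"
)

// containerMountPath resolves the target filesystem per runtime. The containerd
// task layout and docker's overlay2 "merged" dir are illustrative assumptions.
func containerMountPath(containerID string) (string, error) {
	if strings.HasPrefix(containerID, "containerd://") {
		id := strings.TrimPrefix(containerID, "containerd://")
		return filepath.Join("/run/containerd", "io.containerd.runtime.v2.task", "k8s.io", id, "rootfs"), nil
	}
	id := strings.TrimPrefix(containerID, "docker://")
	fileName := fmt.Sprintf("/var/lib/docker/image/overlay2/layerdb/mounts/%s/mount-id", id)
	mountID, err := ioutil.ReadFile(fileName)
	if err != nil {
		return "", err
	}
	return filepath.Join("/var/lib/docker/overlay2", strings.TrimSpace(string(mountID)), "merged"), nil
}

func main() {
	path, err := containerMountPath("containerd://<container-id>")
	fmt.Println(path, err)
}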

pvorb avatar Sep 02 '22 21:09 pvorb

Great to see this issue is being addressed. Is there any reason the pull request hasn't been merged yet?

benjaminxie avatar Oct 31 '22 18:10 benjaminxie

Unfortunately, it seems this repo currently has no maintainers, so no one who can actually merge the PR is reviewing it.

I'd be happy to jump back into it if someone maintaining the project would respond; right now the project seems doomed overall.

sbaier1 avatar Oct 31 '22 18:10 sbaier1

Very spooky to find an abandoned github repo on Halloween.

This is sad and unfortunate news but thanks for the quick reply.

benjaminxie avatar Oct 31 '22 19:10 benjaminxie

A very spooky Halloween indeed :ghost:

We just upgraded our Kubernetes clusters to use the containerd runtime instead of Docker. Would love to see this MR merged to get support for this, but the project being dead is unfortunate...

The maintainer in the readme is @edeNFed

QuinnBast avatar Nov 01 '22 23:11 QuinnBast

@sbaier1 Well, tagging the maintainer worked! But it looks like they merged it in, and the pipeline failed to generate a release :(

QuinnBast avatar Nov 02 '22 16:11 QuinnBast

And my suggestions simply got ignored. :confused:

pvorb avatar Nov 02 '22 22:11 pvorb

@QuinnBast which language are you trying to profile? The pipeline did manage to push the container images for JVM, JVM Alpine, BPF and Python. For example, the jvm image was pushed so you could use it with the command

kubectl flame mypod -t 1m --lang java --image verizondigital/kubectl-flame:v0.2.5-jvm --docker-path /run/containerd

(Note the tag, particularly the 5 in v0.2.5.) If you're lucky, you won't need pvorb's suggested code changes.

This works because the code changes in this merged PR exclusively deal with these container images.

@pvorb It's a shame your suggestions got ignored, but if it's any consolation, they were very helpful to me. I used them in my own fork of the repo, created my own image, and finally got the agent pod running. (Too bad the resulting flame graph was empty.)

benjaminxie avatar Nov 03 '22 22:11 benjaminxie

Thanks @benjaminxie! That command worked; however, it gets stuck at the profiling step:

$ kubectl flame myPod -t 1m --lang java --image myRegistry/library/verizondigital/kubectl-flame:v0.2.5-jvm --docker-path /run/containerd
Verifying target pod ... ✔
Launching profiler ... ✔
Profiling ...

The profiler pod does start, and I shelled into it and found the flame graph at /tmp/flamegraph.svg. However, after copying the file to my local machine, the flame graph appears to be empty.

QuinnBast avatar Nov 03 '22 23:11 QuinnBast

Yes, exactly. @QuinnBast I've been struggling with problems like these as well, but haven't been able to solve them so far.

pvorb avatar Nov 04 '22 05:11 pvorb

@QuinnBast @pvorb Someone else ran into this issue a while back ("no data", #73). Let's continue the conversation there. I think I may have some relevant observations.

benjaminxie avatar Nov 04 '22 21:11 benjaminxie

Having an issue with the updated image (kubectl-flame:v0.2.5-jvm) meant to support containerd. While it stopped the outright failure, I'm now seeing the flame pod fail with an exit code of 255. It looks like @pvorb's requested changes, including the following, are included:

mountId, err := ioutil.ReadFile(fileName)
if err != nil {
	return "", err
}

But I'm still seeing a 255 exit status (see below):

[XXXXXXXXX ~]$ kubectl describe pod kubectl-flame-2e6ee77f-a3d8-4a47-a8f7-4ecd8668abf6-c42wn -n digital
Name:         kubectl-flame-2e6ee77f-a3d8-4a47-a8f7-4ecd8668abf6-c42wn
Namespace:    digital
Priority:     0
Node:         aks-genericdev2-29844296-vmss00000s/10.124.178.217
Start Time:   Mon, 12 Dec 2022 15:25:36 -0500
Labels:       controller-uid=78458314-1f8c-4fb1-8377-873d7c188388
              job-name=kubectl-flame-2e6ee77f-a3d8-4a47-a8f7-4ecd8668abf6
              kubectl-flame/id=2e6ee77f-a3d8-4a47-a8f7-4ecd8668abf6
Annotations:  sidecar.istio.io/inject: false
Status:       Failed
IP:           10.124.178.227
IPs:
  IP:  10.124.178.227
Controlled By:  Job/kubectl-flame-2e6ee77f-a3d8-4a47-a8f7-4ecd8668abf6
Containers:
  kubectl-flame:
    Container ID:  containerd://0e85d4002079c1b3e16fde732d8fb161f9bf1b99ba67dcb1c64843531173401c
    Image:         verizondigital/kubectl-flame:v0.2.5-jvm
    Image ID:      docker.io/verizondigital/kubectl-flame@sha256:aa4eb0f6fc0bae768d1c558bca27fd645e8f08e89e91b4c19d891562935bdbfd
    Port:          <none>
    Host Port:     <none>
    Command:
      /app/agent
    Args:
      2e6ee77f-a3d8-4a47-a8f7-4ecd8668abf6
      d69abd81-7f2f-4f07-abec-ad16aff1ca37
      digital-srv-bff-service
      containerd://0fabb245c0c9ef0788e7edbcd2cde9e3fe2cdd01128a50e2b6fa96d89410294a
      1m0s
      java
      cpu
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 12 Dec 2022 15:25:37 -0500
      Finished:     Mon, 12 Dec 2022 15:25:37 -0500
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/docker from target-filesystem (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-c9sx6 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  target-filesystem:
    Type:          HostPath (bare host directory volume)
    Path:          /run/containerd
    HostPathType:
  kube-api-access-c9sx6:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason   Age  From     Message
  ----    ------   ---  ----     -------
  Normal  Pulling  22s  kubelet  Pulling image "verizondigital/kubectl-flame:v0.2.5-jvm"
  Normal  Pulled   22s  kubelet  Successfully pulled image "verizondigital/kubectl-flame:v0.2.5-jvm" in 107.829551ms
  Normal  Created  22s  kubelet  Created container kubectl-flame
  Normal  Started  22s  kubelet  Started container kubectl-flame

[XXXXXXXXXX ~]$ kubectl logs kubectl-flame-2e6ee77f-a3d8-4a47-a8f7-4ecd8668abf6-c42wn -n digital
{"type":"progress","data":{"time":"2022-12-12T20:25:37.220351389Z","stage":"started"}}
{"type":"error","data":{"reason":"exit status 255"}}

Has anyone resolved this or seen the same?

Running in AKS, with the following runtimes:

System Info:
  Machine ID:                 88e8a329cbd84a22b389208937c90476
  System UUID:                3da54c3e-3793-42e5-a949-caef8484533e
  Boot ID:                    6bb95a01-f8c7-4c6b-a66e-70ff03cb2c8d
  Kernel Version:             5.4.0-1094-azure
  OS Image:                   Ubuntu 18.04.6 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.4+azure-4
  Kubelet Version:            v1.23.12
  Kube-Proxy Version:         v1.23.12

chrisc1122 avatar Dec 12 '22 23:12 chrisc1122

Can this be leveraged to work with CRI-O?

tony-clarke-amdocs avatar Dec 16 '22 23:12 tony-clarke-amdocs