Support for running in nested container setups (kind/minikube)
Hey,
I am trying to get the profiler working in a kind cluster. After searching through the docs (and GitHub issues) I didn't see anything about it being supported or not. The reason I am interested in kind/minikube support is that it makes it very easy to quickly try out setups in a local environment.
After playing around with it, I now understand it is not supported. The issue is pretty much the same as pointed out by kdrag0n in this comment: https://github.com/open-telemetry/opentelemetry-ebpf-profiler/issues/170#issuecomment-2771201052
Pyroscope fixed a similar issue not too long ago in https://github.com/grafana/pyroscope/pull/3008
I was looking through the code to see if there might be an easy way to add support for this, and from what I can tell, it should be possible to replace almost all invocations of collect_trace with a PID retrieved from the current process, adjusted for the PID namespace of interest (e.g. the one the kind cluster is running in).
The only issue from my PoV is the sched_process_free tracepoint (source), as the freed PID gets passed in via the arguments, and from what I can tell, the current PID cannot be inferred by getting the current task struct, as the invocation is delayed.
I agree. This may be useful for testing and adoption. It should be doable, but it should be optional, disabled by default, and dead-code-eliminated by the kernel JIT when disabled. I could draft a PR if we agree this is something that could be merged.
@patrickpichler can you share the k8s manifest that you use to deploy the eBPF profiler?
E.g. is hostPID set, which capabilities do you use, and which volumes do you mount to allow proper access?
Here is a snippet from a k8s manifest that I use and which does not cause such issues:
```yaml
spec:
  hostPID: true
  containers:
    - name: otelcol-profiler
      image: <myCustomImage>
      securityContext:
        procMount: "Unmasked"
        privileged: true
        capabilities:
          add:
            - SYS_ADMIN
      resources:
        limits:
          memory: 500Mi
          cpu: "2"
        requests:
          memory: 500Mi
          cpu: "2"
      env:
        - name: KUBERNETES_NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
      volumeMounts:
        - name: debugfs
          mountPath: /sys/kernel/debug
  serviceAccountName: otelcol-profiler
  volumes:
    - name: debugfs
      hostPath:
        path: /sys/kernel/debug
        type: Directory
```
@florianl do you use it in a kind cluster?
> do you use it in a kind cluster?
Yes, also in a kind cluster. But things might be different for people using a non-Linux OS.
Nice! Last time I tried (which was a while ago) it did not work for me, but I probably did not have procMount: "Unmasked", which may be the trick required.
Thanks for the swift response!
I did try to run it using a (slightly modified, but in no way that should impact this) version of Grafana's manifest: https://github.com/grafana/pyroscope/blob/main/examples/grafana-alloy-auto-instrumentation/ebpf-otel/kubernetes/profiler.yaml
What version of kind are you using? For version v1.33.1, setting procMount: Unmasked only works in combination with enabling user namespaces (which in turn doesn't allow hostPID: true). I think this behavior changed with v1.33 as KEP-4265 was enabled.
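For reference, the conflicting combination described above would look roughly like this in a pod spec (an untested sketch based on my reading, not a verified configuration):

```yaml
spec:
  # Opt the pod into a user namespace; on v1.33+ this appears to be
  # required for procMount: Unmasked to take effect.
  hostUsers: false
  # hostPID: true is rejected in combination with hostUsers: false,
  # which is exactly the conflict described above.
  containers:
    - name: otelcol-profiler
      securityContext:
        procMount: "Unmasked"
```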
I now also spun up a cluster using v1.30.13. There procMount: Unmasked works without hostUsers: false, but I do not get any data for running services (besides one containerd container, for whatever reason; my best guess is that its PID matches something on the host).
I should have some time later in the day, I could try creating a playground environment in a GH action.
Ok, I have now set up a test environment in GH actions to quickly test things, in the following repo: https://github.com/patrickpichler/otel-ebpf-profiler-kind-playground/
To keep things simpler, I decided to add the opentelemetry-ebpf-profiler repo as a git submodule. The image used in the test is built from profiler.Dockerfile (it is pretty much the same Dockerfile as in the profiler repo, plus two additional stages for building and for the resulting image; it is also designed to easily build for multiple architectures, since I run ARM64 locally).
I have a few actions in there:
All three create a cluster and deploy an OTel collector + the eBPF profiler; to generate a bit of load, stress is also deployed (you can find all the manifests here). Since I didn't get procMount: "Unmasked" running with Kubernetes v1.33, only the Kind (v1.30.13) config uses it (I decided to go with kustomize for this, so you can find the overlay here).
The OTel collector is configured with a file exporter, which writes the results to a hostPath. All of this then runs for 2 minutes, after which the whole namespace gets deleted and the OTel export file is collected and uploaded as an artifact of the run.
Looking at the raw data, I can see that only the Minikube (driver=none) action collected profiles for the stress container.
If anyone knows a nice tool to transform such OTEL resourceProfile traces into e.g. a flamegraph, I would be more than happy to hear about it 😅
@florianl can you maybe have a look at the setup? I might be doing something wrong here with the way I deploy the eBPF profiler.
Just a quick look and no thorough investigation:
- For a timeframe of 120 seconds, all your actions produce only ~15 reports in their respective otel-event-data. Maybe start collecting the logs of the eBPF profiler as well.
- I would recommend looking into k8sattributes and enriching the resource profiles:
```yaml
k8sattributes:
  auth_type: "serviceAccount"
  passthrough: false
  filter:
    node_from_env_var: KUBERNETES_NODE_NAME
  extract:
    metadata:
      - k8s.pod.name
      - k8s.pod.uid
      - k8s.deployment.name
      - k8s.namespace.name
      - service.namespace
      - service.name
      - service.version
      - service.instance.id
    labels:
      - tag_name: app.label.component
        key: app.kubernetes.io/component
        from: pod
    otel_annotations: true
  pod_association:
    - sources:
        - from: resource_attribute
          name: container.id
```
I now added the profiler logs to the archive (and also print them in the action; the step is called Profiler logs). From a quick glimpse (I still need to spend more time to properly understand the full log output; it is quite a lot), all of the containerized installations report a lot of log lines like:
```
Skip process exit handling for unknown PID 3757
```
I also added the k8sattributes processor, and from the experiments it seems that only the [Minikube (driver=none)](https://github.com/patrickpichler/otel-ebpf-profiler-kind-playground/actions/runs/16348474941) run got any enriched details. The others do not even contain a single k8s.namespace.name attribute.
Make sure that the profiler container can access the host PID namespace, in K8s you typically have to use the hostPID configuration option. If that's not possible, then the agent will not "see" the vast majority of processes running on the node.
The profiler runs with hostPID: true (see here).
The problem is more that hostPID in the case here is not the root PID namespace of the host, but instead the PID namespace of the kind-control-plane container.
This is also what I suspect the issue is. PIDs in the eBPF code are determined using bpf_get_current_pid_tgid, which returns the PID from the view of the kernel (equivalent to the root PID namespace), whereas /proc in the profiler container (even with hostPID: true) sees it from the kind-control-plane perspective. Hence a translation like the one done in Pyroscope is needed.
Edit: I had a quick check how hostPID: true is implemented in Kubernetes.
The flag gets translated to runtimeapi.NamespaceMode_NODE (link). The CRI source has a nice comment on top of the constant (link):
> // A NODE namespace is the namespace of the Kubernetes node.
> // For example, a container with a PID namespace of NODE expects to view
> // all of the processes on the host running the kubelet.
Tracing it on containerd's side was a bit more painful. The way I understand it, when NamespaceMode_NODE is set, it filters out any namespace that would be passed on to the low-level container runtime (link). From there, when the container is started, the unix.CLONE_NEWPID flag is missing, causing it not to create a new PID namespace, so it effectively runs in the one of the host (link and link).
It is also interesting to see how procMount: Unmasked is implemented. All it effectively does is not send the list of predefined paths (link) down to the low-level runtime that would be sent in default mode; the runtime otherwise bind-mounts those paths read-only (or masks them) at the same locations (link).
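One way to probe this from inside a container: runtimes mask *file* entries such as /proc/kcore by bind-mounting /dev/null over them, so a masked entry shows up as a character device. Hypothetical helper below (directory entries like /proc/sys are handled differently, via read-only mounts, and are not covered here):

```go
package main

import (
	"fmt"
	"os"
)

// isMaskedAsNull reports whether path looks like it has been bind-mounted
// over with /dev/null, which is how runtimes mask file entries such as
// /proc/kcore in the default (masked) procMount mode. With
// procMount: Unmasked no such bind mounts are set up.
// (Illustrative helper, not part of the profiler.)
func isMaskedAsNull(path string) (bool, error) {
	fi, err := os.Stat(path)
	if err != nil {
		return false, err
	}
	// /dev/null is a character device; a real /proc/kcore is a regular file.
	return fi.Mode()&os.ModeCharDevice != 0, nil
}

func main() {
	for _, p := range []string{"/dev/null", "/proc/self/status"} {
		masked, err := isMaskedAsNull(p)
		if err != nil {
			continue
		}
		fmt.Printf("%s: char device (null-masked)? %v\n", p, masked)
	}
}
```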
You're right, I was confused and didn't realize you were referring to this.
Looks like others have run into the same issue https://github.com/kubernetes-sigs/kind/issues/3182 and upstream has no plans to address it. @florianl did you verify that it works with kind?
Without modifying the profiler's code we'll have a bunch of other issues with kind, even if we use specific K8s images with kind or somehow manage to set procMount in the pod spec. In the meantime I tried k0s and it worked pretty well:
- Create a k0s cluster (https://k0sproject.io/):

```
$ curl -sSf https://get.k0s.sh | sudo sh
$ sudo k0s install controller --single
$ sudo k0s start
$ sudo k0s kubectl get nodes
$ sudo k0s kubeconfig admin >> ~/.kube/config
```

- Deploy the OpenTelemetry Collector eBPF Profiling Distribution as a node agent (DaemonSet) or use your own deployment descriptor:

```
$ k apply -f https://raw.githubusercontent.com/danielpacak/opentelemetry-collector-ebpf-profiler/refs/heads/main/example/kubernetes/node-agent.yaml
```

- Deploy a sample app (e.g. phpMyAdmin):

```
$ k apply -f https://raw.githubusercontent.com/danielpacak/vulnerable-kubernetes-deployments/refs/heads/main/php/phpmyadmin/all.yaml
```

- Check the agent's logs (my distro has a custom profiles exporter and prints stack frames to stdout):
```
$ k logs -n node-agent collector-ebpf-profiler-v7xlg
2025-07-17T18:41:45.804Z info [email protected]/service.go:197 Setting up own telemetry... {"resource": {"service.instance.id": "74234606-eb68-4767-9fd3-91435a9113a1", "service.name": "otelcol-ebpf-profiler", "service.version": "0.129.0"}}
2025-07-17T18:41:45.805Z info builders/builders.go:26 Development component. May change in the future. {"resource": {"service.instance.id": "74234606-eb68-4767-9fd3-91435a9113a1", "service.name": "otelcol-ebpf-profiler", "service.version": "0.129.0"}, "otelcol.component.id": "customprofilesexporter", "otelcol.component.kind": "exporter", "otelcol.signal": "profiles"}
2025-07-17T18:41:45.902Z info [email protected]/service.go:257 Starting otelcol-ebpf-profiler... {"resource": {"service.instance.id": "74234606-eb68-4767-9fd3-91435a9113a1", "service.name": "otelcol-ebpf-profiler", "service.version": "0.129.0"}, "Version": "0.129.0", "NumCPU": 4}
2025-07-17T18:41:45.902Z info extensions/extensions.go:41 Starting extensions... {"resource": {"service.instance.id": "74234606-eb68-4767-9fd3-91435a9113a1", "service.name": "otelcol-ebpf-profiler", "service.version": "0.129.0"}}
2025-07-17T18:41:45.902Z info [email protected]/exporter.go:25 Starting custom profiles exporter... {"resource": {"service.instance.id": "74234606-eb68-4767-9fd3-91435a9113a1", "service.name": "otelcol-ebpf-profiler", "service.version": "0.129.0"}, "otelcol.component.id": "customprofilesexporter", "otelcol.component.kind": "exporter", "otelcol.signal": "profiles"}
time="2025-07-17T18:41:45Z" level=info msg="Interpreter tracers: perl,php,python,hotspot,ruby,v8,dotnet,go,labels"
time="2025-07-17T18:41:51Z" level=info msg="Found offsets: task stack 0x20, pt_regs 0x3f58, tpbase 0x1528"
time="2025-07-17T18:41:51Z" level=info msg="Supports generic eBPF map batch operations"
time="2025-07-17T18:41:51Z" level=info msg="Supports LPM trie eBPF map batch operations"
time="2025-07-17T18:41:51Z" level=info msg="eBPF tracer loaded"
[... trimmed ...]
------------------- New Sample -------------------
thread.name: kubelet
process.executable.name: kubelet
process.executable.path: /var/lib/k0s/bin/kubelet
process.pid: 3849
thread.id: 3853
---------------------------------------------------
Instrumentation: kernel, Function: _raw_spin_unlock_irqrestore, File: , Line: 0, Column: 0
Instrumentation: kernel, Function: try_to_wake_up, File: , Line: 0, Column: 0
Instrumentation: kernel, Function: wake_up_q, File: , Line: 0, Column: 0
Instrumentation: kernel, Function: futex_wake, File: , Line: 0, Column: 0
Instrumentation: kernel, Function: do_futex, File: , Line: 0, Column: 0
Instrumentation: kernel, Function: __x64_sys_futex, File: , Line: 0, Column: 0
Instrumentation: kernel, Function: x64_sys_call, File: , Line: 0, Column: 0
Instrumentation: kernel, Function: do_syscall_64, File: , Line: 0, Column: 0
Instrumentation: kernel, Function: entry_SYSCALL_64_after_hwframe, File: , Line: 0, Column: 0
Instrumentation: go, Function: runtime.futex, File: runtime/sys_linux_amd64.s, Line: 558, Column: 0
Instrumentation: go, Function: runtime.futexwakeup, File: runtime/os_linux.go, Line: 88, Column: 0
Instrumentation: go, Function: runtime.notewakeup, File: runtime/lock_futex.go, Line: 33, Column: 0
Instrumentation: go, Function: runtime.startm, File: runtime/runtime1.go, Line: 614, Column: 0
Instrumentation: go, Function: runtime.wakep, File: runtime/runtime1.go, Line: 614, Column: 0
Instrumentation: go, Function: runtime.resetspinning, File: runtime/proc.go, Line: 3886, Column: 0
Instrumentation: go, Function: runtime.schedule, File: runtime/proc.go, Line: 4063, Column: 0
Instrumentation: go, Function: runtime.goexit0, File: runtime/proc.go, Line: 4314, Column: 0
Instrumentation: go, Function: runtime.mcall, File: runtime/asm_amd64.s, Line: 463, Column: 0
------------------- End New Sample -------------------
[... trimmed ...]
------------------- New Profile -------------------
Dropped attributes count 0
------------------- New Sample -------------------
thread.name: apache2
process.executable.name: apache2
process.executable.path: /usr/sbin/apache2
process.pid: 6681
thread.id: 6681
---------------------------------------------------
Instrumentation: native: Function: 0x4c1a42, File: libphp.so
Instrumentation: php, Function: traverseForVisitor, File: /var/www/html/vendor/twig/twig/src/NodeTraverser.php, Line: 74, Column: 0
Instrumentation: php, Function: traverseForVisitor, File: /var/www/html/vendor/twig/twig/src/NodeTraverser.php, Line: 65, Column: 0
Instrumentation: php, Function: traverseForVisitor, File: /var/www/html/vendor/twig/twig/src/NodeTraverser.php, Line: 65, Column: 0
Instrumentation: php, Function: traverseForVisitor, File: /var/www/html/vendor/twig/twig/src/NodeTraverser.php, Line: 65, Column: 0
Instrumentation: php, Function: traverseForVisitor, File: /var/www/html/vendor/twig/twig/src/NodeTraverser.php, Line: 65, Column: 0
Instrumentation: php, Function: traverseForVisitor, File: /var/www/html/vendor/twig/twig/src/NodeTraverser.php, Line: 65, Column: 0
Instrumentation: php, Function: traverseForVisitor, File: /var/www/html/vendor/twig/twig/src/NodeTraverser.php, Line: 65, Column: 0
Instrumentation: php, Function: traverseForVisitor, File: /var/www/html/vendor/twig/twig/src/NodeTraverser.php, Line: 65, Column: 0
Instrumentation: php, Function: traverse, File: /var/www/html/vendor/twig/twig/src/NodeTraverser.php, Line: 53, Column: 0
Instrumentation: php, Function: parse, File: /var/www/html/vendor/twig/twig/src/Parser.php, Line: 108, Column: 0
Instrumentation: php, Function: parse, File: /var/www/html/vendor/twig/twig/src/Environment.php, Line: 523, Column: 0
Instrumentation: php, Function: compileSource, File: /var/www/html/vendor/twig/twig/src/Environment.php, Line: 551, Column: 0
Instrumentation: php, Function: loadTemplate, File: /var/www/html/vendor/twig/twig/src/Environment.php, Line: 381, Column: 0
Instrumentation: php, Function: load, File: /var/www/html/vendor/twig/twig/src/Environment.php, Line: 343, Column: 0
Instrumentation: php, Function: load, File: /var/www/html/libraries/classes/Template.php, Line: 123, Column: 0
Instrumentation: php, Function: render, File: /var/www/html/libraries/classes/Template.php, Line: 156, Column: 0
Instrumentation: php, Function: render, File: /var/www/html/libraries/classes/Controllers/AbstractController.php, Line: 35, Column: 0
```
I'm not a k8s expert, so I'm not advising, just reporting my own experience.
When I set up kind, I think the trick around cgroup v2 that is also explained by Cilium did it. Not sure if this is/was the only thing needed to get it working with kind.
But as pointed out earlier and by kind itself:
> kind was primarily designed for testing Kubernetes itself [...]
To run and test applications, like the OTel eBPF profiler, in a k8s environment, I would recommend using minikube and avoiding kind.
For minikube I'm using --driver=kvm2.
Thanks for the links @florianl, I wasn't aware of the cgroup v2 trick! I do not understand how it addresses the PID namespace issue though. Anyway, I need to give it a shot!
Overall it might be a good idea to add some docs about this, since kind is relatively popular in the Kubernetes community.
We had similar issues in Inspektor Gadget when the tracer runs in a separate, sibling PID namespace to the workloads. It happens in Minikube or WSL, and we can't simply set hostPID: true in those setups. I'm describing below how we solved it; hope it helps for otel-ebpf-profiler.
Converting the tracer pid between ebpf (bpf_get_current_pid_tgid()) and userspace (os.Getpid())
I noticed the following in the code: https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/c473c0934511d570b492b34ca00d714ec6a6b7ec/tracer/systemconfig.go#L113
It will initialize the ebpf map with the pid of the tracer. Then, the ebpf code will compare it to bpf_get_current_pid_tgid:
https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/c473c0934511d570b492b34ca00d714ec6a6b7ec/support/ebpf/system_config.ebpf.c#L71
We resolved it by triggering the ebpf program via a Unix socket and fd passing (SCM_RIGHTS). Instead of comparing the pid between userspace and kernelspace (which we can't safely do because of pid namespaces), we compare the inode of the Unix socket.
- Userspace: https://github.com/inspektor-gadget/inspektor-gadget/blob/c7ccf393450ce2a2327cd6dd66d8549937636163/pkg/kfilefields/tracer.go#L144
- Kernelspace: https://github.com/inspektor-gadget/inspektor-gadget/blob/c7ccf393450ce2a2327cd6dd66d8549937636163/pkg/kfilefields/bpf/filefields.bpf.c#L42-L45
This allows us to execute the ebpf program only one time, in the context of the tracer.
Comparing the workload pid between ebpf (bpf_get_current_pid_tgid()) and userspace (/proc/$pid)
For converting the pids of the target workloads, we gather the pids for several layers of pid namespaces in ebpf. And then, we can match them in userspace with the number in /host/proc/$pid:
- Kernelspace: https://github.com/inspektor-gadget/inspektor-gadget/blob/c7ccf393450ce2a2327cd6dd66d8549937636163/include/gadget/user_stack_map.h#L147-L162
- Userspace: https://github.com/inspektor-gadget/inspektor-gadget/blob/c7ccf393450ce2a2327cd6dd66d8549937636163/pkg/symbolizer/symbolizer.go#L223-L226
So, even if IG and the target workloads are running in separate, sibling pid namespaces, we can still match the numbers, as long as IG has a /host/proc bind mount that contains the workload processes.
The bind mount does not have to be the top-most pid namespace (which we cannot do in Minikube or WSL). It just has to be high enough to contain the workloads we want to trace.
It does not iterate over PID namespace levels until it finds the correct PID (that would be unsafe if PID numbers collide between PID namespaces); instead it determines the correct level by checking the inode of the host PID namespace at /host/proc/1/ns/pid.