
Support for running in nested container setups (kind/minikube)

patrickpichler opened this issue 9 months ago · 16 comments

Hey,

I am trying to get the profiler working in a kind cluster. After searching through the docs (and GitHub issues) I didn't see anything about it being supported or not. The reason I am interested in kind/minikube support is that it makes it very easy to quickly try out setups in a local environment.

After playing around with it, I now understand it is not supported. The issue is pretty much the same as pointed out by kdrag0n in this comment: https://github.com/open-telemetry/opentelemetry-ebpf-profiler/issues/170#issuecomment-2771201052

Pyroscope fixed a similar issue not too long ago in https://github.com/grafana/pyroscope/pull/3008

I was looking through the code to see if there might be an easy fix to add support for this. From what I can tell, it should be possible to replace almost all invocations of collect_trace with a PID retrieved from the current process, adjusted for the PID namespace of interest (e.g. the one the kind cluster is running in).

The only issue from my PoV is the sched_process_free tracepoint (source), as the freed PID gets passed in via the arguments and, from what I can tell, the current PID cannot be inferred from the current task struct, as the invocation is delayed.
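
For the common case where the hook runs in the context of the sampled task, one possible building block (an illustration, not a proposed patch) is the bpf_get_ns_current_pid_tgid() helper (Linux >= 5.7), which resolves the current task's PID relative to a PID namespace identified by the device/inode pair of a /proc/<pid>/ns/pid file. A minimal Go sketch of gathering that pair in userspace:

    // A sketch only: collect the (dev, inode) pair that identifies a PID
    // namespace. The eBPF helper bpf_get_ns_current_pid_tgid() accepts this
    // pair and returns the current task's PID as seen from that namespace.
    package main

    import (
    	"fmt"
    	"syscall"
    )

    func pidNSIdent(path string) (dev, ino uint64, err error) {
    	var st syscall.Stat_t
    	if err = syscall.Stat(path, &st); err != nil {
    		return 0, 0, err
    	}
    	return uint64(st.Dev), st.Ino, nil
    }

    func main() {
    	// Inside the kind node, /proc/self/ns/pid identifies the PID
    	// namespace that the profiler's /proc view corresponds to.
    	dev, ino, err := pidNSIdent("/proc/self/ns/pid")
    	if err != nil {
    		panic(err)
    	}
    	// These two values would be handed to the eBPF side (e.g. via a map)
    	// so it can call bpf_get_ns_current_pid_tgid(dev, ino, ...).
    	fmt.Printf("pidns dev=%d ino=%d\n", dev, ino)
    }

This would not help for sched_process_free, though, where the PID comes from the tracepoint arguments rather than the current task.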

patrickpichler avatar Jul 16 '25 09:07 patrickpichler

I agree. This may be useful for testing and adoption. It should be doable, but optional, disabled by default, and dead-code-eliminated by the kernel JIT when disabled. I could draft a PR if we agree this is something that could be merged.
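
For illustration, a minimal sketch of how such a toggle is commonly wired up with cilium/ebpf (which the profiler loads its programs with); the object file and flag name here are hypothetical. A `volatile const` in the eBPF C source is rewritten before load, so the disabled branch is removed at load time:

    // A sketch, with hypothetical object-file and flag names: disable an
    // optional feature via a `volatile const` so the guarded branch is
    // removed before the program is JITed.
    package main

    import (
    	"log"

    	"github.com/cilium/ebpf"
    )

    func main() {
    	spec, err := ebpf.LoadCollectionSpec("tracer.ebpf.o")
    	if err != nil {
    		log.Fatal(err)
    	}
    	// Matches `volatile const bool enable_pidns_translation` in the C
    	// code; with a known false value the verifier prunes the branch.
    	if err := spec.RewriteConstants(map[string]interface{}{
    		"enable_pidns_translation": false,
    	}); err != nil {
    		log.Fatal(err)
    	}
    	coll, err := ebpf.NewCollection(spec)
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer coll.Close()
    }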

korniltsev avatar Jul 16 '25 09:07 korniltsev

@patrickpichler can you share the k8s manifest that you use to deploy the eBPF profiler? E.g. is hostPID set, which capabilities do you use, and which volumes do you mount to allow proper access?

Here is a snippet from a k8s manifest that I use and which does not cause such issues:

     spec:
       hostPID: true
       containers:
         - name: otelcol-profiler
           image: <myCustomImage>
           securityContext:
             procMount: "Unmasked"
             privileged: true
             capabilities:
               add:
                 - SYS_ADMIN
           resources:
             limits:
               memory: 500Mi
               cpu: "2"
             requests:
               memory: 500Mi
               cpu: "2"
           env:
             - name: KUBERNETES_NODE_NAME
               valueFrom:
                 fieldRef:
                   fieldPath: spec.nodeName
           volumeMounts:
             - name: debugfs
               mountPath: /sys/kernel/debug
       serviceAccountName: otelcol-profiler
       volumes:
         - name: debugfs
           hostPath:
             path: /sys/kernel/debug
             type: Directory

florianl avatar Jul 16 '25 10:07 florianl

@florianl do you use it in a kind cluster?

korniltsev avatar Jul 16 '25 10:07 korniltsev

do you use it in a kind cluster?

Not only in a kind cluster, but also in a kind cluster. But things might be different for people using a non-Linux OS.

florianl avatar Jul 16 '25 10:07 florianl

Nice! Last time I tried (which was a while ago) it did not work for me, but I probably did not have procMount: "Unmasked", which may be the trick required.

korniltsev avatar Jul 16 '25 10:07 korniltsev

Thanks for the swift response!

I did try to run it using a (slightly modified, but in no way that should impact this) version of Grafana's: https://github.com/grafana/pyroscope/blob/main/examples/grafana-alloy-auto-instrumentation/ebpf-otel/kubernetes/profiler.yaml

What version of kind are you using? For Kubernetes v1.33.1, setting procMount: Unmasked only works in combination with enabling user namespaces (which in turn doesn't allow hostPID: true). I think this behavior changed with v1.33, as KEP-4265 was enabled.

I now also spun up a cluster using v1.30.13. There procMount: Unmasked works without hostUsers: false, but I do not get any data for running services (besides one containerd container; my best guess is that, for whatever reason, its PID matches something on the host).

I should have some time later in the day, I could try creating a playground environment in a GH action.

patrickpichler avatar Jul 16 '25 11:07 patrickpichler

OK, I have now set up a test environment in GH Actions to quickly test things in the following repo: https://github.com/patrickpichler/otel-ebpf-profiler-kind-playground/

To keep things simpler, I decided to add the opentelemetry-ebpf-profiler repo as a git submodule. The image used in the test is built from the profiler.Dockerfile (it is pretty much the same Dockerfile as in the profiler, plus two additional stages for building and the resulting image; it is also designed to easily build for multiple architectures, since I locally run ARM64).

I have a few actions in there:

All three create a cluster and deploy an OTEL Collector + the eBPF profiler, and to generate a bit of usage, stress is also deployed (you can find all the manifests here). Since I didn't get procMount: "Unmasked" running with Kubernetes v1.33, only the Kind (v1.30.13) config uses it (I decided to go with kustomize for this, so you can find the overlay here).

The OTEL Collector is configured with a file exporter, which writes the result to a hostPath. All of this then runs for 2 minutes, after which the whole namespace gets deleted and the OTEL export file is collected and uploaded as an artifact of the run.

Looking at the raw data, I can see that only the Minikube (driver=none) action collected profiles for the stress container.

If anyone knows a nice tool to transform such OTEL resourceProfile traces into e.g. a flamegraph, I would be more than happy to hear about it 😅

@florianl can you maybe have a look at the setup? I might be doing something wrong here with the way I deploy the eBPF profiler.

patrickpichler avatar Jul 16 '25 17:07 patrickpichler

Just a quick look and no thorough investigation:

  • For a timeframe of 120 seconds, all your actions produce only ~15 reports in their respective otel-event-data. Maybe start collecting the logs of the eBPF profiler as well.
  • I would recommend looking into k8sattributes to enrich resource profiles:
  k8sattributes:
    auth_type: "serviceAccount"
    passthrough: false
    filter:
      node_from_env_var: KUBERNETES_NODE_NAME
    extract:
      metadata:
        - k8s.pod.name
        - k8s.pod.uid
        - k8s.deployment.name
        - k8s.namespace.name
        - service.namespace
        - service.name
        - service.version
        - service.instance.id
      labels:
        - tag_name: app.label.component
          key: app.kubernetes.io/component
          from: pod
      otel_annotations: true 
    pod_association:
      - sources:
          - from: resource_attribute
            name: container.id

florianl avatar Jul 17 '25 06:07 florianl

I now added the profiler logs to the archive (and also print them in the action; the step is called Profiler logs). From a quick glimpse (I still need to spend more time to properly understand the full log output; it is quite a lot), all of the containerized installations report a lot of log lines like:

Skip process exit handling for unknown PID 3757

I also added the k8sattributes processor, and from the experiments it seems that only the [Minikube (driver=none)](https://github.com/patrickpichler/otel-ebpf-profiler-kind-playground/actions/runs/16348474941) run got any enriched details. The others do not even contain a single k8s.namespace.name attribute.

patrickpichler avatar Jul 17 '25 15:07 patrickpichler

Make sure that the profiler container can access the host PID namespace; in K8s you typically have to use the hostPID configuration option. If that's not possible, then the agent will not "see" the vast majority of processes running on the node.

christos68k avatar Jul 17 '25 15:07 christos68k

The profiler runs with hostPID: true (see here).

The problem is more that hostPID in the case here is not the root PID namespace of the host, but instead the PID namespace of the kind-control-plane container.

This is also what I suspect the issue is. PIDs in the eBPF code are determined using bpf_get_current_pid_tgid, which returns the PID from the view of the kernel (equivalent to the root PID namespace), whereas /proc in the profiler container (even with hostPID: true) sees it from the kind-control-plane perspective. Hence a translation like the one done in Pyroscope is needed.
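
For reference, Linux >= 4.1 exposes this mapping in userspace via the NSpid field of /proc/<pid>/status, which lists a process's PID in every PID namespace it is a member of, outermost (relative to the procfs mount) first. A minimal Go sketch of reading it (illustrative only, not the Pyroscope implementation):

    // A sketch only: read the NSpid line for a process. pids[0] is the PID
    // in the namespace the procfs instance was mounted from; the last entry
    // is the PID in the process's innermost namespace.
    package main

    import (
    	"bufio"
    	"fmt"
    	"os"
    	"strconv"
    	"strings"
    )

    func nsPIDs(procRoot string, pid int) ([]int, error) {
    	f, err := os.Open(fmt.Sprintf("%s/%d/status", procRoot, pid))
    	if err != nil {
    		return nil, err
    	}
    	defer f.Close()

    	sc := bufio.NewScanner(f)
    	for sc.Scan() {
    		line := sc.Text()
    		if !strings.HasPrefix(line, "NSpid:") {
    			continue
    		}
    		var pids []int
    		for _, field := range strings.Fields(strings.TrimPrefix(line, "NSpid:")) {
    			n, err := strconv.Atoi(field)
    			if err != nil {
    				return nil, err
    			}
    			pids = append(pids, n)
    		}
    		return pids, nil
    	}
    	return nil, fmt.Errorf("no NSpid line for pid %d", pid)
    }

    func main() {
    	pids, err := nsPIDs("/proc", os.Getpid())
    	if err != nil {
    		panic(err)
    	}
    	fmt.Println(pids)
    }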

Edit: I had a quick look at how hostPID: true is implemented in Kubernetes.

The flag gets translated to runtimeapi.NamespaceMode_NODE (link). The CRI source has a nice comment on top of the constant (link):

// A NODE namespace is the namespace of the Kubernetes node.
// For example, a container with a PID namespace of NODE expects to view
// all of the processes on the host running the kubelet.

Tracing it on containerd's side was a bit more painful. The way I understand it, when NamespaceMode_NODE is set, the PID namespace is filtered out of the namespaces passed on to the low-level container runtime (link). From there, when the container is started, the unix.CLONE_NEWPID flag is missing, so no new PID namespace is created and the container effectively runs in the host's (link and link).

It is also interesting to see how procMount: Unmasked is implemented. All it effectively does is not send the list of predefined masked paths (link), which would be sent in default mode, down to the low-level runtime, which would otherwise bind mount those paths read-only at the same location (link).

patrickpichler avatar Jul 17 '25 16:07 patrickpichler

You're right, I was confused and didn't realize you were referring to this.

Looks like others have run into the same issue (https://github.com/kubernetes-sigs/kind/issues/3182) and upstream has no plans to address it. @florianl did you verify that it works with kind?

christos68k avatar Jul 17 '25 16:07 christos68k

Without modifying the profiler's code we'll have a bunch of other issues with kind, even if we use specific K8s images with kind or somehow manage to set procMount in the pod spec. In the meantime I tried k0s and it worked pretty well:

  1. Create a k0s cluster (https://k0sproject.io/):
    $ curl -sSf https://get.k0s.sh | sudo sh
    $ sudo k0s install controller --single
    $ sudo k0s start
    $ sudo k0s kubectl get nodes
    
    $ sudo k0s kubeconfig admin >> ~/.kube/config
    
  2. Deploy the OpenTelemetry Collector eBPF Profiling Distribution as a node agent (DaemonSet) or use your own deployment descriptor:
    $ k apply -f https://raw.githubusercontent.com/danielpacak/opentelemetry-collector-ebpf-profiler/refs/heads/main/example/kubernetes/node-agent.yaml
    
  3. Deploy sample app (e.g. phpMyAdmin):
    $ k apply -f https://raw.githubusercontent.com/danielpacak/vulnerable-kubernetes-deployments/refs/heads/main/php/phpmyadmin/all.yaml
    
  4. Check the agent's logs (my distro has a custom profiles exporter and prints stack frames to stdout):
    $ k logs -n node-agent collector-ebpf-profiler-v7xlg
    2025-07-17T18:41:45.804Z	info	[email protected]/service.go:197	Setting up own telemetry...	{"resource": {"service.instance.id": "74234606-eb68-4767-9fd3-91435a9113a1", "service.name": "otelcol-ebpf-profiler", "service.version": "0.129.0"}}
    2025-07-17T18:41:45.805Z	info	builders/builders.go:26	Development component. May change in the future.	{"resource": {"service.instance.id": "74234606-eb68-4767-9fd3-91435a9113a1", "service.name": "otelcol-ebpf-profiler", "service.version": "0.129.0"}, "otelcol.component.id": "customprofilesexporter", "otelcol.component.kind": "exporter", "otelcol.signal": "profiles"}
    2025-07-17T18:41:45.902Z	info	[email protected]/service.go:257	Starting otelcol-ebpf-profiler...	{"resource": {"service.instance.id": "74234606-eb68-4767-9fd3-91435a9113a1", "service.name": "otelcol-ebpf-profiler", "service.version": "0.129.0"}, "Version": "0.129.0", "NumCPU": 4}
    2025-07-17T18:41:45.902Z	info	extensions/extensions.go:41	Starting extensions...	{"resource": {"service.instance.id": "74234606-eb68-4767-9fd3-91435a9113a1", "service.name": "otelcol-ebpf-profiler", "service.version": "0.129.0"}}
    2025-07-17T18:41:45.902Z	info	[email protected]/exporter.go:25	Starting custom profiles exporter...	{"resource": {"service.instance.id": "74234606-eb68-4767-9fd3-91435a9113a1", "service.name": "otelcol-ebpf-profiler", "service.version": "0.129.0"}, "otelcol.component.id": "customprofilesexporter", "otelcol.component.kind": "exporter", "otelcol.signal": "profiles"}
    time="2025-07-17T18:41:45Z" level=info msg="Interpreter tracers: perl,php,python,hotspot,ruby,v8,dotnet,go,labels"
    time="2025-07-17T18:41:51Z" level=info msg="Found offsets: task stack 0x20, pt_regs 0x3f58, tpbase 0x1528"
    time="2025-07-17T18:41:51Z" level=info msg="Supports generic eBPF map batch operations"
    time="2025-07-17T18:41:51Z" level=info msg="Supports LPM trie eBPF map batch operations"
    time="2025-07-17T18:41:51Z" level=info msg="eBPF tracer loaded"
    [... trimmed ...]
    ------------------- New Sample -------------------
      thread.name: kubelet
      process.executable.name: kubelet
      process.executable.path: /var/lib/k0s/bin/kubelet
      process.pid: 3849
      thread.id: 3853
    ---------------------------------------------------
    Instrumentation: kernel, Function: _raw_spin_unlock_irqrestore, File: , Line: 0, Column: 0
    Instrumentation: kernel, Function: try_to_wake_up, File: , Line: 0, Column: 0
    Instrumentation: kernel, Function: wake_up_q, File: , Line: 0, Column: 0
    Instrumentation: kernel, Function: futex_wake, File: , Line: 0, Column: 0
    Instrumentation: kernel, Function: do_futex, File: , Line: 0, Column: 0
    Instrumentation: kernel, Function: __x64_sys_futex, File: , Line: 0, Column: 0
    Instrumentation: kernel, Function: x64_sys_call, File: , Line: 0, Column: 0
    Instrumentation: kernel, Function: do_syscall_64, File: , Line: 0, Column: 0
    Instrumentation: kernel, Function: entry_SYSCALL_64_after_hwframe, File: , Line: 0, Column: 0
    Instrumentation: go, Function: runtime.futex, File: runtime/sys_linux_amd64.s, Line: 558, Column: 0
    Instrumentation: go, Function: runtime.futexwakeup, File: runtime/os_linux.go, Line: 88, Column: 0
    Instrumentation: go, Function: runtime.notewakeup, File: runtime/lock_futex.go, Line: 33, Column: 0
    Instrumentation: go, Function: runtime.startm, File: runtime/runtime1.go, Line: 614, Column: 0
    Instrumentation: go, Function: runtime.wakep, File: runtime/runtime1.go, Line: 614, Column: 0
    Instrumentation: go, Function: runtime.resetspinning, File: runtime/proc.go, Line: 3886, Column: 0
    Instrumentation: go, Function: runtime.schedule, File: runtime/proc.go, Line: 4063, Column: 0
    Instrumentation: go, Function: runtime.goexit0, File: runtime/proc.go, Line: 4314, Column: 0
    Instrumentation: go, Function: runtime.mcall, File: runtime/asm_amd64.s, Line: 463, Column: 0
    ------------------- End New Sample -------------------
    [... trimmed ...]
    ------------------- New Profile -------------------
    Dropped attributes count 0
    ------------------- New Sample -------------------
      thread.name: apache2
      process.executable.name: apache2
      process.executable.path: /usr/sbin/apache2
      process.pid: 6681
      thread.id: 6681
    ---------------------------------------------------
    Instrumentation: native: Function: 0x4c1a42, File: libphp.so
    Instrumentation: php, Function: traverseForVisitor, File: /var/www/html/vendor/twig/twig/src/NodeTraverser.php, Line: 74, Column: 0
    Instrumentation: php, Function: traverseForVisitor, File: /var/www/html/vendor/twig/twig/src/NodeTraverser.php, Line: 65, Column: 0
    Instrumentation: php, Function: traverseForVisitor, File: /var/www/html/vendor/twig/twig/src/NodeTraverser.php, Line: 65, Column: 0
    Instrumentation: php, Function: traverseForVisitor, File: /var/www/html/vendor/twig/twig/src/NodeTraverser.php, Line: 65, Column: 0
    Instrumentation: php, Function: traverseForVisitor, File: /var/www/html/vendor/twig/twig/src/NodeTraverser.php, Line: 65, Column: 0
    Instrumentation: php, Function: traverseForVisitor, File: /var/www/html/vendor/twig/twig/src/NodeTraverser.php, Line: 65, Column: 0
    Instrumentation: php, Function: traverseForVisitor, File: /var/www/html/vendor/twig/twig/src/NodeTraverser.php, Line: 65, Column: 0
    Instrumentation: php, Function: traverseForVisitor, File: /var/www/html/vendor/twig/twig/src/NodeTraverser.php, Line: 65, Column: 0
    Instrumentation: php, Function: traverse, File: /var/www/html/vendor/twig/twig/src/NodeTraverser.php, Line: 53, Column: 0
    Instrumentation: php, Function: parse, File: /var/www/html/vendor/twig/twig/src/Parser.php, Line: 108, Column: 0
    Instrumentation: php, Function: parse, File: /var/www/html/vendor/twig/twig/src/Environment.php, Line: 523, Column: 0
    Instrumentation: php, Function: compileSource, File: /var/www/html/vendor/twig/twig/src/Environment.php, Line: 551, Column: 0
    Instrumentation: php, Function: loadTemplate, File: /var/www/html/vendor/twig/twig/src/Environment.php, Line: 381, Column: 0
    Instrumentation: php, Function: load, File: /var/www/html/vendor/twig/twig/src/Environment.php, Line: 343, Column: 0
    Instrumentation: php, Function: load, File: /var/www/html/libraries/classes/Template.php, Line: 123, Column: 0
    Instrumentation: php, Function: render, File: /var/www/html/libraries/classes/Template.php, Line: 156, Column: 0
    Instrumentation: php, Function: render, File: /var/www/html/libraries/classes/Controllers/AbstractController.php, Line: 35, Column: 0
    

danielpacak avatar Jul 17 '25 19:07 danielpacak

I'm not a k8s expert, so I'm not advising, just reporting my own experience.

When I set up kind, I think I applied the cgroup v2 trick that is also explained by Cilium. I am not sure if this is/was the only thing needed to get it working with kind.

But as pointed out earlier and by kind itself:

kind was primarily designed for testing Kubernetes itself [...]

To run and test applications, like the OTel eBPF profiler, in a k8s environment, I would recommend using minikube and avoiding kind.

For minikube I'm using --driver=kvm2.

florianl avatar Jul 18 '25 09:07 florianl

Thanks for the links @florianl, I wasn't aware of the cgroup v2 trick! I do not understand how it addresses the PID namespace issue, though. Anyway, I need to give it a shot!

Overall it might be a good idea to add some docs about this, since kind is relatively popular in the Kubernetes community.

patrickpichler avatar Jul 21 '25 05:07 patrickpichler

We had similar issues in Inspektor Gadget when the tracer runs in a separate, sibling PID namespace to the workloads. This happens in Minikube or WSL, and we can't simply set hostPID: true in those setups. I'm describing below how we solved it; I hope it helps for otel-ebpf-profiler.

Converting the tracer pid between ebpf (bpf_get_current_pid_tgid()) and userspace (os.Getpid())

I noticed the following in the code: https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/c473c0934511d570b492b34ca00d714ec6a6b7ec/tracer/systemconfig.go#L113

It will initialize the ebpf map with the pid of the tracer. Then, the ebpf code will compare it to bpf_get_current_pid_tgid: https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/c473c0934511d570b492b34ca00d714ec6a6b7ec/support/ebpf/system_config.ebpf.c#L71

We resolved it by triggering the ebpf program via a Unix socket and fd passing (SCM_RIGHTS). Instead of comparing the pid between userspace and kernelspace (which we can't safely do because of pid namespaces), we compare the inode of the Unix socket.

  • Userspace: https://github.com/inspektor-gadget/inspektor-gadget/blob/c7ccf393450ce2a2327cd6dd66d8549937636163/pkg/kfilefields/tracer.go#L144
  • Kernelspace: https://github.com/inspektor-gadget/inspektor-gadget/blob/c7ccf393450ce2a2327cd6dd66d8549937636163/pkg/kfilefields/bpf/filefields.bpf.c#L42-L45

This allows us to execute the ebpf program only one time, in the context of the tracer.
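
For illustration, a minimal Go sketch (hypothetical, not Inspektor Gadget's actual code) of obtaining the inode of a Unix socket, which is what makes the comparison namespace-independent:

    // A sketch only: the inode of a Unix socket does not depend on any PID
    // namespace, so userspace and the eBPF program can agree on it even when
    // they disagree about PID numbers.
    package main

    import (
    	"fmt"

    	"golang.org/x/sys/unix"
    )

    func main() {
    	// One end is used to trigger the probed code path; its inode
    	// identifies it unambiguously.
    	fds, err := unix.Socketpair(unix.AF_UNIX, unix.SOCK_STREAM, 0)
    	if err != nil {
    		panic(err)
    	}
    	defer unix.Close(fds[0])
    	defer unix.Close(fds[1])

    	var st unix.Stat_t
    	if err := unix.Fstat(fds[0], &st); err != nil {
    		panic(err)
    	}
    	// The kernel side can read the socket file's inode number and compare
    	// it against this value instead of comparing PIDs.
    	fmt.Printf("socket inode: %d\n", st.Ino)
    }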

Comparing the workload pid between ebpf (bpf_get_current_pid_tgid()) and userspace (/proc/$pid)

For converting the pids of the target workloads, we gather the pids for several layers of pid namespaces in ebpf. And then, we can match them in userspace with the number in /host/proc/$pid:

  • Kernelspace: https://github.com/inspektor-gadget/inspektor-gadget/blob/c7ccf393450ce2a2327cd6dd66d8549937636163/include/gadget/user_stack_map.h#L147-L162
  • Userspace: https://github.com/inspektor-gadget/inspektor-gadget/blob/c7ccf393450ce2a2327cd6dd66d8549937636163/pkg/symbolizer/symbolizer.go#L223-L226

So, even if IG and the target workloads are running in separate, sibling pid namespaces, we can still match the numbers, as long as IG has a /host/proc bind mount that contains the workload processes.

The bind-mounted procfs does not have to come from the top-most PID namespace (which we cannot get in Minikube or WSL). It just has to be high enough to contain the workloads we want to trace.

It does not iterate over PID namespace levels until it finds the correct PID (that would be unsafe if PID numbers collide between PID namespaces); instead, it determines the correct level by checking the inode of the host PID namespace at /host/proc/1/ns/pid.
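
A minimal Go sketch of that check (assuming the /host/proc bind mount described above):

    // A sketch only: the ns symlinks read as "pid:[<inode>]", so comparing
    // them identifies the namespace level that /host/proc belongs to without
    // guessing by PID number.
    package main

    import (
    	"fmt"
    	"os"
    )

    func main() {
    	hostNS, err := os.Readlink("/host/proc/1/ns/pid")
    	if err != nil {
    		panic(err)
    	}
    	ownNS, err := os.Readlink("/proc/self/ns/pid")
    	if err != nil {
    		panic(err)
    	}
    	fmt.Printf("host pidns %s, own pidns %s, same: %v\n",
    		hostNS, ownNS, hostNS == ownNS)
    }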

alban avatar Sep 02 '25 13:09 alban