only kernel stack frames reported inside orbstack docker container on macOS
I'm using OrbStack on an arm64 Mac to profile a Linux app.
perf works as expected and shows me native symbols from user-land activity,
but in devfiler I'm not seeing much come through. It seems connected and there are no errors, but there is very little data. I see some kernel symbols.
How can I debug this further?
but there is very little data
Is the samples timeline and/or the flamegraph empty or are you missing symbols?
If you see frames from your app without further information (symbols missing):
- make sure your app is built with debug symbols
- drag & drop your app into the devfiler window to extract symbols
If that doesn't help, you can enable the "dev" mode by double-clicking the icon to the left of the "devfiler" menu entry. You will then see some more menu items. Check whether you get gRPC messages and whether the DB stats show entries for TraceEvents, Stacktraces, etc., and let us know what you see.
The flamegraph is empty:
I'm not seeing much data in the dev menus:
Tried with a UTM.app VM and got the same thing. Then I tried the old devfiler 0.6.0 and it started showing a lot more data! It still has those build-id errors, so I will downgrade to 7d2285e14767c7abf4cdbe0927bf7857d1037076 for now.
EDIT: the old builds work in the VM, but still not inside OrbStack. Will try inside Docker in the VM next.
In UTM it works on the host:
Linux ubuntu 6.2.0-39-generic #40-Ubuntu SMP PREEMPT_DYNAMIC Tue Nov 14 23:07:44 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
but inside Docker I get:
ERRO[0000] Failed to load eBPF tracer: failed to load eBPF code: your kernel version 6.2.0 is affected by a Linux kernel bug that can lead to system freezes, terminating host agent now to avoid triggering this bug
Inside OrbStack the kernel is newer:
Linux orbstack 6.10.11-orbstack-00280-g1304bd068592 #21 SMP Sat Sep 21 10:45:28 UTC 2024 aarch64 GNU/Linux
The issue seems specific to running within a container. Only kernel activity is shown. Reproduced via docker exec on Ubuntu VM.
Just to clarify, the profiler is a system-wide profiler, so it requires root privileges. Can you confirm that you run the docker container with --privileged?
Ideally, try running the docker container with something like
docker run --privileged --pid=host -v /etc/machine-id:/etc/machine-id:ro \
-v /var/run/docker.sock:/var/run/docker.sock -v /sys/kernel/debug:/sys/kernel/debug:ro ...
Yes, I was using --net=host --privileged=true, but I will try those other options.
Thanks, with those options it works in Docker on Ubuntu. I see all the information from the host too, which isn't ideal, but better than nothing.
In OrbStack there's still only kernel data, but I assume that's something specific to that environment.
Thanks, with those options it works in Docker on Ubuntu.
Thanks for testing.
I see all the information from the host too, which isn't ideal, but better than nothing.
That's exactly what the profiler has been designed for: getting all information from the host while doing continuous profiling. Filtering is assumed to be done on the backend or by the user interface.
But if you think that limiting the view/collection of the profiler is a realistic use case, please open a separate issue with your ideas for discussion.
In OrbStack there's still only kernel data, but I assume that's something specific to that environment.
Maybe someone working on macOS can chime in here.
Could you please reword the GH issue title and possibly provide profiler logs when starting with -v and/or -bpf-log-level 2?
I'm hitting a similar issue with a container running on K8s: there are only kernel stack frames. The base image of the container is Ubuntu. The agent works in a VM environment; in the VM, I am able to see other frames, like Java or Python.
I have granted the container privileged access.
securityContext:
  allowPrivilegeEscalation: true
  capabilities:
    add:
      - CAP_SYS_ADMIN
  privileged: true
One thing to note: when the container is up and ebpf-profiler is run for the first time, it fails with the error below
ERRO[0000] Failed to probe tracepoint: failed to get id for tracepoint: failed to read tracepoint ID for sys_enter_mmap: open /sys/kernel/debug/tracing/events/syscalls/sys_enter_mmap/id: no such file or directory
I fixed it by mounting debugfs and tracefs:
sudo mount -t debugfs none /sys/kernel/debug
sudo mount -t tracefs none /sys/kernel/debug/tracing
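For context, the probe that fails above is essentially a read of the tracepoint ID file from tracefs, which is why the mounts fix it. A minimal Go sketch of that kind of lookup (just an illustration under my own assumptions, not the profiler's actual code):

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readTracepointID reads the numeric ID of a tracepoint from tracefs,
// e.g. group="syscalls", name="sys_enter_mmap". It fails with
// "no such file or directory" when debugfs/tracefs is not mounted
// inside the container.
func readTracepointID(group, name string) (int, error) {
	path := fmt.Sprintf("/sys/kernel/debug/tracing/events/%s/%s/id", group, name)
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.Atoi(strings.TrimSpace(string(data)))
}

func main() {
	id, err := readTracepointID("syscalls", "sys_enter_mmap")
	if err != nil {
		fmt.Println("tracefs not available:", err)
		return
	}
	fmt.Println("sys_enter_mmap tracepoint id:", id)
}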
I attach the log (run with option -v):
ebpf-profiler.log
@leonard520 Regarding the "Failed to probe tracepoint": Can you please update to the latest ebpf-profiler? The tracepoint check has since been dropped.
@tmm1 I just tried the latest main branch. It still has the error "Failed to probe tracepoint". Does it come from here
Sorry, my fault. The code change I was referring to is still in work: https://github.com/open-telemetry/opentelemetry-ebpf-profiler/pull/175
@rockdaboot Do you have any clue why there are only kernel stack frames in my container environment? Feel free to let me know if you want me to try something.
@leonard520 I assume that for some reason the unwinder runs into an error. Ideally, we could reproduce this somehow on amd64 (can you?). @fabled, maybe you can have a look at the above ebpf-profiler.log - I don't find anything in there that helps.
@leonard520 I assume that you use devfiler for visualization. Can you run the profiler with -send-error-frames? I assume that you see the error frames in red directly under the root frame in the flamegraph. It tells you why the unwinding failed, hopefully that is a hint.
@rockdaboot Today I reproduced the issue again over a longer period. I notice there are a lot of log lines like the one below:
DEBU[0006] Failed to get a cgroupv2 ID as container ID for PID 1006390: open /proc/1006390/cgroup: no such file or directory
I ran ps both in the container and on the host worker node. The PID appears to be a host worker node PID rather than one from the container itself; as a result, the directory /proc/1006390/cgroup only exists on the node. I am wondering if this is a problem.
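For illustration, the lookup that produces this debug message boils down to reading /proc/<pid>/cgroup and extracting a container ID from the cgroup path, so the error just means that PID's /proc entry is not visible from where the profiler reads it. A rough Go sketch of the idea (my own simplification, not the profiler's actual implementation):

package main

import (
	"fmt"
	"os"
	"regexp"
)

// A 64-hex-digit token in the cgroup path is typically the container ID
// (assumption: cgroup v2 with a single "0::/<path>" line).
var containerIDRe = regexp.MustCompile("[0-9a-f]{64}")

// containerIDForPID reads /proc/<pid>/cgroup and extracts a container ID.
// It fails with "no such file or directory" if <pid> is not present in the
// /proc instance the profiler sees - exactly the error in the log above.
func containerIDForPID(pid int) (string, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/cgroup", pid))
	if err != nil {
		return "", err
	}
	id := containerIDRe.Find(data)
	if id == nil {
		return "", fmt.Errorf("no container ID in cgroup path")
	}
	return string(id), nil
}

func main() {
	fmt.Println(containerIDForPID(os.Getpid()))
}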
I am wondering if this is a problem.
No, this is not a problem. I made a test where I changed the code to look into cgroupxxx here, to trigger exactly this error every time. I still see all kinds of frames.
@rockdaboot Thanks for the verification. I did another test in the container and recorded the following. I am wondering how the profiler lists all the PIDs to trace (see the enumeration sketch after the log below).
- Check the PID for Java process in container
ps aux | grep java
root **289** 9.5 0.1 2598068 101724 pts/2 Sl+ 03:13 0:03 java -Dserver.port=8888 -jar demo.jar
root 313 0.0 0.0 6508 2244 pts/3 S+ 03:13 0:00 grep --color=auto java
- Check the PID for Java process in node
root **1888084** 8.9 0.2 3596600 134912 ? Sl+ 03:13 0:05 java -Dserver.port=8888 -jar demo.jar
- Check PID information in the log. I am not able to find PID 289, only 1888084; however, for 1888084 I found the messages below. It looks to me like the PID can't be parsed.
DEBU[0056] => PID: 1888084
DEBU[0056] = PID: 1888084
DEBU[0056] - PID: 1888084
DEBU[0056] Skip process exit handling for unknown PID 1888084
DEBU[0057] => PID: 1888084
DEBU[0057] = PID: 1888084
DEBU[0057] - PID: 1888084
DEBU[0057] Skip process exit handling for unknown PID 1888084
DEBU[0058] Failed to get a cgroupv2 ID as container ID for PID 1888084: open /proc/1888084/cgroup: no such file or directory
DEBU[0058] => PID: 1888084
DEBU[0058] = PID: 1888084
DEBU[0058] - PID: 1888084
DEBU[0058] Skip process exit handling for unknown PID 1888084
Attaching the full log for reference: ebpf-bad.log
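For background (not necessarily how this profiler does it internally), system-wide profilers typically discover processes by scanning /proc for numeric entries, in addition to reacting to kernel events. A minimal Go sketch of such a scan, which also shows why an agent without the host's /proc only sees its own PID namespace:

package main

import (
	"fmt"
	"os"
	"strconv"
)

// listPIDs returns the PIDs of all processes visible in the mounted /proc.
// Inside a container without hostPID / the host's /proc, this only covers
// the container's own PID namespace, so a host PID like 1888084 would not
// be resolvable there.
func listPIDs() ([]int, error) {
	entries, err := os.ReadDir("/proc")
	if err != nil {
		return nil, err
	}
	var pids []int
	for _, e := range entries {
		if !e.IsDir() {
			continue
		}
		if pid, err := strconv.Atoi(e.Name()); err == nil {
			pids = append(pids, pid)
		}
	}
	return pids, nil
}

func main() {
	pids, err := listPIDs()
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("visible PIDs:", len(pids))
}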
@leonard520 If I understand correctly you're running the profiler in Kubernetes, right?
In that case, you will need to ensure the following is set:
hostPID: true # Setting hostPID to true on the Pod so that the PID namespace is that of the host
containers:
  ...
  securityContext:
    runAsUser: 0
    privileged: true # Running in privileged mode
    procMount: Unmasked # Setting procMount to Unmasked
    capabilities:
      add:
        - SYS_ADMIN # Adding SYS_ADMIN capability
Specifically, setting hostPID: true and procMount: Unmasked should ensure that the PIDs align between the container and the host.
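As a quick sanity check (a rough sketch of mine, not part of the profiler), you can verify from inside the container whether it actually shares the host's PID namespace by looking at what PID 1 is:

package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// With hostPID: true, PID 1 inside the container is the host's init
	// (usually "systemd"); without it, PID 1 is the container's entrypoint.
	comm, err := os.ReadFile("/proc/1/comm")
	if err != nil {
		fmt.Println("cannot read /proc/1/comm:", err)
		return
	}
	fmt.Println("PID 1 is:", strings.TrimSpace(string(comm)))
}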
@Gandem I think your answer clarified my confusion. Thank you very much. After trying to add your spec to my pod, I encountered this error.
INFO[0070] eBPF tracer loaded
ERRO[0080] Failed to handle mapping for PID 21720, file /pause: failed to extract interval data: failed to extract stack deltas from /pause: failure to parse golang stack deltas: failed to load .gopclntab section: EOF
INFO[0090] Attached tracer program
INFO[0090] Attached sched monitor
ERRO[0092] Request failed: rpc error: code = InvalidArgument desc = mapping is missing attributes
ERRO[0096] Request failed: rpc error: code = InvalidArgument desc = mapping is missing attributes
I am wondering if it is related to the pause container being written in Go; I will take a further look.
On the other hand, I’m also considering whether this approach has security risks. Sharing the same PID namespace with the host reduces isolation and increases the potential for container escape.
In OrbStack there's still only kernel data, but I assume that's something specific to that environment.
I hit the same issue with OrbStack. I suspect the issue is caused by the fact that:
OrbStack runs full-blown Linux machines that work almost exactly like traditional virtual machines
The word "almost" seems to refer to the fact that the VMs seem to actually be containers of their own. At least this is what I'm looking for signs of this inside of the Orb VM:
$ sudo cat /proc/1/environ
container=lxc
In practice this means that the PIDs seen by the VM are not the same PIDs as seen by the kernel, which breaks the eBPF profiler. I tried working around this by running the profiler via docker run --pid=host ..., but I wasn't able to make it work. I suspect the usage of host PIDs is currently not supported by OrbStack.
Anyway, I ended up firing up a real Linux VM in the cloud. If somebody still figures out a way to make OrbStack work, that'd be great, but for now it should probably be considered an unsupported environment.
@Gandem I think your answer clarified my confusion. Thank you very much. After trying to add your spec to my pod, I encountered this error. ERRO[0092] Request failed: rpc error: code = InvalidArgument desc = mapping is missing attributes ERRO[0096] Request failed: rpc error: code = InvalidArgument desc = mapping is missing attributes
I am wondering if it is related to the `pause` container being written in Go; I will take a further look.
This seems like OTLP profiling signal breakage (we're making lots of breaking changes). If you use the latest devfiler and compile an agent at 47e8410f9, it should work.
FWIW, I just verified today that the profiling agent works inside Docker containers on macOS (both x86 and arm64), as Docker on Mac spins up a Linux VM that supports eBPF.
This is because OrbStack runs the Docker engine in a container, so --pid host will still be in a container and the PIDs reported by the eBPF hook will all be wrong from the userspace agent's perspective. bpf_get_current_pid_tgid() always returns PIDs in the root namespace.
To get the current PID in a namespace:
/* Current task (BTF-typed pointer). */
struct task_struct *task = (struct task_struct *)bpf_get_current_task_btf();
/* Nesting depth of the task's PID namespace. */
unsigned int level = BPF_CORE_READ(task, nsproxy, pid_ns_for_children, level);
/* PID of the thread group leader at that namespace level. */
pid_t pid = BPF_CORE_READ(task, group_leader, thread_pid, numbers[level].nr);
It may also be possible to support profiling from within containers by using another eBPF program to map namespaced PIDs to host ones.