only kernel stack frames reported inside orbstack docker container on macOS
I'm using OrbStack on an arm64 Mac to profile a Linux app.
perf works as expected and shows me native symbols from user-land activity,
but in devfiler I'm not seeing much come through. It seems connected and there are no errors, but there is very little data. I see some kernel symbols.
How can I debug this further?
but there is very little data
Is the samples timeline and/or the flamegraph empty or are you missing symbols?
If you see frames from your app without further information (symbols missing):
- make sure your app is built with debug symbols
- drag & drop your app into the devfiler window to extract symbols
If that doesn't help, you can enable the "dev" mode by double-clicking the icon to the left of the "devfiler" menu entry. You will then see some more menu items. Check whether you get gRPC messages and whether the DB stats show entries for TraceEvents, Stacktraces, etc., and let us know what you see.
The flamegraph is empty:
I'm not seeing much data in the dev menus:
Tried with a UTM.app VM and got the same thing. Then I tried the old devfiler 0.6.0 and it started showing a lot more data! It still has those build-id errors, so I will downgrade to 7d2285e14767c7abf4cdbe0927bf7857d1037076 for now.
EDIT: the old builds work in the VM, but still not inside OrbStack. Will try inside Docker in the VM next.
In UTM it works on the host:
Linux ubuntu 6.2.0-39-generic #40-Ubuntu SMP PREEMPT_DYNAMIC Tue Nov 14 23:07:44 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
but inside Docker I get:
ERRO[0000] Failed to load eBPF tracer: failed to load eBPF code: your kernel version 6.2.0 is affected by a Linux kernel bug that can lead to system freezes, terminating host agent now to avoid triggering this bug
Inside OrbStack the kernel is newer:
Linux orbstack 6.10.11-orbstack-00280-g1304bd068592 #21 SMP Sat Sep 21 10:45:28 UTC 2024 aarch64 GNU/Linux
The issue seems specific to running within a container. Only kernel activity is shown. Reproduced via docker exec on Ubuntu VM.
Just to clarify, the profiler is a system-wide profiler, so it requires root privileges. Can you confirm that you run the docker container with --privileged?
Ideally, try running the docker container with something like
docker run --privileged --pid=host -v /etc/machine-id:/etc/machine-id:ro \
-v /var/run/docker.sock:/var/run/docker.sock -v /sys/kernel/debug:/sys/kernel/debug:ro ...
Yes, I was using --net=host --privileged=true, but I will try those other options.
Thanks, with those options it works in Docker on Ubuntu. I see all the information from the host too, which isn't ideal, but better than nothing.
In OrbStack there's still only kernel data, but I assume that's something specific to that environment.
Thanks, with those options it works in Docker on Ubuntu.
Thanks for testing.
I see all the information from the host too, which isn't ideal, but better than nothing.
That's exactly what the profiler has been designed for: getting all information from the host while doing continuous profiling. Filtering is assumed to be done on the backend or by the user interface.
But if you think that limiting the view/collection of the profiler is a realistic use case, please open a separate issue with your ideas for discussion.
In OrbStack there's still only kernel data, but I assume that's something specific to that environment.
Maybe someone working on macOS can chime in here.
Could you please reword the GH issue title and possibly provide profiler logs when starting with -v and/or -bpf-log-level 2?
I'm hitting a similar issue with a container running on K8s: there are only kernel stack frames. The base image of the container is Ubuntu. The agent works in a VM environment; in the VM, I am able to see other frames, like Java or Python.
I have granted the container privileged access.
securityContext:
  allowPrivilegeEscalation: true
  capabilities:
    add:
      - CAP_SYS_ADMIN
  privileged: true
One thing to note: when the container is up and ebpf-profiler is run for the first time, it fails with the error below
ERRO[0000] Failed to probe tracepoint: failed to get id for tracepoint: failed to read tracepoint ID for sys_enter_mmap: open /sys/kernel/debug/tracing/events/syscalls/sys_enter_mmap/id: no such file or directory
I fixed it by mounting debugfs and tracefs:
sudo mount -t debugfs none /sys/kernel/debug
sudo mount -t tracefs none /sys/kernel/debug/tracing
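For context, the probe that fails above is essentially a read of the tracepoint ID file from tracefs, which is why the mounts fix it. A minimal Go sketch of that kind of lookup (just an illustration under my own assumptions, not the profiler's actual code):

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readTracepointID reads the numeric ID of a tracepoint from tracefs,
// e.g. group="syscalls", name="sys_enter_mmap". It fails with
// "no such file or directory" when debugfs/tracefs is not mounted
// inside the container.
func readTracepointID(group, name string) (int, error) {
	path := fmt.Sprintf("/sys/kernel/debug/tracing/events/%s/%s/id", group, name)
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.Atoi(strings.TrimSpace(string(data)))
}

func main() {
	id, err := readTracepointID("syscalls", "sys_enter_mmap")
	if err != nil {
		fmt.Println("tracefs not available:", err)
		return
	}
	fmt.Println("sys_enter_mmap tracepoint id:", id)
}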
I attach the log (run with option -v):
ebpf-profiler.log
@leonard520 Regarding the "Failed to probe tracepoint": Can you please update to the latest ebpf-profiler? The tracepoint check has since been dropped.
@tmm1 I just tried the latest main branch. It still has the error "Failed to probe tracepoint". Does it come from here
Sorry, my fault. The code change I was referring to is still in work: https://github.com/open-telemetry/opentelemetry-ebpf-profiler/pull/175
@rockdaboot Do you have any clue why there are only kernel stack frames in my container environment? Feel free to let me know if you want me to try something.
@leonard520 I assume that for some reason the unwinder runs into an error. Ideally, we could reproduce this somehow on amd64 (can you?). @fabled, maybe you can have a look at the above ebpf-profiler.log - I don't find anything in there that helps.
@leonard520 I assume that you use devfiler for visualization. Can you run the profiler with -send-error-frames? I assume that you see the error frames in red directly under the root frame in the flamegraph. It tells you why the unwinding failed, hopefully that is a hint.
@rockdaboot Today I reproduced the issue again over a longer period. I notice there are a lot of log lines like the one below:
DEBU[0006] Failed to get a cgroupv2 ID as container ID for PID 1006390: open /proc/1006390/cgroup: no such file or directory
I ran ps both in the container and on the host worker node. The PID appears to be a host worker node PID rather than one from the container itself; as a result, the directory /proc/1006390/cgroup only exists on the node. I am wondering if this is a problem.
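For illustration, the lookup that produces this debug message boils down to reading /proc/<pid>/cgroup and extracting a container ID from the cgroup path, so the error just means that PID's /proc entry is not visible from where the profiler reads it. A rough Go sketch of the idea (my own simplification, not the profiler's actual implementation):

package main

import (
	"fmt"
	"os"
	"regexp"
)

// A 64-hex-digit token in the cgroup path is typically the container ID
// (assumption: cgroup v2 with a single "0::/<path>" line).
var containerIDRe = regexp.MustCompile("[0-9a-f]{64}")

// containerIDForPID reads /proc/<pid>/cgroup and extracts a container ID.
// It fails with "no such file or directory" if <pid> is not present in the
// /proc instance the profiler sees - exactly the error in the log above.
func containerIDForPID(pid int) (string, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/cgroup", pid))
	if err != nil {
		return "", err
	}
	id := containerIDRe.Find(data)
	if id == nil {
		return "", fmt.Errorf("no container ID in cgroup path")
	}
	return string(id), nil
}

func main() {
	fmt.Println(containerIDForPID(os.Getpid()))
}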
I am wondering if this is a problem.
No, this is not a problem. I made a test where I changed the code to look into cgroupxxx here, to trigger exactly this error every time. I still see all kinds of frames.
@rockdaboot Thanks for the verification. I did another test in the container and recorded the following. I am wondering how the profiler lists all the PIDs to trace (see the enumeration sketch after the log below).
- Check the PID for Java process in container
ps aux | grep java
root **289** 9.5 0.1 2598068 101724 pts/2 Sl+ 03:13 0:03 java -Dserver.port=8888 -jar demo.jar
root 313 0.0 0.0 6508 2244 pts/3 S+ 03:13 0:00 grep --color=auto java
- Check the PID for Java process in node
root **1888084** 8.9 0.2 3596600 134912 ? Sl+ 03:13 0:05 java -Dserver.port=8888 -jar demo.jar
- Check PID information in the log. I am not able to find PID 289, only 1888084; however, for 1888084 I found the messages below. It looks to me like the PID can't be parsed.
DEBU[0056] => PID: 1888084
DEBU[0056] = PID: 1888084
DEBU[0056] - PID: 1888084
DEBU[0056] Skip process exit handling for unknown PID 1888084
DEBU[0057] => PID: 1888084
DEBU[0057] = PID: 1888084
DEBU[0057] - PID: 1888084
DEBU[0057] Skip process exit handling for unknown PID 1888084
DEBU[0058] Failed to get a cgroupv2 ID as container ID for PID 1888084: open /proc/1888084/cgroup: no such file or directory
DEBU[0058] => PID: 1888084
DEBU[0058] = PID: 1888084
DEBU[0058] - PID: 1888084
DEBU[0058] Skip process exit handling for unknown PID 1888084
Attaching the full log for reference: ebpf-bad.log
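For background (not necessarily how this profiler does it internally), system-wide profilers typically discover processes by scanning /proc for numeric entries, in addition to reacting to kernel events. A minimal Go sketch of such a scan, which also shows why an agent without the host's /proc only sees its own PID namespace:

package main

import (
	"fmt"
	"os"
	"strconv"
)

// listPIDs returns the PIDs of all processes visible in the mounted /proc.
// Inside a container without hostPID / the host's /proc, this only covers
// the container's own PID namespace, so a host PID like 1888084 would not
// be resolvable there.
func listPIDs() ([]int, error) {
	entries, err := os.ReadDir("/proc")
	if err != nil {
		return nil, err
	}
	var pids []int
	for _, e := range entries {
		if !e.IsDir() {
			continue
		}
		if pid, err := strconv.Atoi(e.Name()); err == nil {
			pids = append(pids, pid)
		}
	}
	return pids, nil
}

func main() {
	pids, err := listPIDs()
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("visible PIDs:", len(pids))
}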
@leonard520 If I understand correctly you're running the profiler in Kubernetes, right?
In that case, you will need to ensure the following is set:
hostPID: true # Setting hostPID to true on the Pod so that the PID namespace is that of the host
containers:
  ...
  securityContext:
    runAsUser: 0
    privileged: true # Running in privileged mode
    procMount: Unmasked # Setting procMount to Unmasked
    capabilities:
      add:
        - SYS_ADMIN # Adding SYS_ADMIN capability
Specifically, setting hostPID: true and procMount: Unmasked should ensure that the PIDs align between the container and the host.
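As a quick sanity check (a rough sketch of mine, not part of the profiler), you can verify from inside the container whether it actually shares the host's PID namespace by looking at what PID 1 is:

package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// With hostPID: true, PID 1 inside the container is the host's init
	// (usually "systemd"); without it, PID 1 is the container's entrypoint.
	comm, err := os.ReadFile("/proc/1/comm")
	if err != nil {
		fmt.Println("cannot read /proc/1/comm:", err)
		return
	}
	fmt.Println("PID 1 is:", strings.TrimSpace(string(comm)))
}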
@Gandem I think your answer clarified my confusion. Thank you very much. After trying to add your spec to my pod, I encountered this error.
INFO[0070] eBPF tracer loaded
ERRO[0080] Failed to handle mapping for PID 21720, file /pause: failed to extract interval data: failed to extract stack deltas from /pause: failure to parse golang stack deltas: failed to load .gopclntab section: EOF
INFO[0090] Attached tracer program
INFO[0090] Attached sched monitor
ERRO[0092] Request failed: rpc error: code = InvalidArgument desc = mapping is missing attributes
ERRO[0096] Request failed: rpc error: code = InvalidArgument desc = mapping is missing attributes
I am wondering if it is related to the pause container being written in Go; I will take a further look.
On the other hand, I’m also considering whether this approach has security risks. Sharing the same PID namespace with the host reduces isolation and increases the potential for container escape.
In OrbStack there's still only kernel data, but I assume that's something specific to that environment.
I hit the same issue with OrbStack. I suspect the issue is caused by the fact that:
OrbStack runs full-blown Linux machines that work almost exactly like traditional virtual machines
The word "almost" seems to refer to the fact that the VMs seem to actually be containers of their own. At least this is what I'm looking for signs of this inside of the Orb VM:
$ sudo cat /proc/1/environ
container=lxc
In practice this means that the PIDs seen by the VM are not the same PIDs as seen by the kernel, which breaks the eBPF profiler. I tried working around this by running the profiler via docker run --pid=host ..., but I wasn't able to make it work. I suspect the usage of host PIDs is currently not supported by OrbStack.
Anyway, I ended up firing up a real Linux VM in the cloud. If somebody still figures out a way to make OrbStack work, that'd be great, but for now it should probably be considered an unsupported environment.
@Gandem I think your answer clarified my confusion. Thank you very much. After trying to add your spec to my pod, I encountered this error. ERRO[0092] Request failed: rpc error: code = InvalidArgument desc = mapping is missing attributes ERRO[0096] Request failed: rpc error: code = InvalidArgument desc = mapping is missing attributes
I am wondering if it is related to the `pause` container being written in Go; I will take a further look.
This seems like OTLP profiling signal breakage (we're making lots of breaking changes). If you use the latest devfiler and compile an agent at 47e8410f9, it should work.
FWIW, I just verified today that the profiling agent works inside Docker containers on macOS (both x86 and arm64), as Docker on Mac spins up a Linux VM that supports eBPF.
This is because OrbStack runs the Docker engine in a container, so --pid host will still be in a container and the PIDs reported by the eBPF hook will all be wrong from the userspace agent's perspective. bpf_get_current_pid_tgid() always returns PIDs in the root namespace.
To get the current PID in a namespace:
/* Current task (BTF-typed pointer). */
struct task_struct *task = (struct task_struct *)bpf_get_current_task_btf();
/* Nesting depth of the task's PID namespace. */
unsigned int level = BPF_CORE_READ(task, nsproxy, pid_ns_for_children, level);
/* PID of the thread group leader at that namespace level. */
pid_t pid = BPF_CORE_READ(task, group_leader, thread_pid, numbers[level].nr);
It may also be possible to support profiling from within containers by using another eBPF program to map namespaced PIDs to host ones.