ddprof
PID mode not working with Erlang beam
I am using Elixir. I have exported `ERL_FLAGS="+JPperf map"` and a perf map file is present in /tmp.
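For context, the file that `+JPperf map` produces follows the standard Linux perf map convention: one symbol per line, as `<start-addr-hex> <size-hex> <symbol name>`. A minimal sketch of a parser for that format (the example address and symbol name are made up for illustration):

```python
# Parse one line of a /tmp/perf-<pid>.map file.
# Format per line: "<start-addr-hex> <size-hex> <symbol name>"
# The symbol name may contain spaces, so split at most twice.
def parse_perf_map_line(line):
    start, size, name = line.rstrip("\n").split(" ", 2)
    return int(start, 16), int(size, 16), name

# Hypothetical example line, in the shape BEAM's JIT emits:
start, size, name = parse_perf_map_line("7f3a12340000 1a0 $global::my_fun/1")
print(hex(start), hex(size), name)
```

This is how a profiler maps a sampled instruction pointer inside the JIT region back to an Erlang function name.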
The following works:
sudo ./bin/ddprof --environment test --service test-service --preset cpu_only -l debug iex
but starting iex separately, and running
sudo ./bin/ddprof --environment test --service test-service --preset cpu_only -l debug -p $(pidof beam.smp)
does not.
<NOTICE>Aug 20 12:27:51.522223 ddprof[6194]: datadog.profiling.native.event.count: 0
<NOTICE>Aug 20 12:27:51.522230 ddprof[6194]: datadog.profiling.native.event.lost: 0
<NOTICE>Aug 20 12:27:51.522237 ddprof[6194]: datadog.profiling.native.sample.count: 0
<NOTICE>Aug 20 12:27:51.522243 ddprof[6194]: datadog.profiling.native.target_process.cpu_usage.millicores: 0
<NOTICE>Aug 20 12:27:51.522249 ddprof[6194]: datadog.profiling.native.unwind.avg_time_ns: -1
<NOTICE>Aug 20 12:27:51.522254 ddprof[6194]: datadog.profiling.native.unwind.frames: 0
<NOTICE>Aug 20 12:27:51.522260 ddprof[6194]: datadog.profiling.native.unwind.errors: 0
<NOTICE>Aug 20 12:27:51.522265 ddprof[6194]: datadog.profiling.native.unwind.stack.truncated_input: 0
<NOTICE>Aug 20 12:27:51.522271 ddprof[6194]: datadog.profiling.native.unwind.stack.truncated_output: 0
<NOTICE>Aug 20 12:27:51.522278 ddprof[6194]: datadog.profiling.native.unwind.stack.incomplete: 0
<NOTICE>Aug 20 12:27:51.522285 ddprof[6194]: datadog.profiling.native.unwind.stack.avg_size: -1
<NOTICE>Aug 20 12:27:51.522291 ddprof[6194]: datadog.profiling.native.unwind.stack.avg_depth: -1
<NOTICE>Aug 20 12:27:51.522297 ddprof[6194]: datadog.profiling.native.symbols.jit.reads: 0
<NOTICE>Aug 20 12:27:51.522303 ddprof[6194]: datadog.profiling.native.symbols.jit.failed_lookups: 0
<NOTICE>Aug 20 12:27:51.522308 ddprof[6194]: datadog.profiling.native.symbols.jit.symbol_count: 0
<NOTICE>Aug 20 12:27:51.522314 ddprof[6194]: datadog.profiling.native.profiler.rss: 3399680
<NOTICE>Aug 20 12:27:51.522320 ddprof[6194]: datadog.profiling.native.profiler.cpu_usage.millicores: 1
<NOTICE>Aug 20 12:27:51.522326 ddprof[6194]: datadog.profiling.native.dso.unhandled_sections: 0
<NOTICE>Aug 20 12:27:51.522330 ddprof[6194]: datadog.profiling.native.dso.new: 0
<NOTICE>Aug 20 12:27:51.522335 ddprof[6194]: datadog.profiling.native.dso.size: 0
<NOTICE>Aug 20 12:27:51.522341 ddprof[6194]: datadog.profiling.native.pprof.size: 0
<NOTICE>Aug 20 12:27:51.522347 ddprof[6194]: datadog.profiling.native.profile.duration_ms: 59001
<NOTICE>Aug 20 12:27:51.522353 ddprof[6194]: datadog.profiling.native.aggregation.avg_time_ns: -1
<NOTICE>Aug 20 12:27:51.522357 ddprof[6194]: No statsd socket provided
<NOTICE>Aug 20 12:27:51.522360 ddprof[6194]: DWFL_SYMB NO CALLS
<NOTICE>Aug 20 12:27:51.522363 ddprof[6194]: DSO_SYMB | SIZE | 0
<NOTICE>Aug 20 12:27:51.522370 ddprof[6194]: DWFL_HDR | NB MODS | 0
<NOTICE>Aug 20 12:27:51.522580 ddprof[6194]: [EXPORTER] Export buffer of size 121
I'm unsure if this is the same issue as the other PID-related issues, as in this case nothing shows up at all.
It might be a similar case. We run Julia, and we are using the global profiling option on the node, since we have hit another bug in the PID code (https://github.com/DataDog/ddprof/issues/283). In global mode, the frames from the jitted functions are either missing or they appear as anonymous. You can check the last comments on https://github.com/DataDog/ddprof/issues/276.
If the first command is working, this should be a dupe of this issue: https://github.com/DataDog/ddprof/issues/283. It should not relate to #276, which now seems to be an issue with JITDump; in this case we are using perf maps.
To help us prioritize this: is the wrapper workaround (first command) OK to use? Or do you need PID mode?
I wasn't sure whether the issue I was seeing was the same, because in my case no data shows up at all, whereas in the other issues it looks like at least something shows up.
> To help us prioritize this: is the wrapper workaround (first command) OK to use? Or do you need PID mode?
For not entirely rational reasons, PID mode would be required for me to convince people to try this out: then there is no "scary wrapper" required, and no modification of the existing service deployment methods.
The bug is that we only follow the initial TID. If no activity occurs on that particular TID, we won't show anything.
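To make the failure mode concrete: a process like beam.smp is heavily multi-threaded, and each scheduler runs on its own kernel thread. A sketch (assuming Linux procfs) of listing all the TIDs that a profiler would need to follow, rather than just the initial one:

```python
# Sketch: enumerate the kernel threads (TIDs) of a process via procfs.
# beam.smp typically has many TIDs (schedulers, async threads, ...);
# sampling only the initial TID can yield zero samples if all work
# happens on the other threads.
import os

def list_tids(pid):
    # Each directory entry in /proc/<pid>/task is one thread ID.
    return sorted(int(t) for t in os.listdir(f"/proc/{pid}/task"))

print(list_tids(os.getpid()))
```

This is only an illustration of why following one TID is insufficient, not a description of how ddprof enumerates threads internally.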
Erlang might not be easy to visualize through call stacks. With the working wrapper command, were the graphs helpful? Was the activity understandable?
I only ran a simple web server through it, but I would say they were reasonably understandable.
Here is a screenshot of a profile:
Due to how Erlang processes work, profiling might not be as easy to work with as in some other languages, but `spawn_link` definitely has meaning to an Erlang/Elixir developer.
Over at https://github.com/parca-dev/parca-agent/issues/145#issuecomment-1453134906 someone has been able to get application function calls as well.
Thanks for the screenshot. The risk I see is that a lot of activity will relate to the framework rather than to user code, such as scheduling logic (example: epoll and sched_yield stacks). Tracing might be a more intuitive approach.
Could I trouble you to push the simple web server to a public repo (in a Docker image?) so I can run it on my side and check how/where the perf maps are published?
The Parca Dockerfile from their example should work, I think; if not, let me know and I will create something.
You are right that the trace above shows a lot of epoll etc. However, I also found https://blog.rabbitmq.com/posts/2022/05/flame-graphs/, which shows that you can get a graph of the application code with perf. You are not using perf, of course, but the same data should be available, right?
Thanks, I tried it in wrapper mode. I am seeing stacks with JIT information from Erlang. I am not sure how helpful this would be to users.
In --global mode, we do not match the correct filename (due to a PID namespace issue).
For whole-host mode, the fix is to retrieve the namespaced PID and look for the perf map with the correct name.
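To illustrate the mismatch: a containerized BEAM writes its perf map as /tmp/perf-&lt;pid&gt;.map using the PID it sees inside its own namespace, while a whole-host profiler observes the host PID. A sketch (assuming Linux &ge; 4.1, which exposes the `NSpid:` field in /proc/&lt;pid&gt;/status) of the translation step; this is an illustration, not ddprof's actual implementation:

```python
# Sketch: translate a host-side PID into the PID the process sees inside
# its innermost PID namespace, so the right perf-<pid>.map can be located.
import os

def namespaced_pid(pid):
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("NSpid:"):
                # Fields list the PID in each nested namespace,
                # outermost first; the last one is the innermost view.
                return int(line.split()[-1])
    # Kernels without NSpid support: fall back to the host PID.
    return pid

pid = os.getpid()
print(f"perf map name inside the namespace: perf-{namespaced_pid(pid)}.map")
```

The perf map then has to be read from the target's mount namespace as well (e.g. via /proc/&lt;pid&gt;/root/tmp/), since the container's /tmp is not the host's /tmp.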