hotspot
perf parsing opens the same files over and over again -> `EMFILE`
**Describe the bug**
A perf record of a bunch of processes cannot be exported (directly from the command line, to avoid opening anything unnecessary). As non-root, hotspot seems to hang after a bunch of messages like:
failed to report elf for pid = 693568 : ElfInfo{localFile="/root/.debug/usr/lib64/libgmp.so.10.3.2/b7810a6ea7427180050fb6ab1364903d4f701c9d/elf", isFile=true, originalFileName="libgmp.so.10.3.2", originalPath="/usr/lib64/libgmp.so.10.3.2", addr=7f3a00684000, len=295000, pgoff=0, baseAddr=n/a} : Too many open files
As root those messages are seen over and over again.
**To Reproduce**
Do a system-wide trace of a workload that involves a lot of processes, then run `hotspot --exportTo out.perfparser perf.data`.
**Expected behavior**
Each file is only opened once; if that is not possible, each PID is handled separately (closing everything after the PID has been handled; optionally behind a `--save-but-slow` option).
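The "each file is only opened once" behavior could be sketched as a small descriptor cache keyed by path, so repeated symbol lookups reuse one descriptor. This is a hypothetical illustration; the name `FdCache` is made up, and this is not how perfparser/elfutils are actually implemented:

```python
import os
import tempfile

class FdCache:
    """Open each file at most once and hand out the cached descriptor.

    Hypothetical sketch of the "open every file only once" behavior
    requested above; real perfparser/elfutils code looks different.
    """

    def __init__(self):
        self._fds = {}

    def get(self, path):
        # Reuse the descriptor if this path was opened before.
        if path not in self._fds:
            self._fds[path] = os.open(path, os.O_RDONLY)
        return self._fds[path]

    def close_all(self):
        for fd in self._fds.values():
            os.close(fd)
        self._fds.clear()

# Demo: two lookups of the same path yield the same descriptor.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    path = tmp.name
cache = FdCache()
assert cache.get(path) == cache.get(path)  # file opened only once
cache.close_all()
os.unlink(path)
```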
**Version Info**
- Linux Kernel version: 4.18.0-513.9.1.el8_9.x86_64
- perf version: 4.18.0-513.18.1.el8_9.x86_64
- hotspot version (appimage? selfcompiled?): hotspot 1.5.80 from appimage
**Additional context**
It seems that the same files are opened multiple times to resolve the symbols. I conclude that because the first PIDs that have libgmp loaded had no problem at all, but after a while I get this error message for each PID in the trace.
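For illustration, the failure mode can be reproduced in isolation: tighten the process's fd limit and keep opening the same file without closing it, and the kernel eventually returns `EMFILE`. This is a standalone demo of the mechanism, unrelated to hotspot's actual code:

```python
import errno
import os
import resource
import tempfile

# Opening the same file again and again without closing it leaks one
# descriptor per open() and eventually exhausts the per-process limit.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    path = tmp.name

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (64, hard))  # tighten the limit

fds = []
error = None
try:
    for _ in range(1000):
        fds.append(os.open(path, os.O_RDONLY))  # never closed -> leak
except OSError as exc:
    error = exc
finally:
    for fd in fds:
        os.close(fd)
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
    os.unlink(path)

assert error is not None and error.errno == errno.EMFILE
```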
this is an inherent limitation of elfutils: we must have a per-PID dwfl instance, and each one separately processes all encountered ELFs. Meaning, if you have lots of long-lived processes that encounter a ton of ELFs, then you may simply run out of file descriptors - I don't see a way to prevent that on our side.

and no, doing per-PID processing or nuking the dwfls is not an option, as that would be far too slow for situations where you have enough file descriptors.
the good news is that elfutils might get some new API for that in the future which would allow us to better reuse data across PIDs and thus drastically reduce the work required: https://sourceware.org/pipermail/elfutils-devel/2024q4/007674.html
> the good news is that elfutils might get some new API for that in the future which would allow us to better reuse data across PIDs and thus drastically reduce the work required
That RFC does sound promising in general, especially as we bundle elfutils, so users would get access to this quickly.
> if you have lots of long lived processes that encounter a ton of elfs, then you may simply run out of file descriptors - I don't see a way to prevent that on our side.
>
> and no, doing per-pid processing or nuking the dwfl's is not an option as that would be far too slow for situations where you have enough file descriptors.
I see the point, but there can be another conclusion:

- The current default should not be changed, because it works in our current scenario and will work much better if/when the RFC (which, per the last notes from Dec 2024, is still in its design phase) has made it into a working elfutils version, perfparser has been adjusted to make use of it, and we either use the appimage or have the most current elfutils available at build time.
- Because we know for sure that with the current elfutils the current implementation will fail on file descriptors:
  - It would be good to stop processing when perfparser hits a threshold of errors (ideally scoped to non-recoverable ones, i.e. `ENOMEM`/`EMFILE`), because otherwise we get the same error over and over again for all further processing (and we may still be able to use what was already parsed). I terminated the process after possibly ~5-10 minutes of flooded error messages in the terminal. Should this be a separate FR (or even a bug report)?
  - An optional `--dwfl-per-pid`/`--minimal-memory` option ("free resources as fast as possible - very slow, but can help with `EMFILE`/`ENOMEM` during parsing") would be good. This also gives per-PID filtering (#524) one more big reason, because filtering is better than "crashing".
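The error-threshold idea above could look roughly like this. A hypothetical Python sketch, not perfparser code; `parse_pid`, the threshold value, and the simulated failure are all made up for illustration:

```python
import errno

# Abort further processing once EMFILE/ENOMEM has been seen N times,
# but keep the results parsed so far.
NON_RECOVERABLE = {errno.EMFILE, errno.ENOMEM}

def parse_all(pids, parse_pid, threshold=10):
    """Parse each pid; give up early after `threshold` fatal errors."""
    results = {}
    fatal_errors = 0
    for pid in pids:
        try:
            results[pid] = parse_pid(pid)
        except OSError as exc:
            if exc.errno in NON_RECOVERABLE:
                fatal_errors += 1
                if fatal_errors >= threshold:
                    break  # keep what we already parsed
    return results

# Simulated run: pids >= 5 always fail with EMFILE.
def fake_parse(pid):
    if pid >= 5:
        raise OSError(errno.EMFILE, "Too many open files")
    return f"symbols-{pid}"

out = parse_all(range(100), fake_parse, threshold=3)
assert sorted(out) == [0, 1, 2, 3, 4]  # later pids skipped after 3 failures
```

This keeps the already-parsed data usable instead of flooding the terminal with the same error for every remaining PID.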