otel-profiling-agent icon indicating copy to clipboard operation
otel-profiling-agent copied to clipboard

HotSpot: Unable to find heap for stub

Open christos68k opened this issue 9 months ago • 3 comments

I saw this in a production deployment while investigating an excessive memory use issue. Agent logs are flooded with:

time="2025-03-17T16:40:21.677891679Z" level=warning msg="Unable to find heap for stub: pid = 3372980, stub.start = 0xffff7fa63b10"
time="2025-03-17T16:40:21.677940171Z" level=warning msg="Unable to find heap for stub: pid = 3372980, stub.start = 0xffff7fa63b40"
time="2025-03-17T16:40:21.677955179Z" level=warning msg="Unable to find heap for stub: pid = 3372980, stub.start = 0xffff7fa63b70"
time="2025-03-17T16:40:21.677967823Z" level=warning msg="Unable to find heap for stub: pid = 3372980, stub.start = 0xffff7fa63ba0"
time="2025-03-17T16:40:21.677981542Z" level=warning msg="Unable to find heap for stub: pid = 3372980, stub.start = 0xffff7fa63bd0"
time="2025-03-17T16:40:21.677995302Z" level=warning msg="Unable to find heap for stub: pid = 3372980, stub.start = 0xffff7fa63be0"
time="2025-03-17T16:40:21.678008012Z" level=warning msg="Unable to find heap for stub: pid = 3372980, stub.start = 0xffff7fa63be4"
time="2025-03-17T16:40:21.678026096Z" level=warning msg="Unable to find heap for stub: pid = 3372980, stub.start = 0xffff7fa63bf0"
time="2025-03-17T16:40:21.678038494Z" level=warning msg="Unable to find heap for stub: pid = 3372980, stub.start = 0xffff7fa63bf4"
time="2025-03-17T16:40:21.678057317Z" level=warning msg="Unable to find heap for stub: pid = 3372980, stub.start = 0xffff7fa63c00"
time="2025-03-17T16:40:21.678073211Z" level=warning msg="Unable to find heap for stub: pid = 3372980, stub.start = 0xffff7fa63c04"
time="2025-03-17T16:40:21.678085002Z" level=warning msg="Unable to find heap for stub: pid = 3372980, stub.start = 0xffff7fa63c10"
time="2025-03-17T16:40:22.127957624Z" level=warning msg="Unable to find heap for stub: pid = 3373055, stub.start = 0xffffbe9c3b10"
time="2025-03-17T16:40:22.128003409Z" level=warning msg="Unable to find heap for stub: pid = 3373055, stub.start = 0xffffbe9c3b40"
time="2025-03-17T16:40:22.128017112Z" level=warning msg="Unable to find heap for stub: pid = 3373055, stub.start = 0xffffbe9c3b70"
time="2025-03-17T16:40:22.128028862Z" level=warning msg="Unable to find heap for stub: pid = 3373055, stub.start = 0xffffbe9c3ba0"
time="2025-03-17T16:40:22.128043311Z" level=warning msg="Unable to find heap for stub: pid = 3373055, stub.start = 0xffffbe9c3bd0"
time="2025-03-17T16:40:22.128055848Z" level=warning msg="Unable to find heap for stub: pid = 3373055, stub.start = 0xffffbe9c3be0"
time="2025-03-17T16:40:22.128069289Z" level=warning msg="Unable to find heap for stub: pid = 3373055, stub.start = 0xffffbe9c3be4"
time="2025-03-17T16:40:22.128081252Z" level=warning msg="Unable to find heap for stub: pid = 3373055, stub.start = 0xffffbe9c3bf0"
time="2025-03-17T16:40:22.128093592Z" level=warning msg="Unable to find heap for stub: pid = 3373055, stub.start = 0xffffbe9c3bf4"
time="2025-03-17T16:40:22.128108296Z" level=warning msg="Unable to find heap for stub: pid = 3373055, stub.start = 0xffffbe9c3c00"
time="2025-03-17T16:40:22.128120522Z" level=warning msg="Unable to find heap for stub: pid = 3373055, stub.start = 0xffffbe9c3c04"
time="2025-03-17T16:40:22.128136448Z" level=warning msg="Unable to find heap for stub: pid = 3373055, stub.start = 0xffffbe9c3c10"

Unfortunately, I couldn't peek at the relevant Pods and extract more JVM information such as version. The same agent was using excessive amounts of memory (~600M) and was an outlier (every other agent in the cluster was below 300M).

The agent was running on an arm64 Node (the code path that produces this warning is only relevant for arm64).

CC: @athre0z

christos68k avatar Mar 20 '25 16:03 christos68k

In this code path we use an unbounded map - maybe this is the reason for the increased memory usage:

https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/0708a7f9517c415ec6967e17535d15569356e80d/interpreter/hotspot/instance.go#L96-L97

florianl avatar Mar 20 '25 16:03 florianl

In this code path we use an unbounded map - maybe this is the reason for the increased memory usage:

I discussed this a little with @fabled today, warrants further investigation.

christos68k avatar Mar 20 '25 16:03 christos68k

In this code path we use an unbounded map - maybe this is the reason for the increased memory usage:

opentelemetry-ebpf-profiler/interpreter/hotspot/instance.go

Lines 96 to 97 in 0708a7f

I believe there is a fairly low, fixed amount of stubs. Those are extracted from the introspection data of the hotspot shared object.

It seems that those stubs are walked periodically. Perhaps they are JITted on demand. But the remaining question is if those can be re-JITed for some reason. @athre0z wrote the code, can you say on top of the head if the stubs can be rejitted? Or are they jitted once at startup, if so why they are periodiocally re-read?

fabled avatar Mar 22 '25 11:03 fabled