
Possible JVM monitoring issue after updating to 1.23.19

Open · T100D opened this issue 8 months ago · 7 comments

There are no visible issues in the Coroot web interface.

Coroot: 1.10.2
Coroot-node-agent: 1.23.19
Prometheus: 2.53.4
ClickHouse: 25.4.1.2934

Non-Docker installation.
Linux 4.18.0-553.50.1.el8_10.x86_64 #1 SMP Wed Apr 16 11:36:26 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux (Rocky Linux 8.x)

program = elasticsearch on Java 15.0.1 (as seen in Coroot)

[root@montst bin]# java -version
openjdk version "1.8.0_452"
OpenJDK Runtime Environment (build 1.8.0_452-b09)
OpenJDK 64-Bit Server VM (build 25.452-b09, mixed mode)

(The Graylog instance in #199 is running on a different, newer Java version.)

/var/log/messages

Apr 24 17:25:41 montst coroot-node-agent[13352]: I0424 17:25:41.491141   13352 profiling.go:245] JVM detected PID: 13561, perfmap dump supported: true
Apr 24 17:25:41 montst coroot-node-agent[13352]: W0424 17:25:41.512427   13352 profiling.go:255] failed to dump perfmap of JVM 13561: status:-
Apr 24 17:25:41 montst coroot-node-agent[13352]: I0424 17:25:41.639650   13352 profiling.go:139] collected 6 profiles in 149ms
Apr 24 17:25:41 montst coroot-node-agent[13352]: I0424 17:25:41.661209   13352 profiling.go:149] uploaded 6 profiles in 21ms

cat /proc/13561/status

Name:   java
Umask:  0022
State:  S (sleeping)
Tgid:   13561
Ngid:   0
Pid:    13561
PPid:   1
TracerPid:      0
Uid:    990     990     990     990
Gid:    986     986     986     986
FDSize: 1024
Groups: 986
NStgid: 13561
NSpid:  13561
NSpgid: 13561
NSsid:  13561
VmPeak: 84787288 kB
VmSize: 84760180 kB
VmLck:   7559844 kB
VmPin:         0 kB
VmHWM:   5446848 kB
VmRSS:   5446848 kB
RssAnon:         4886256 kB
RssFile:          560592 kB
RssShmem:              0 kB
VmData:  4989328 kB
VmStk:       132 kB
VmExe:         4 kB
VmLib:     27184 kB
VmPTE:     13420 kB
VmSwap:        0 kB
HugetlbPages:          0 kB
CoreDumping:    0
THP_enabled:    1
Threads:        84
SigQ:   0/46634
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 2000000181005ccf
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
NoNewPrivs:     1
Seccomp:        2
Speculation_Store_Bypass:       thread vulnerable
Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
Cpus_allowed_list:      0-127
Mems_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list:      0
voluntary_ctxt_switches:        26
nonvoluntary_ctxt_switches:     2

T100D · Apr 24 '25 16:04

We forgot to mention in the docs that the agent relies on JVM mechanisms introduced in JDK 17 😕 (JDK-8254723). We should add a version check with more meaningful logging.
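For illustration, such a version gate could look roughly like this. This is a minimal sketch, not the agent's actual code: jvmMajorVersion is a hypothetical helper (it could, for example, read java.specification.version via the attach API's "properties" command), and only the DumpPerfmap/klog shapes referenced elsewhere in this thread are assumed.

package jvm

import (
	"fmt"

	"k8s.io/klog/v2"
)

// minPerfmapJDK is the first JDK release that ships "jcmd Compiler.perfmap" (JDK-8254723).
const minPerfmapJDK = 17

// jvmMajorVersion is a hypothetical helper: it could derive the major version
// from the attach API's "properties" command (java.specification.version).
func jvmMajorVersion(pid uint32) (int, error) {
	return 0, fmt.Errorf("not implemented")
}

// tryDumpPerfmap skips pre-17 JDKs with an explicit log line instead of
// failing later with an opaque "status:-" error.
func tryDumpPerfmap(pid uint32) {
	major, err := jvmMajorVersion(pid)
	if err != nil {
		klog.Warningf("JVM %d: cannot determine JDK version: %v", pid, err)
		return
	}
	if major < minPerfmapJDK {
		klog.Infof("JVM %d: perfmap dump requires JDK >= %d, found %d; skipping", pid, minPerfmapJDK, major)
		return
	}
	// ... proceed with the existing jattach-based dump ...
}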

def · Apr 24 '25 16:04

Already thought it could be something like that, thank you.

T100D · Apr 24 '25 17:04

Something to keep in mind.

After updating the system Java binary to 17, the node log at first shows that scraping is possible, but it later fails for Elasticsearch, which uses its own bundled Java (15) binary.
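A quick way to confirm which binary a JVM process actually executes, since a service like Elasticsearch may bundle its own JDK regardless of the system default. A self-contained snippet, not agent code; the PID is the Elasticsearch JVM from the log above:

package main

import (
	"fmt"
	"os"
)

// Resolve /proc/<pid>/exe to see the real executable behind the process.
func main() {
	pid := 13561
	exe, err := os.Readlink(fmt.Sprintf("/proc/%d/exe", pid))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(exe) // e.g. a bundled .../jdk/bin/java rather than the system java
}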

T100D · Apr 25 '25 12:04

Additional log information would be very useful.

I encountered the following error:

failed to dump perfmap of JVM 18768: status:-

I went through several files to understand why many profiles contained [unknown] entries. The actual root cause wasn’t obvious until I read your post, @def.

The error occurs here: https://github.com/coroot/coroot-node-agent/blob/34373d23fde80ab03a6b19d79e1af06e8f6f78b3/jvm/perfmap.go#L15

Dial returns a *JVM, on which DumpPerfmap is then called: https://github.com/coroot/coroot-node-agent/blob/34373d23fde80ab03a6b19d79e1af06e8f6f78b3/jvm/jattach.go#L70

It took me several hours of digging to discover that another prerequisite for Coroot's eBPF profiling is JDK >= 17. More verbose logging would save others from going down that rabbit hole.
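For anyone following the same trail, here is a rough, self-contained sketch of the HotSpot dynamic-attach exchange that a perfmap dump boils down to, as implemented by jattach-style tools; it is not the agent's exact code, and it assumes /tmp/.java_pid<pid> already exists (i.e. the .attach_pid/SIGQUIT handshake has been done):

package main

import (
	"fmt"
	"io"
	"net"
	"os"
)

func dumpPerfmap(pid int) error {
	// The JVM listens on this unix socket once attach has been triggered.
	conn, err := net.Dial("unix", fmt.Sprintf("/tmp/.java_pid%d", pid))
	if err != nil {
		return err
	}
	defer conn.Close()
	// Request: protocol version "1", the command, then exactly three
	// NUL-terminated arguments (unused ones stay empty).
	for _, s := range []string{"1", "jcmd", "Compiler.perfmap", "", ""} {
		if _, err := conn.Write(append([]byte(s), 0)); err != nil {
			return err
		}
	}
	// The response starts with a status line; on a pre-17 JDK the
	// Compiler.perfmap diagnostic command simply does not exist, which is
	// where the opaque failure in the agent log comes from.
	out, err := io.ReadAll(conn)
	if err != nil {
		return err
	}
	fmt.Print(string(out))
	return nil
}

func main() {
	if err := dumpPerfmap(13561); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}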

infor7 · Oct 21 '25 14:10

And don't forget:

"Coroot automates this step by periodically calling jcmd in the background. However, the JVM must be started with the -XX:+PreserveFramePointer option. This allows for accurate stack traces and proper symbolization of JIT-compiled code, with only a small performance overhead (typically around 1-3%)."

https://coroot.com/blog/troubleshooting-java-applications-with-coroot/
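For a non-container Elasticsearch install like the one in this issue, the flag can be added via a jvm.options.d drop-in followed by a service restart; the path below is the usual RPM layout, so adjust it for your installation:

# /etc/elasticsearch/jvm.options.d/preserve-frame-pointer.options
-XX:+PreserveFramePointer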

T100D · Oct 21 '25 14:10

Yeah, Coroot has a clear statement about preserving frame pointers at https://docs.coroot.com/profiling/ebpf-based-profiling.

We have nearly 2k independent apps in our k8s environment, but this was the first time that I added the preserve-frame-pointer flag and the profiles were still not displayed properly for a Java app.

My first thought was that the Java sidecar deployed next to the main Java app container in the pod didn't have the flag. I worked with the developers to enable it, but that wasn't the cause, so I started digging into the code and searching the GitHub issues, and finally found this one.
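For reference, one low-friction way to roll the flag out per pod (the container and image names here are illustrative) is JAVA_TOOL_OPTIONS, which HotSpot picks up automatically:

containers:
  - name: app
    image: example/java-app:latest
    env:
      - name: JAVA_TOOL_OPTIONS
        value: "-XX:+PreserveFramePointer"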

infor7 · Oct 21 '25 15:10

I have submitted a change request for the JVM monitoring page.

T100D · Oct 21 '25 15:10