
Liberty InstantOn checkpoint failed on Java11 UBI9-min on Power when using daytrader10 app

Open tam512 opened this issue 6 months ago • 3 comments

Taking a checkpoint of the daytrader10 app with embedded JMS on a Power VM running Ubuntu 22.04 failed with the following errors. I do not see this problem when checkpointing daytrader10 with MQ on Power.

CWWKE0962E: The server checkpoint request failed. The following output is from the CRIU /logs/checkpoint/checkpoint.log file that contains details on why the checkpoint failed.
Warn  (criu/kerndat.c:1153): $XDG_RUNTIME_DIR not set. Cannot find location for kerndat file
Warn  (criu/kerndat.c:1153): $XDG_RUNTIME_DIR not set. Cannot find location for kerndat file
Warn  (compel/src/lib/infect.c:133): Unable to interrupt task: 1138 (Operation not permitted)
Error (criu/parasite-syscall.c:478): Can't retrieve FDs from socket
Error (compel/src/lib/infect-rpc.c:51): Communication error, this is not the ack we expected
Error (criu/cr-dump.c:1674): Dump files (pid: 1025) failed with -1
pie: 1025: Trimmed message received (12/-104)
Error (criu/cr-dump.c:2098): Dumping FAILED.
  • Kernel level: 5.15.0-131-generic
  • OS

PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

I posted logs with a CRIU trace in the instanton-group-work Slack.

tam512 · May 15 '25 19:05

@tajila

pshipton · May 15 '25 19:05

I only see this problem when checkpointing the daytrader10 app (with embedded JMS) using the OL/WL Java 11 UBI9 image.

tam512 · May 30 '25 14:05

This appears to be caused by the ulimit on the number of open files being too low. I was only able to reproduce this by attempting the checkpoint via podman build, where I saw that the open-files limit was 1024. Taking the checkpoint via podman run ... checkpoint.sh worked fine because the default limit was much higher. With the 1024 limit, on some machines I saw the failure described above, and on others I saw a Java exception that clearly stated that a file couldn't be opened, which gave me a clue as to what could be wrong.

It looks like we're probably very close to the 1024 limit across various environments and Java versions, and this particular combination (Daytrader 10 + embedded JMS on Java 11 on Power) is the lucky loser.

I was able to build the checkpoint image by first increasing the ulimit on the host, then doing podman build --ulimit=nofile=8192 ....
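For anyone who wants to check how close a process is to its open-files limit before checkpointing, here is a minimal sketch. It is not from this thread's investigation: it assumes a Linux /proc filesystem, and the helper names (read_nofile_limit, open_fd_count, fd_headroom) are hypothetical. Pass the PID of the server process to inspect it instead of the script itself.

```python
import os


def read_nofile_limit(pid="self"):
    """Return (soft, hard) 'Max open files' limits read from /proc/<pid>/limits.

    A value of None means the kernel reports the limit as 'unlimited'.
    """
    with open(f"/proc/{pid}/limits") as f:
        for line in f:
            if line.startswith("Max open files"):
                # Fields after the three-word name are: soft, hard, units.
                soft, hard = line.split()[3:5]
                return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise RuntimeError("'Max open files' entry not found")


def open_fd_count(pid="self"):
    """Count entries in /proc/<pid>/fd.

    For pid='self' the count includes the descriptor opened to list the directory.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_headroom(pid="self"):
    """Hypothetical helper: report descriptors in use against the soft limit."""
    soft, hard = read_nofile_limit(pid)
    used = open_fd_count(pid)
    remaining = None if soft is None else soft - used
    return {"used": used, "soft": soft, "hard": hard, "remaining": remaining}


if __name__ == "__main__":
    info = fd_headroom()
    print(f"open fds: {info['used']}, soft limit: {info['soft']}, hard limit: {info['hard']}")
    if info["remaining"] is not None and info["remaining"] < 64:
        # The threshold of 64 is an arbitrary illustrative value.
        print("warning: close to the open-files limit")
```

Reading another user's process under /proc may need elevated privileges, so running this inside the container as the same user as the server is the simplest case.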

I'll look into whether or not we can improve the error handling in CRIU in this part of its code to report the actual cause of the failure.

ymanton · Jun 10 '25 14:06