openj9
Liberty InstantOn checkpoint failed on Java11 UBI9-min on Power when using daytrader10 app
I performed a checkpoint on the daytrader10 app with embedded JMS on a Power VM running Ubuntu 22.04, and it failed with the following errors. I do not see this problem when checkpointing daytrader10 with MQ on Power.
CWWKE0962E: The server checkpoint request failed. The following output is from the CRIU /logs/checkpoint/checkpoint.log file that contains details on why the checkpoint failed.
Warn (criu/kerndat.c:1153): $XDG_RUNTIME_DIR not set. Cannot find location for kerndat file
Warn (criu/kerndat.c:1153): $XDG_RUNTIME_DIR not set. Cannot find location for kerndat file
Warn (compel/src/lib/infect.c:133): Unable to interrupt task: 1138 (Operation not permitted)
Error (criu/parasite-syscall.c:478): Can't retrieve FDs from socket
Error (compel/src/lib/infect-rpc.c:51): Communication error, this is not the ack we expected
Error (criu/cr-dump.c:1674): Dump files (pid: 1025) failed with -1
pie: 1025: Trimmed message received (12/-104)
Error (criu/cr-dump.c:2098): Dumping FAILED.
- Kernel level: 5.15.0-131-generic
- OS
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
I posted the logs with the CRIU trace in the instanton-group-work Slack channel.
@tajila
I only see this problem when checkpointing the daytrader10 app (with embedded JMS) using the OL/WL Java 11 UBI9 image.
This appears to be caused by the ulimit for the number of open files (nofile) being too low. I was only able to reproduce this by attempting the checkpoint via podman build, where I saw that the open-file limit was 1024. Taking the checkpoint via podman run ... checkpoint.sh worked fine because the default limit was much higher. With the 1024 limit, on some machines I saw the failure described above, and on others I saw a Java exception that clearly stated that a file couldn't be opened, which gave me a clue as to what could be wrong.
It looks like we're probably very close to the 1024 limit across various environments and Java versions, and this particular combination (Daytrader 10 + embedded JMS on Java 11 on Power) is the lucky loser.
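For anyone who wants to check how close a process is to its limit, here is a rough sketch, assuming a Linux system with /proc mounted. The helper names (parse_max_open_files, fd_usage) are made up for illustration and are not part of Liberty, OpenJ9, or CRIU. It reads the "Max open files" row of /proc/<pid>/limits and counts the entries in /proc/<pid>/fd.

```python
import os


def parse_max_open_files(limits_text):
    """Extract the (soft, hard) 'Max open files' values from /proc/<pid>/limits text.

    Values are returned as strings because the kernel may report 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: "Max", "open", "files", <soft>, <hard>, <units>
            soft, hard = line.split()[3:5]
            return soft, hard
    raise ValueError("no 'Max open files' line found")


def fd_usage(pid="self"):
    """Return (open_fd_count, soft_limit, hard_limit) for a process (Linux only)."""
    with open(f"/proc/{pid}/limits") as f:
        soft, hard = parse_max_open_files(f.read())
    # Each entry in /proc/<pid>/fd is one open file descriptor.
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    return open_fds, soft, hard
```

Calling fd_usage(pid) with the PID of the Java process inside the container (reading another process's /proc entries may need matching user or root privileges) just before the checkpoint should show whether the open descriptor count is approaching the soft limit.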
I was able to build the checkpoint image by first increasing the ulimit on the host, then doing podman build --ulimit=nofile=8192 ....
I'll look into whether or not we can improve the error handling in CRIU in this part of its code to report the actual cause of the failure.