How can I avoid a hang on error during CRIU checkpoint?

Open singh264 opened this issue 1 year ago • 4 comments

The steps below appear to recreate the hang on error during a CRIU checkpoint:

  1. Obtain an Ubuntu 22.04 machine
  2. Install CRIU on the machine
  3. Download a build with an OpenJ9 implementation on the machine
  4. Create a file with vi Demo.java on the machine
  5. Copy the following code into the file on the machine
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.io.PrintStream;
import java.io.File;
import java.io.*;
import org.eclipse.openj9.criu.CRIUSupport;


public class Demo {

	public static void main(String args[]) throws Throwable {
		System.out.println("pre -checkpoint");
		checkPointJVM("cpData");
		System.out.println("post -checkpoint");
	}

	public static void checkPointJVM(String path) {
		if (CRIUSupport.isCRIUSupportEnabled()) {
			new CRIUSupport(Paths.get(path))
				.setLeaveRunning(false)
				.setShellJob(true)
				.setFileLocks(true)
				.checkpointJVM();
		} else {
			System.err.println("CRIU is not enabled\n" + CRIUSupport.getErrorMessage());
		}
	}
}
  6. Create a directory with mkdir cpData on the machine
  7. Compile the code with javac Demo.java on the machine
  8. Recreate the hang on error with java -XX:+EnableCRIUSupport Demo on the machine
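
The `mkdir cpData` step matters because the checkpoint images are written into that directory. As a minimal sketch (the class name `ImageDirCheck` and the method `ensureImageDir` are hypothetical helpers, not part of OpenJ9), the demo could create and verify the directory itself before calling into CRIUSupport:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: not part of the OpenJ9 CRIUSupport API.
final class ImageDirCheck {

    /** Creates the image directory if it is missing and verifies that it is a writable directory. */
    static Path ensureImageDir(Path dir) {
        try {
            // No-op when the directory already exists; fails if a regular file is in the way.
            Files.createDirectories(dir);
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot create image directory " + dir, e);
        }
        if (!Files.isDirectory(dir) || !Files.isWritable(dir)) {
            throw new IllegalStateException("Image directory is not writable: " + dir);
        }
        return dir;
    }

    public static void main(String[] args) {
        Path dir = ensureImageDir(Path.of(args.length > 0 ? args[0] : "cpData"));
        System.out.println("image directory ready: " + dir.toAbsolutePath());
    }
}
```

The returned path could then be passed to the CRIUSupport call in place of `Paths.get(path)`. Note that a directory created while running under sudo is owned by root, which may matter for a later non-root run.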

singh264 avatar Jun 13 '24 18:06 singh264

It seems the next step is to reproduce the hang on error during checkpoint and to find the root cause of the problem that needs to be addressed.

singh264 avatar Aug 01 '24 20:08 singh264

Dear @tajila and @babsingh, I would like to be assigned to this issue in order to start work on it. I look forward to your response.

singh264 avatar May 28 '25 21:05 singh264

I have not received a response, so @pshipton, could you please assign this issue to me so that I can address it?

singh264 avatar Jun 04 '25 15:06 singh264

@tajila ?

pshipton avatar Jun 05 '25 00:06 pshipton

@tajila I hope you are doing well. My attempt to reproduce the hang was unsuccessful; the terminal output below shows the details. Is the hang still expected, or should these steps now complete the CRIU checkpoint without any error?

singh264@linux:~$ ls
ant-lib  cpData  Demo.java  mkdocker.sh  openj9_build
singh264@linux:~$ 
singh264@linux:~$ cat Demo.java 
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.io.PrintStream;
import java.io.File;
import java.io.*;
import org.eclipse.openj9.criu.CRIUSupport;


public class Demo {
	public static void main(String args[]) throws Throwable {
		System.out.println("pre -checkpoint");
		checkPointJVM("cpData");
                System.out.println("post -checkpoint");

	}
	public static void checkPointJVM(String path) {
		if (CRIUSupport.isCRIUSupportEnabled()) {
                	CRIUSupport.getCRIUSupport()
				.setImageDir(Paths.get(path))
                		.setLeaveRunning(false)
                		.setShellJob(true)
                		.setFileLocks(true)
				// remove this if running as a root user
				.setUnprivileged(true)
				.checkpointJVM();
                 } else {
			System.err.println("CRIU is not enabled\n" + CRIUSupport.getErrorMessage());
                 }

	}
}
singh264@linux:~$ 
singh264@linux:~$ javac -version
javac 21.0.8-internal
singh264@linux:~$ 
singh264@linux:~$ javac Demo.java
singh264@linux:~$ 
singh264@linux:~$ ls
ant-lib  cpData  Demo.class  Demo.java  mkdocker.sh  openj9_build
singh264@linux:~$ 
singh264@linux:~$ java -version
openjdk version "21.0.8-internal" 2025-07-15
OpenJDK Runtime Environment (build 21.0.8-internal-adhoc.singh264.openj9-openjdk-jdk21)
Eclipse OpenJ9 VM (build master-5f6a02d948, JRE 21 Linux aarch64-64-Bit Compressed References 20250627_000000 (JIT enabled, AOT enabled)
OpenJ9   - 5f6a02d948
OMR      - 41204d221
JCL      - ad709377fba based on jdk-21.0.8+6)
singh264@linux:~$ 
singh264@linux:~$ java -XX:+EnableCRIUSupport Demo
pre -checkpoint
JVMJITM048W AOT load and compilation disabled pre-checkpoint and post-restore.
Exception in thread "main" org.eclipse.openj9.criu.SystemCheckpointException: Could not dump the JVM processes, err=-52
	at openj9.criu/org.eclipse.openj9.criu.CRIUSupport.checkpointJVM(CRIUSupport.java:593)
	at Demo.checkPointJVM(Demo.java:27)
	at Demo.main(Demo.java:14)
Caused by: openj9.internal.criu.SystemCheckpointException: Could not dump the JVM processes, err=-52
	at java.base/openj9.internal.criu.InternalCRIUSupport.checkpointJVMImpl(Native Method)
	at java.base/openj9.internal.criu.InternalCRIUSupport.checkpointJVM(InternalCRIUSupport.java:1151)
	at openj9.criu/org.eclipse.openj9.criu.CRIUSupport.checkpointJVM(CRIUSupport.java:587)
	... 2 more

singh264 avatar Jun 30 '25 19:06 singh264

Likely, there is an issue with privileges. One thing you can try is to run with sudo and set .setUnprivileged(false). Otherwise, you can investigate the issue by setting the log level (setLogLevel) to 4 and looking at the logs.
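
For context on `err=-52`: assuming the value in the exception message is the return code of libcriu's dump call (an assumption, not confirmed in this thread), it would be a negative Linux errno. The sketch below maps the return codes documented for libcriu to readable text; the class `CriuDumpError` is a hypothetical helper, not an OpenJ9 or CRIU API.

```java
import java.util.Map;

// Hypothetical helper: decodes negative errno values that libcriu documents for a dump request.
// The numbers are the generic Linux errno values (x86_64 and aarch64).
final class CriuDumpError {

    private static final Map<Integer, String> KNOWN = Map.of(
            -52, "EBADE: the CRIU RPC reported a failure; the CRIU log should hold the details",
            -111, "ECONNREFUSED: unable to connect to CRIU",
            -70, "ECOMM: unable to send or receive a message to/from CRIU",
            -22, "EINVAL: CRIU does not support this type of request",
            -74, "EBADMSG: unexpected response from CRIU");

    /** Returns a description for a known code, or a generic message otherwise. */
    static String describe(int err) {
        return KNOWN.getOrDefault(err, "unrecognized return code " + err);
    }

    public static void main(String[] args) {
        int err = args.length > 0 ? Integer.parseInt(args[0]) : -52;
        System.out.println(describe(err));
    }
}
```

Under that assumption, -52 only says that CRIU itself reported a failure, which is why the CRIU log is the next place to look.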

tajila avatar Jun 30 '25 20:06 tajila

Logs should help explain the error seen during checkpoint and move us towards a checkpoint without any errors. @tajila, would you mind confirming that this was your intention as well?
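
A rough sketch of how the logs could be inspected, assuming the checkpoint was configured with `.setLogLevel(4)` and `.setLogFile("criu.log")`. The file name `criu.log`, its location in the image directory, and the `Error (` / `Warn ` line prefixes are assumptions about a typical CRIU setup; `CriuLogScan` is a hypothetical helper.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: filters a CRIU log down to the lines that look like errors or warnings.
final class CriuLogScan {

    /** Keeps only lines containing the assumed CRIU markers "Error (" or "Warn ". */
    static List<String> problems(List<String> logLines) {
        return logLines.stream()
                .filter(line -> line.contains("Error (") || line.contains("Warn "))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location; pass the real log path as the first argument.
        Path log = Path.of(args.length > 0 ? args[0] : "cpData/criu.log");
        problems(Files.readAllLines(log)).forEach(System.out::println);
    }
}
```

At log level 4 the full log is long, so starting from the first reported error is usually the quickest way to narrow down a failed dump.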

singh264 avatar Jul 02 '25 19:07 singh264

By default the code runs in privileged mode, which expects the user to be root. Based on the terminal output below, no CRIU checkpoint error occurs even though I am a non-root user on my machine. Could you confirm that the expected behaviour is to detect this discrepancy and report an error message to the user?

singh264@linux:~$ ls
ant-lib  cpData  criuOutput  Demo.java  mkdocker.sh  openj9_build
singh264@linux:~$ 
singh264@linux:~$ cat Demo.java 
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.io.PrintStream;
import java.io.File;
import java.io.*;
import org.eclipse.openj9.criu.CRIUSupport;


public class Demo {
	public static void main(String args[]) throws Throwable {
		System.out.println("pre -checkpoint");
		checkPointJVM("cpData");
                System.out.println("post -checkpoint");

	}
	public static void checkPointJVM(String path) {
		if (CRIUSupport.isCRIUSupportEnabled()) {
                	CRIUSupport.getCRIUSupport()
				.setImageDir(Paths.get(path))
                		.setLeaveRunning(false)
                		.setShellJob(true)
                		.setFileLocks(true)
				.checkpointJVM();
                 } else {
			System.err.println("CRIU is not enabled\n" + CRIUSupport.getErrorMessage());
                 }

	}
}
singh264@linux:~$ 
singh264@linux:~$ echo $JAVA_HOME; $JAVA_HOME/bin/javac -version
/home/singh264/openj9_build/openj9-openjdk-jdk21/build/linux-aarch64-server-release/images/jdk
javac 21.0.8-internal
singh264@linux:~$ 
singh264@linux:~$ $JAVA_HOME/bin/javac Demo.java 
singh264@linux:~$ 
singh264@linux:~$ ls
ant-lib  cpData  criuOutput  Demo.class  Demo.java  mkdocker.sh  openj9_build
singh264@linux:~$ 
singh264@linux:~$ sudo $JAVA_HOME/bin/java -XX:+EnableCRIUSupport Demo 
pre -checkpoint
Killed
singh264@linux:~$ 
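
One way an application could detect a mismatch between the user and the requested mode before checkpointing is to read the effective UID from `/proc/self/status` on Linux. This is only a caller-side sketch; `PrivilegeCheck` and its methods are hypothetical and do not reflect how OpenJ9 handles this.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: not part of the OpenJ9 CRIUSupport API.
final class PrivilegeCheck {

    /** Extracts the effective UID from /proc/self/status lines ("Uid: real effective saved fs"). */
    static long effectiveUid(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("Uid:")) {
                String[] fields = line.trim().split("\\s+");
                return Long.parseLong(fields[2]); // fields[1] is the real UID, fields[2] the effective UID
            }
        }
        throw new IllegalArgumentException("no Uid line found");
    }

    /** True when a non-root user requests privileged mode, the combination discussed in this thread. */
    static boolean privilegedModeLikelyToFail(long euid, boolean unprivileged) {
        return euid != 0 && !unprivileged;
    }

    public static void main(String[] args) throws IOException {
        long euid = effectiveUid(Files.readAllLines(Path.of("/proc/self/status")));
        System.out.println("effective uid: " + euid);
        if (privilegedModeLikelyToFail(euid, false)) {
            System.err.println("non-root user: consider setUnprivileged(true) or running with sudo");
        }
    }
}
```

When the JVM is started with sudo, the effective UID is 0, so the check passes even though the login user is not root.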

singh264 avatar Jul 03 '25 19:07 singh264

My apologies: the user was in fact root, because I ran the CRIU checkpoint code with sudo. This issue can therefore be closed, as it was created assuming the default configuration, in which the code expects the user to be root. I believe a good follow-up issue would be to avoid the error, detailed below, that occurs during CRIU checkpoint as a non-root user.

singh264@linux:~$ ls
ant-lib  cpData  Demo.java  mkdocker.sh  openj9_build
singh264@linux:~$ 
singh264@linux:~$ cat Demo.java 
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.io.PrintStream;
import java.io.File;
import java.io.*;
import org.eclipse.openj9.criu.CRIUSupport;


public class Demo {
	public static void main(String args[]) throws Throwable {
		System.out.println("pre -checkpoint");
		checkPointJVM("cpData");
                System.out.println("post -checkpoint");

	}
	public static void checkPointJVM(String path) {
		if (CRIUSupport.isCRIUSupportEnabled()) {
                	CRIUSupport.getCRIUSupport()
				.setImageDir(Paths.get(path))
                		.setLeaveRunning(false)
                		.setShellJob(true)
                		.setFileLocks(true)
				// remove this if running as a root user 
				.setUnprivileged(true)
				.checkpointJVM();
                 } else {
			System.err.println("CRIU is not enabled\n" + CRIUSupport.getErrorMessage());
                 }

	}
}
singh264@linux:~$ 
singh264@linux:~$ javac -version
javac 21.0.8-internal
singh264@linux:~$ 
singh264@linux:~$ javac Demo.java
singh264@linux:~$ 
singh264@linux:~$ ls
ant-lib  cpData  Demo.class  Demo.java  mkdocker.sh  openj9_build
singh264@linux:~$ 
singh264@linux:~$ java -version
openjdk version "21.0.8-internal" 2025-07-15
OpenJDK Runtime Environment (build 21.0.8-internal-adhoc.singh264.openj9-openjdk-jdk21)
Eclipse OpenJ9 VM (build master-5f6a02d948, JRE 21 Linux aarch64-64-Bit Compressed References 20250627_000000 (JIT enabled, AOT enabled)
OpenJ9   - 5f6a02d948
OMR      - 41204d221
JCL      - ad709377fba based on jdk-21.0.8+6)
singh264@linux:~$ 
singh264@linux:~$ java -XX:+EnableCRIUSupport Demo
pre -checkpoint
JVMJITM048W AOT load and compilation disabled pre-checkpoint and post-restore.
Exception in thread "main" org.eclipse.openj9.criu.SystemCheckpointException: Could not dump the JVM processes, err=-52
	at openj9.criu/org.eclipse.openj9.criu.CRIUSupport.checkpointJVM(CRIUSupport.java:593)
	at Demo.checkPointJVM(Demo.java:27)
	at Demo.main(Demo.java:14)
Caused by: openj9.internal.criu.SystemCheckpointException: Could not dump the JVM processes, err=-52
	at java.base/openj9.internal.criu.InternalCRIUSupport.checkpointJVMImpl(Native Method)
	at java.base/openj9.internal.criu.InternalCRIUSupport.checkpointJVM(InternalCRIUSupport.java:1151)
	at openj9.criu/org.eclipse.openj9.criu.CRIUSupport.checkpointJVM(CRIUSupport.java:587)
	... 2 more

singh264 avatar Jul 03 '25 19:07 singh264

A non-root user's behaviour in unprivileged mode during CRIU checkpoint should be the same as a root user's behaviour in privileged mode, which is no error. Therefore, @tajila or @JasonFengJ9, can one of you please confirm that it is appropriate to create a new GitHub issue to fix the checkpoint error for a non-root user, which is detailed in my previous comment?

singh264 avatar Jul 04 '25 20:07 singh264

I would like to clarify that the observations above were made on a local VirtualBox arm64 Linux machine running on a Mac. Since that virtual machine was not set up properly to run CRIU tests, I am providing the output of my unsuccessful attempt to recreate the hang during CRIU checkpoint on a GitHub Codespaces x86_64 Linux machine:

@singh264 ➜ /workspaces/codespaces-blank $ ls
Demo.java  cpData  criuOutput  mkdocker.sh  openj9-build
@singh264 ➜ /workspaces/codespaces-blank $ 
@singh264 ➜ /workspaces/codespaces-blank $ cat Demo.java 
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.io.PrintStream;
import java.io.File;
import java.io.*;
import org.eclipse.openj9.criu.CRIUSupport;


public class Demo {
        public static void main(String args[]) throws Throwable {
                System.out.println("pre -checkpoint");
                checkPointJVM("cpData");
                System.out.println("post -checkpoint");

        }
        public static void checkPointJVM(String path) {
                if (CRIUSupport.isCRIUSupportEnabled()) {
                        CRIUSupport.getCRIUSupport()
                                 .setImageDir(Paths.get(path))
                                .setLeaveRunning(false)
                                .setShellJob(true)
                                .setFileLocks(true)
                                .checkpointJVM();
                } else {
                        System.err.println("CRIU is not enabled\n" + CRIUSupport.getErrorMessage());
                }
        }
}
@singh264 ➜ /workspaces/codespaces-blank $ 
@singh264 ➜ /workspaces/codespaces-blank $ ./openj9-build/openj9-openjdk-jdk21/build/linux-x86_64-server-release/images/jdk/bin/javac Demo.java 
@singh264 ➜ /workspaces/codespaces-blank $ 
@singh264 ➜ /workspaces/codespaces-blank $ sudo ./openj9-build/openj9-openjdk-jdk21/build/linux-x86_64-server-release/images/jdk/bin/java -XX:+EnableCRIUSupport Demo
pre -checkpoint
Killed
@singh264 ➜ /workspaces/codespaces-blank $ 
@singh264 ➜ /workspaces/codespaces-blank $ sudo criu restore -D ./cpData -v2 --shell-job
JVMJITM048W AOT load and compilation disabled pre-checkpoint and post-restore.
post -checkpoint

@tajila would you mind confirming that the above output, together with the scope defined in the description of this issue, is sufficient to close this issue?

singh264 avatar Aug 18 '25 03:08 singh264