openj9 Question regarding a corner case during restore

trafficstars

I have a question regarding dealing with a pretty obscure corner case during restore. This came about because of issues we had seen when checkpointing on one cpu impl and restore on another (eg amd -> intel or vice versa). I realized that with AOT, in a load run we verify that the CPU features match the set of features used in the compile run. With checkpoint/restore, we're just assuming that portable code means it'll work on all machines. Because we use a very limited set of features, it is very unlikely that we're going to come across a machine that does not have one of those features. However, I suppose if one was to restore on a sufficiently old machine, it could result in a SIGILL.

So I've started working on a prototype wherein the JIT can coordinate with the VM to abort the restore. However, the current failure path involves the VM throwing an exception. The problem though is that there can be code that handles that exception. If the method that invokes the API to checkpoint was compiled, we're not preventing JIT'd code from executing.

Technically we have mechanisms that are used under FSD that allow a thread to bail from JIT'd code and only run interpreted but that feels like an awfully big hammer for this problem that is after all, very obscure. A simpler approach would be for the JVM to just exit immediately, though that's not as graceful.

As such, what do we think is the best approach for this?

Sep 13 '22 19:09 dsouzai

My preference would be to restore the process but throw away the JIT code with a big warning that it happened. Presumably the JIT will then warm back up but at least the process will be there. Though a bigger hammer, it at least results in a running process. I'm not sure doing more to make a hopefully obscure case 'better' is worth the investment, though if there are simple improvements to significantly lighten the penalty, that may still be worth it.

Sep 14 '22 14:09 mstoodle

@mstoodle's answer makes sense in the case of rare deployments or desktop usage but I think for highly-scaled deployments (ie containers in k8s), I'd rather have the process exit than run slowly.

If I went thru the trouble of preparing a checkpoint and deploying it, I want to know immediately if it can't run and that I'm not getting the benefit from it. "Slow" should be an error case for this style of deployment.

For that reason, I'd propose exiting if a system doesn't meet the portability requirements.

Sep 14 '22 16:09 DanHeidinga

There are some non-trivial technical difficulties associated with continuing the process but throwing away the JIT'd code. For one, we would need to run in a mode similar to (but not exactly) FSD since we would need the ability to do an involuntary OSR out of JIT'd code. This not only means that image generated by CRIU is going to be bigger (because we need more data for all the OSR infra) but also the SCC Is going to have to be bigger (something we had to do when we got FSD working with AOT). The criu checkpoint image being bigger is known to slow down startup. Additionally, we would need to add the infra to completely throw away the entire code cache and start again; I don't know to what extent something like this exists right now so that would need to be implemented as well.

That said, because this CPU feature validation happens in the restore hooks, all java threads are stopped so theoretically the VM can transition all the java threads to the interpreter (since they should all be at a well defined yield point and therefore should have OSR data).

@tajila @gacholio how easy (or difficult) would it be to add the ability to transition all java threads (including the one in the CRIU helper JNI code) to the interpreter if the JIT'd code can't be executed?

Sep 14 '22 16:09 dsouzai

Frankly, I don't see much difference between the warning and the hard failure. If we discard all of the JIT code, you've already lost most of the benefit of the image.

I would find it hard to justify the work required for this since any real case would require rebuilding the image anyway. Generating slower/larger JIT code in all cases in order to support this edge case also seems a waste.

Sep 14 '22 17:09 gacholio

I guess I was thinking from the perspective of a production instance that is created to handle load where only some of the available machines might be "too old". As a user, I might prefer a slow response to no response.

If the failure mode is exit, what is the recommended remedy? Remove all the old hardware from your machine pool?

Sep 14 '22 18:09 mstoodle

I recall in OCP, for example, you could choose what nodes you wanted deployments to run on. If only some of the machines are too old, then you could just deploy on the ones that aren't. Otherwise, if all the machines are too old, then they might have to build the image on one of those old machines.

I suppose a subtlety like that deserves documentation so that users who want to use all machines are aware that the image needs to be built on the oldest machine they have in their pool.

Sep 14 '22 18:09 dsouzai

From a VM perspective, if the compiled code is prepared for external OSR then I don't see any obvious issues - we do exactly this for various debug things like single step.

Conveniently, checkpoints are already created with all threads at OSR safe points.

Sep 14 '22 19:09 gacholio

With checkpoint/restore, we're just assuming that portable code means it'll work on all machines.

If it doesn't, isn't that the problem? The point of generating portable code is for it to work on other machines. Why would that portable code use any features that might be absent on CPUs that the JIT normally supports?

Sep 14 '22 19:09 jdmpapin

Why would that portable code use any features that might be absent on CPUs that the JIT normally supports?

Well, AFAICT the support statement doesn't speak about CPUs. As such, we've hand picked a set of features for portable code. If a machine does not have any of those hand picked features then we won't use it. However, if the checkpoint run uses one of those features but the restore is run on a machine that doesn't have that feature, then that's where we might run into an issue.

I suppose this is analogous to static compilers; they target some "old" CPU to be as broad as possible but it is possible to take such a binary and run into a super old machine (lets assume all libs are statically linked) and run into the same problem.

Maybe that set of features above is too permissive and is something that needs to be reevaluated; maybe it makes sense for Portable AOT, but not for Portable Code in general. @mpirvu do you have an opinion on this (since you were involved in picking those features when Portable SCC was first introduced)?

Sep 14 '22 20:09 dsouzai

For portableAOT we require to have the features available in the SandyBridge architecture.

  const uint32_t customFeatures [] = {OMR_FEATURE_X86_FPU, OMR_FEATURE_X86_CX8, OMR_FEATURE_X86_CMOV,
                                       OMR_FEATURE_X86_MMX, OMR_FEATURE_X86_SSE, OMR_FEATURE_X86_SSE2,
                                       OMR_FEATURE_X86_SSSE3, OMR_FEATURE_X86_SSE4_1, OMR_FEATURE_X86_POPCNT,
                                       OMR_FEATURE_X86_SSE3, OMR_FEATURE_X86_AESNI, OMR_FEATURE_X86_AVX
                                      };

The reason behind this decision is that SandyBridge is quite old (2011) and we don't expect people to run on machines older than that. There is also the implicit assumption that features available in SandyBridge will continue to be available in the newer processors.

Sep 16 '22 12:09 mpirvu

openj9 openj9 copied to clipboard

Question regarding a corner case during restore

openj9
openj9 copied to clipboard