Add support for CRaC to the Java Agent
Is your feature request related to a problem? Please describe.
Determine how to allow the Java Agent to support CRaC
Feature Description
Our customer is looking to implement CRaC within their environments to reduce warm-up time when scaling. This is very important to them, and they are now finding that they can’t even take a snapshot when the Java Agent is enabled, because of the number of files it holds open.
Their application is Spring Boot based, which now supports CRaC.
Describe Alternatives
N/A
Additional context
Original FR - NR-182805
Priority
Must Have
https://new-relic.atlassian.net/browse/NR-216517
Moving this back to the backlog as we've run into some hurdles. I'll describe my findings here.
For background, an essential aspect of CRaC is that no file handles can be open during the checkpointing process. In my testing I ran the spring-petclinic app with and without the agent attached and found that with the agent attached we had numerous file handles open. I was unable to checkpoint a running JVM with the agent attached, so I never got as far as attempting a restore with the agent attached. My findings here are therefore limited to checkpointing.
Note: If you start the JVM with the -Djdk.crac.collect-fd-stacktraces=true option, any exceptions thrown during checkpointing due to open handles will include a stacktrace of where the handle was opened. If an open handle has no accompanying stacktrace, it was created in native code.
The open handles I found with the agent attached were:
- Our .old and .new class files before and after weaving. In this method we are not closing these files after writing them. Sometimes they appear to get closed on their own, sometimes not. This is a simple matter of closing those files when we are done.
- Our log file. This was a simple matter of implementing the API interface in the appropriate place to close the log file and re-open it when needed. The first wrinkle would be making sure we re-open correctly when the underlying system may have changed and the previous log file is no longer there. The second wrinkle is what, if anything, to do with messages that come in after we have closed the log file but before checkpointing is complete, especially if checkpointing never finishes successfully and the JVM stays running. I did not explore this second wrinkle.
- Backend collector connection. Again, a simple matter of implementing the API (perhaps here) to close the connection and re-open it.
- Log config inside of newrelic.jar. This is an issue that I did not work through to a solution. The file is actually opened by Log4J, not by us, during this call. There is a chance we may be able to get around that, or we may have to involve Log4J. I did not try to solve this problem yet, because the next problem became higher priority.
- Temp instrumentation JARs. During agent premain startup we add several agent-related instrumentation JARs to the bootstrap class loader. The running JVM appears to be holding on to those file handles in native code and we are unable to close them successfully. We have engaged the CRaC team at Azul to consult.
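
For reference, here is a minimal sketch of the close-and-re-open pattern described for the log file above. It is not the agent's actual logging code. `CheckpointAware` and `CheckpointAwareLogFile` are hypothetical names, and the local `CheckpointAware` interface is only a dependency-free stand-in for the CRaC API's `Resource` callbacks (`beforeCheckpoint` / `afterRestore`). Buffering messages in memory while the file is closed is just one possible answer to the second wrinkle, shown here as an assumption rather than a decision.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the CRaC Resource callbacks, so the sketch
// compiles without the CRaC API on the classpath.
interface CheckpointAware {
    void beforeCheckpoint();
    void afterRestore();
}

// Hypothetical log sink that releases its file handle before a checkpoint
// and re-opens it (re-creating the file if needed) after a restore.
final class CheckpointAwareLogFile implements CheckpointAware {
    private final Path path;
    private final List<String> pending = new ArrayList<>(); // messages received while closed
    private BufferedWriter writer; // null while the file handle is released

    CheckpointAwareLogFile(Path path) {
        this.path = path;
        open();
    }

    synchronized void log(String message) {
        if (writer == null) {
            // Assumption: hold messages in memory until the file is open again.
            pending.add(message);
            return;
        }
        try {
            writer.write(message);
            writer.newLine();
            writer.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    synchronized boolean isOpen() {
        return writer != null;
    }

    @Override
    public synchronized void beforeCheckpoint() {
        if (writer == null) {
            return;
        }
        try {
            writer.close(); // releases the file handle
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            writer = null;
        }
    }

    @Override
    public synchronized void afterRestore() {
        open();
        // Replay anything that arrived while the file was closed.
        List<String> buffered = new ArrayList<>(pending);
        pending.clear();
        for (String message : buffered) {
            log(message);
        }
    }

    // Opens in append mode; creates the directory and file if they are gone.
    private void open() {
        try {
            Path parent = path.getParent();
            if (parent != null && Files.notExists(parent)) {
                Files.createDirectories(parent);
            }
            writer = Files.newBufferedWriter(path, StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Test helper: reads back what is currently in the file.
    List<String> linesOnDisk() {
        try {
            return Files.readAllLines(path, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a real integration the class would implement the CRaC `Resource` interface directly and be registered with `Core.getGlobalContext().register(...)`. If checkpointing fails and the JVM keeps running, something would still need to call the re-open path, which this sketch does not address.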