kafka KAFKA-14244: Add guard against accidental calls to halt JVM during testing

Jira

Uses an automatically-registered JUnit 5 extension that installs a SecurityManager which intercepts attempts to terminate the JVM that appear to have come from test cases, and then reports those attempts by causing tests to fail. If the test that attempted to terminate the JVM is running at the time of the attempt, that test will fail. If not, a random test will fail at a later point (in order to fail the CI build and surface the issue), with a message reporting that a prior test appears to have leaked a thread which has attempted to terminate the JVM after that test completed.

The InheritableThreadLocal class is used to track which tests attempt to terminate the JVM, including cases where the attempt occurs in a separate thread, and even when the attempt has occurred after the test has already completed.

Although the SecurityManager API was deprecated and slated for removal in JEP-411, there is no existing alternative for intercepting attempts to terminate the JVM (see https://bugs.openjdk.org/browse/JDK-8199704). Once an alternative becomes available, we can use it instead of the SecurityManager API.

This is heavily inspired by the system-rules library, with some modifications to support parallel testing.

It's worth noting that there is a potentially much-simpler fix where we modify the Exit wrapper classes to use InheritableThreadLocal instances to track custom procedures. However, this does not automatically prevent the JVM from being accidentally terminated during testing (tests still have to remember to use the wrapper class), and it does not provide a way to fail the build when unexpected attempts to terminate the JVM take place, potentially swallowing dangerous failures during testing and CI.

Committer Checklist (excluded from commit message)

[ ] Verify design and implementation
[ ] Verify test coverage and CI build status
[ ] Verify documentation (including upgrade notes)

Sep 20 '22 15:09 C0urante

@C0urante , interesting solution. But I'm wondering how you test it and prove this can really catch Exit and report error?

Sep 21 '22 02:09 showuon

This is a good idea and definitely an improvement over what already exists. But I think that the downside of taking a dependency on something that is deprecated as soon as JDK 18 is quite significant. Perhaps, we can solve the underlying problem in a different way!

Let me explain my point of view on this.

Some of our tests spawn new threads. The expectation for the happy case path is that these threads are terminated by the test gracefully at the end of test execution. In case when these threads are not terminated, they may cause erroneous/undeterministic behaviour such as calling System.Exit().

Instead of focusing on Exit() (which is not the underlying cause, it's a side effect), could we try to formulate a solution which ensures that the test would fail if threads created by that test are not terminated at the end of the execution?

I don't have a proposal towards a solution but there are ideas that we could explore such as ensuring all threads created by a test have a property in their thread local (this could be set in @before and removed in @after).

Thoughts?

Sep 21 '22 09:09 divijvaidya

@showuon Thanks for taking a look. I did some local testing by injecting calls to System::exit into existing unit tests, both on the actual test thread and on separate threads with a little sleep time added to allow the test case to complete before attempting to terminate the JVM. If this PR seems promising I can take a stab at adding some testing logic for the extension/security manager; LMKWYT.

Sep 21 '22 16:09 C0urante

@divijvaidya Thank you for your thoughts. I don't quite agree that attempting to terminate the JVM is a symptom rather than the root cause, since there are cases where these attempts occur without threads being leaked (one such case was touched on in https://github.com/apache/kafka/pull/9698). In addition, the main issue here is that these attempts tend to crop up only during CI builds and not locally, in which case the cost of allowing the JVM to be terminated is higher as debugging is harder and the build attempt is rendered useless.

I'm also not very worried about the deprecation of the SecurityManager API since it's unlikely that the API will be fully removed without adding in an alternative way to intercept calls to terminate the JVM; this use case is called out repeatedly in JEP-411 as a valuable one. I've updated the PR to be safe with Java 18 with some small conditional logic in the Gradle build file. This should lay some decent groundwork for the future if we want to apply different means of intercepting calls to terminate the JVM depending on which Java version we're building with.

That said, if you have the time to develop and implement an alternative proposal, I'd be happy to review a POC. I'm not certain we really can track thread creation and termination during tests without a serious refactoring of the code base, but if there's a lightweight way to do this, it would probably be worth considering either instead of or in addition to hardening our testing logic against JVM termination.

Sep 21 '22 16:09 C0urante

@C0urante Thank you for explaining. I stand corrected that thread leaks are the only cause of System.Exit. Your proposed approach sounds good to me. For the thread leak scenario, we can perhaps start using TestUtils.verifyNoUnexpectedThreads in modules other than core to as an additional guard.

Sep 22 '22 11:09 divijvaidya

@showuon I've added some tests for the custom extension that should demonstrate its effectiveness.

I should note that I also experimented with an approach that leveraged JUnit 5's test class ordering API to create a test that runs after all others have completed and checks for leaked threads that have tried to terminate the JVM, instead of performing that check after every test and failing potentially-unrelated tests in order to surface the problem. Unfortunately, I couldn't find a way to make this work with parallel testing, since the test would only be run on a single executor, and would often finish before other tests on other executors.

Sep 23 '22 20:09 C0urante

@C0urante You mentioned in your PR that we could potentially adjust the Exit classes, but it would not solve the problem as well. Is the main issue that we have System.exit/halt calls that are not going via the Exit classes or something else?

Oct 06 '22 16:10 ijuma

@ijuma It's partially that we sometimes have direct calls to System::exit (although it's rare that these would make it past review), and partially that we wouldn't have a way to automatically fail tests that make unexpected calls to Exit::exit or Exit::halt.

I think if we could find a way to address the latter point, it'd be worth it to just modify the existing Exit classes instead of installing the custom extension here. But I haven't found a way to do that yet.

Oct 10 '22 18:10 C0urante

I think there's no way around the junit extension, but I was wondering if we could use that with the Exit classes versus the security manager approach. Then we don't have to depend on the deprecated security manager. But if that's too difficult, then we can go with the security manager approach and adjust it for newer JDK versions when/if a new api is introduced to replace it.

Oct 14 '22 14:10 ijuma

@ijuma Sorry for the delay.

The tradeoff involved is whether we want to account for direct calls to System::exit or not. If we're confident enough in our review process to catch those before they get merged, or at least to quickly identify them after they've been merged and started causing issues in CI, then it should be fine to remove the security manager and move everything into the Exit class. I'll give that a shot and see if I can push a new draft by the end of the week.

Oct 31 '22 19:10 C0urante

I've updated the PR to not use a custom SecurityManager, and drop support for catching direct calls to System::exit.

@divijvaidya @ijuma If you have time, let me know what you think; hoping this is readable to fresh eyes, but want to make sure.

Nov 01 '22 20:11 C0urante

Closing due to lack of interest; we can revisit this if necessary.

Jun 06 '23 17:06 C0urante