accumulo icon indicating copy to clipboard operation
accumulo copied to clipboard

Create pluggble server-side UncaughtExceptionHandler

Open dlmarion opened this issue 2 years ago • 6 comments

Is your feature request related to a problem? Please describe.

Accumulo 2.1.x included a default non-configurable UncaughtExceptionHandler that terminates the VM on Error. There are some Error's that are not fatal, like a NoClassNotFoundError. With the use of ContextClassLoader's it's possible for an administrator to incorrectly configure the context, leading to a NoClassNotFoundError at scan time, which would cause the termination of one or more TabletServers. While this could be rare, the impact could be huge.

Describe the solution you'd like A pluggable server-side UncaughtExceptionHandler

dlmarion avatar Oct 19 '23 15:10 dlmarion

It seems like this isn't necessary, as this can be handled in the ContextClassLoaderFactory implementation itself with a well-written implementation.

Our uncaught exception handler is to protect Accumulo itself from unknowable JVM states, often caused by bad user pluggable code that Accumulo cannot safely reason about. Even something that seems as simple as NoClassDefFoundError can be accompanied by serious state changes in the JVM. For example, a NoClassDefFoundError can occur because of an exception in a static initializer block while trying to load a class. One problem is that that static initializer can change some JVM-wide state before it fails, leading to an unknowable state. So, even though the class couldn't load, it doesn't mean the JVM is in a knowable good state. These errors are fatal because they can leave the JVM in a bad, often unknowably bad, state. The NoClassDefFoundError can also be caused by a previous failure to load a class that was cached by the system classloader. The first time it tried to load, it might throw an ExceptionInInitializerError or other LinkageError. Subsequent attempts might simply report NoClassDefFoundError because the classloader is aware that there's no valid class definition available, and this would not be recoverable.

These kinds of errors are different from say, ClassNotFoundException, which is a recoverable error that can happen when trying to dynamically load a class with something like Class.forName or similar. This can be recoverable if the reason it failed to load was because an item was missing from the classpath and has been subsequently placed there, or if there was a config typo and that typo was subsequently fixed.

In either case, though, most of these situations can be addressed by having a good classloader implementation that behaves the way the user wants it to, or by having good configuration management to ensure their system is deployed properly. I think the least best option is allowing users to muck about with changing how Accumulo protects itself and the user data it is responsible for, as the result of user errors during deployment, and that's one role I think the server-side uncaught exception handler is often doing.

I think there are better approaches already available to users that have not been exhausted, in particular the use of a robust pluggable ContextClassLoaderFactory implementation that satisfies the user's behavioral requirements, that would be far less risky.

ctubbsii avatar Oct 20 '23 09:10 ctubbsii

I think there is a case where a NoClassDefFoundError could be thrown without the use of a context set on a table and the ContextClassLoaderFactory. Consider an iterator class that is configured for a scan, which would be loaded here. The iterator class could be found, but could possibly be dependent on a class in a jar that is not on the classpath.

I believe in this scenario a NoClassDefFoundError would be thrown and the TabletServer terminated. The scan for the Tablet would fail, the Tablet would be re-hosted, and possibly take down another TabletServer. If it's the case that an adminnistrator forgot to deploy the jar on several or all TabletServers, then this could cascade to many or all TabletServers.

dlmarion avatar Oct 23 '23 20:10 dlmarion

Rather than a low-level bypass of JVM error handling, I would rather focus our attention on safer alternative solutions to protect users from common configuration management errors, such as (possibly):

  • Encourage the use of our testClassLoad API. Maybe improve this if it has shortcomings,
  • Add high level checks during the setup of the iterator stack, catching ClassNotFoundException, or add a "check" method to iterators (if they implement a Checkable interface), so we can get fast (and recoverable) failures from scans,
  • Automatically do a testClassLoad check whenever a table property is attempted to be set that involves a property of type CLASSNAME (and on startup), similar to how we now check the context name in #3683

ctubbsii avatar Oct 25 '23 05:10 ctubbsii

NoClassDefFoundError can be accompanied by serious state changes in the JVM. For example, a NoClassDefFoundError can occur because of an exception in a static initializer block while trying to load a class. One problem is that that static initializer can change some JVM-wide state before it fails, leading to an unknowable state. So, even though the class couldn't load, it doesn't mean the JVM is in a knowable good state. These errors are fatal because they can leave the JVM in a bad, often unknowably bad, state.

Based on this and its preceding sections, I"m not sure that this can happen. It appears that when loading a class the object hierarchy is loaded, resolved, and verified before any initialization happens.

The problem with the testClassLoad method, if I remember correctly what it does, is that it reaches out to a random TabletServer to see if it can load a class. If a subset of the TabletServers are missing a jar, then this would only randomly test the condition, but would always terminate the TabletServer when the user scan attempts to scan a hosted Tablet. Your second bullet, checks during the setup of the iterator stack, might be a good compromise. If we added LinkageError to the catch statement here for example.

dlmarion avatar Oct 25 '23 11:10 dlmarion

@ctubbsii I did handle all errors in a separate class loader and class loader factory. This time the errors were outside of that path. Indeed my class loader threw a class not found exception, and the system class loader turned that into a class no def error which subsequently halted the tserver. Now I have to change the uncaught exception handler in our fork to catch LinkageError as well and not halt in that case. There may be a better place to put that catch as @dlmarion suggests to narrow the scope down to just scans. I will dig up the stack trace tomorrow and send it around.

The basic issue here is that any user could setup a scan (intentionally or unintentionally) with classes that end up with this error and take down the system. This cannot be allowed to happen.

ivakegg avatar Oct 25 '23 12:10 ivakegg

For scan time iterators, I definitely think we should check that the specified class is available on the class path before we allow the scan to continue. It definitely shouldn't kill the tablet server if the user specified a scan with an unavailable iterator. That's not a configuration management error. That's a fail on Accumulo's part in validating user input. You shouldn't have to override the default behavior to get that. That should just happen by default, at least for CNFE. For an iterator that is available, but whose dependencies aren't, and that causes NoClassDefFoundError, that admin error is harder to deal with by default, as the error can cause problems later and may not be recoverable and it's not possible for Accumulo to automatically know if it is or isn't. But we can at least do the first thing by default, so the user doesn't have to do custom things for that.

ctubbsii avatar Oct 25 '23 12:10 ctubbsii