accumulo icon indicating copy to clipboard operation
accumulo copied to clipboard

Improve handling of interrupt conditions.

Open EdColeman opened this issue 4 years ago • 8 comments
trafficstars

Is your feature request related to a problem? Please describe. We are not consistent on handling interrupts.

Describe the solution you'd like

  1. Code should not swallow interrupts without at least resetting the interrupt flag - throwing an Exception if possible and appropriate.
  2. If an exception cannot be thrown, the caller code should be checking interrupt status and reacting if an interrupt occurs.

Describe alternatives you've considered The situation described below is a concrete examples, but similar issues occur in other places - these could be handled in bulk - or at least if the code is modified for other reason, corrected at that time on a case by case basis.

Additional context When reviewing #1890 - the code in BulkImportCacheCleaner caught the interrupt exception like this:

   } catch (KeeperException | InterruptedException e) {
      // we'll just clean it up again later
      log.debug("Error reading bulk import live transactions {}", e);
    }

The code was modified to reassert the interrupt with:

https://github.com/apache/accumulo/blob/307bef2fe6a8b2b7152eabaf9b16950cff72729a/server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/BulkImportCacheCleaner.java#L57-L64

The caller code - in TabletServer schedules this task with the following:

https://github.com/apache/accumulo/blob/6af856b38bb57f49418547a8b9c262c66a1fe31e/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java#L778-L780

The call to scheduleWithFixedDelay returns a ScheduledFuture. that future could be used to attempt to cancel running tasks - or at least prevent new tasks from being scheduled if the interrupt status was checked. The goal would be to provide a more graceful approach to termination when possible (like closing file descriptors) - or at least improve termination responsiveness, The running tasks would probably need to properly handle mayInterruptIfRunning

Other considerations:

Some of the exception handling may have improved with the standardization of thread pools in #1808 - and maybe that code could also check for interrupts as an alternate approach.

This task may overlap with #946

EdColeman avatar Jan 29 '21 23:01 EdColeman

I recall discussing this in the past. There may be some old JIRA issues that this would supersede that could be closed with this.

ctubbsii avatar Jan 30 '21 00:01 ctubbsii

Some parts of the code could use this util: https://github.com/apache/accumulo/blob/3fd5cad92f9b63ac19e4466f3f2d5237b905262c/core/src/main/java/org/apache/accumulo/fate/util/UtilWaitThread.java#L49 But it should probably be moved out of the FATE package. It is already used in a bunch of ITs.

milleruntime avatar Feb 01 '21 14:02 milleruntime

Some parts of the code could use this util: https://github.com/apache/accumulo/blob/3fd5cad92f9b63ac19e4466f3f2d5237b905262c/core/src/main/java/org/apache/accumulo/fate/util/UtilWaitThread.java#L49

But it should probably be moved out of the FATE package. It is already used in a bunch of ITs.

This is a holdover from when accumulo-fate was a separate jar. A lot of the fate stuff was left in their current packages to avoid making any changes that might affect FaTE serialization. This one is probably safe to move, since it is a static void method only, and there's probably other generic utilities that can be moved as well, but since it's all internal, it doesn't matter that much to move it. But, if somebody does move it, they should be careful to avoid breaking FaTE serialization by moving serialized objects.

ctubbsii avatar Feb 01 '21 17:02 ctubbsii

It seems that there were multiple packaging decisions that were originally made with the FATE design - I believe the stated goal was that FATEs could become a stand-alone product outside of Accumulo. Should we revisit that? If moving the method would impact FATE serialization should other changes be made concurrently so that FATE serialization would only change once? Also, I believe that during upgrades it is necessary to insure that no FATE transactions are pending or the upgrade aborts - so changes in serialization should not be a factor between upgrades.

EdColeman avatar Feb 01 '21 18:02 EdColeman

It seems that there were multiple packaging decisions that were originally made with the FATE design - I believe the stated goal was that FATEs could become a stand-alone product outside of Accumulo.

While that may have been something that was considered, the implementation was, and has always been, very tightly coupled to Accumulo itself.

Should we revisit that? If moving the method would impact FATE serialization should other changes be made concurrently so that FATE serialization would only change once?

The problem is that we use Java serialization (in hindsight, a bad choice), and therefore, lots of Java objects can accidentally leak into our serialized FaTE objects. We have, for example, unintentionally serialized Thrift objects (I believe TInfo was one). I don't know of any currently active proposals to redesign FaTE to decouple it from Accumulo, or to change how we serialize objects there any time soon.

I don't see any compelling reason to revisit FaTE serialization at this time. But, if we did do something like that, it certainly would be better to do it across the board all at once. One future compelling reason might be the introduction of record objects in a future Java release, which may have safe and reliable serialization strategies. Another compelling reason would be to take advantage of some protobuf or thrift serialization for storage efficiency or performance, or something else. However, libraries like protobuf and thrift are notoriously problematic with dependency convergence.

Also, I believe that during upgrades it is necessary to insure that no FATE transactions are pending or the upgrade aborts - so changes in serialization should not be a factor between upgrades.

In general, that is true for major releases, but not for patch/bugfix releases. I don't know off the top of my head what we've done for minor releases, or what our current upgrade code would expect before upgrading from 2.0 to 2.1. However, the more incompatibilities we avoid, the easier our upgrade code and testing paths are.

In any case, I think it should be fine to relocate UtilWaitThread, though I don't see a huge value in it. I'm just suggesting caution whenever relocating any object that has potential to be serialized. Classes in packages marked as "fate" have a higher likelihood of being serialized, but they are not the only objects that could be affected by class relocations.

ctubbsii avatar Feb 01 '21 20:02 ctubbsii

May want to look at flatbuffers if you have not done so already.

dlmarion avatar Feb 01 '21 20:02 dlmarion

In any case, I think it should be fine to relocate UtilWaitThread, though I don't see a huge value in it. I'm just suggesting caution whenever relocating any object that has potential to be serialized. Classes in packages marked as "fate" have a higher likelihood of being serialized, but they are not the only objects that could be affected by class relocations.

I agree. I only recommended its use as an easy way to standardize our handling of the sleep and catch IterruptedException use case, one of the situations this ticket aims to fix.

milleruntime avatar Feb 02 '21 14:02 milleruntime

Caller Code (TabletServer), consider using the returned ScheduledFuture from scheduleWithFixedDelay to cancel running tasks or prevent new ones if an interrupt occurs. This allows for a more graceful termination (closing resources) and improved responsiveness. Faced the same problem during my web development career but training at Full Stack Web development course helped me.

DataSciencemumbai avatar May 20 '24 06:05 DataSciencemumbai