chime icon indicating copy to clipboard operation
chime copied to clipboard

java.lang.InterruptedException race in chiming functions.

Open dazld opened this issue 4 years ago • 42 comments

There seems to be a race between closing a chime and the chiming function invocation that can lead to a java.lang.InterruptedException being thrown when functions are awaiting something coming from, for example, a future, or have consumed a function that does this somewhere along the call stack.

It'd be nice if either the chiming function was never called, or any inflight functions were allowed to complete before closing.

Example:

(let [now (Instant/now)
      chiming (chime/chime-at (chime/periodic-seq now (Duration/ofSeconds 1))
                              (fn [time]
                                (log/info :future-task @(future (count {})))
                                (log/info :chime (str time)))
                              {:error-handler (fn [e]
                                                (log/error :chime-failed e)
                                                true)
                               :on-finished #(log/info :done)})]
  (Thread/sleep 1000)
  (.close chiming))

outputs:

2020-09-07 15:51:25.548 INFO  :future-task 0
2020-09-07 15:51:25.549 INFO  :chime 2020-09-07T14:51:25.547924Z
2020-09-07 15:51:26.550 INFO  :done
2020-09-07 15:51:26.550 ERROR :chime-failed #error {
 :cause nil
 :via
 [{:type java.lang.InterruptedException
   :message nil
   :at [java.util.concurrent.FutureTask awaitDone FutureTask.java 418]}]
 :trace
 [[java.util.concurrent.FutureTask awaitDone FutureTask.java 418]
  [java.util.concurrent.FutureTask get FutureTask.java 190]
  [clojure.core$deref_future invokeStatic core.clj 2300]
  [clojure.core$future_call$reify__8454 deref core.clj 6974]
  [clojure.core$deref invokeStatic core.clj 2320]
  [clojure.core$deref invoke core.clj 2306]
  ...
  [chime.core$chime_at$schedule_loop__13568$task__13572$fn__13573 invoke core.clj 91]
  [chime.core$chime_at$schedule_loop__13568$task__13572 invoke core.clj 90]
  [clojure.lang.AFn run AFn.java 22]
  [java.util.concurrent.Executors$RunnableAdapter call Executors.java 515]
  [java.util.concurrent.FutureTask run FutureTask.java 264]
  [java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask run ScheduledThreadPoolExecutor.java 304]
  [java.util.concurrent.ThreadPoolExecutor runWorker ThreadPoolExecutor.java 1128]
  [java.util.concurrent.ThreadPoolExecutor$Worker run ThreadPoolExecutor.java 628]
  [java.lang.Thread run Thread.java 834]]}

Note, might have to run this a couple of times to see the exception, as it doesn't always happen.

dazld avatar Sep 07 '20 14:09 dazld

Thanks @dazld, will have a look :smile:

jarohen avatar Sep 09 '20 17:09 jarohen

No biggie! Was taking chime for a test drive and these couple of things popped out. While we’re talking, would you mind pushing a release of the version with the IPending implementation? That threw me as well, and noticed it’s fixed in master.

dazld avatar Sep 09 '20 21:09 dazld

Ah, I see what's going on here - I'm afraid this is (currently, at least) intended behaviour within Chime. If a schedule is explicitly closed, we assume the system is shutting down in some way, and interrupt the scheduling thread to give it a chance to cleanly close any resources it's opened. Interrupting a thread that's currently waiting on a deref causes the deref to throw the InterruptedException.

With that in mind, we can repro it reliably by making the future (Thread/sleep 500), say.

One example of prior art here would be the difference between ExecutorService's shutdown and shutdownNow methods - Chime behaves like the latter. I'll preserve this for close, at least, to avoid breaking users that depend on it.

I'll consider what it'd involve to add something like that to Chime, and what it's impact would be. In the meantime, maybe either a lock or a shared continue? flag would achieve the same aim?

(let [now (Instant/now)
      close-lock (Object.)
      chiming (chime/chime-at (chime/periodic-seq now (Duration/ofSeconds 1))
                              (fn [time]
                                (log/info :chime (str time))
                                (locking close-lock
                                  (log/info :future-task @(future (Thread/sleep 500)))))
                              {:error-handler (fn [e]
                                                (log/error :chime-failed e)
                                                true)
                               :on-finished #(log/info :done)})]
  (Thread/sleep 250)
  (locking close-lock
    (.close chiming)))

Re the 0.3.3 release - I've pushed a 0.3.3-SNAPSHOT up with the other fixes.

Thanks again for raising this issue :)

jarohen avatar Sep 12 '20 19:09 jarohen

The behaviour of .shutdownNow() (as seen from consumers) can be simulated with a cancelled? (AtomicBoolean) flag, can't it? Instead of calling (f time) true, you would call (when (false? (.get cancelled?)) (f time) true). Doesn't this make moving to .shutdown() possible? Yes, it would leave the task scheduled for one more time, but that time will not actually run. I'd say that's a sensible compromise in order to get rid of the racy nature of Interrupts. Just a thought...

jimpil avatar Dec 26 '20 17:12 jimpil

BTW, I tried out the approach described in my previous comment, and the only test that breaks is this (for obvious reasons):

FAIL in (test-cancelling-overrunning-task) (core_test.clj:111)
expected: (instance? InterruptedException (clojure.core/deref !error))
  actual: nil

Ran 18 tests containing 45 assertions.
1 failures, 0 errors.
Tests failed.

jimpil avatar Dec 27 '20 09:12 jimpil

A complementary approach would be to let consumers choose how the executor is shutdown (i.e. :orderly VS :abruptly), defaulting to the latter for backwards-compatibility. Here is a sketch:

(let [pool (Executors/newSingleThreadScheduledExecutor thread-factory)
      !latch (promise)
      cancelled? (atom false)
      shutdown* (case exec-shutdown 
                     :orderly (fn [] 
                                (reset! cancelled? true)
                                (.shutdown pool))
                     :abruptly (fn [] (.shutdownNow pool)))] ;; default

jimpil avatar Dec 27 '20 10:12 jimpil

I retract my last two comments! There is already a !latch that can be leveraged:

(when-not (realized? !latch)
   (f time)
   true)

This allows moving to .shutdown() w/o introducing anything new, and w/o breaking anyone. The only 'breaking' (if you want to call it that) change is that custom error-handlers no longer have to mimic the default one (wrt to checking for the InterruptException) - something which, to my surprise, is not documented. Isn't it a problem if a provided error-handler ignores the interrupt and returns true (in the current interrupt-based approach)?

[EDIT]: Actually it's not 'breaking', is it? The default error-handler (or any other that already mimics it) can stay as is - they just will never see an InterruptedException, so that check is effectively redundant (functionally equivalent to true). In the absence of interrupts altogether, the overall behaviour of chime doesn't change - does it?

jimpil avatar Dec 28 '20 17:12 jimpil

I'm trying to digest the following comment (seen a few messages above):

One example of prior art here would be the difference between ExecutorService's shutdown and shutdownNow methods - Chime behaves like the latter. I'll preserve this for close, at least, to avoid breaking users that depend on it.

This is an implementation detail, and users shouldn't depend on it - in fact I'm struggling to see how they even could. Users of chime come to it looking for a flexible scheduling primitive. The promise/guarantee of chime can be exhausted in single sentence - your function will be called at the times you provided, unless you manually cancel (and that's very powerful btw). As long as chime can uphold that promise, what constructs you use to implement it is sort of irrelevant. Users that depend on esoteric intricacies of some implementation should know better, but in the case of chime, with such a narrow API I honestly don't see how users can possibly be depending on the difference of shutdown VS shutdownNow(). The only thing that (indirectly) cares about it is the default error-handler which special-cases InterruptedException (somewhat sneakily I might add, as I can't find any mentions that this type of exception should be given any form of special attention - say by custom error handlers).

My point is that chime can keep its exact same api/functionality, while slightly changing its implementation, which today is centered around a non-deterministic/best-effort/racy interrupt-based approach (all because of shutdownNow()), and nobody will be able to tell the difference.

Looking back at the history of the project, I can see that !latch is a kind of a recent addition (as an attempt to make sure that the on-finished callback is called only once). This means that in its conception the code didn't have this notion of are-we-done? that the !latch later brought. In a world where you don't have this explicit notion, the best you can do is play with interrupts and hope for the best. In that context, I can totally see why the original implementation started out as it did (based on interrupts). It's actually pretty neat/clever if you think what it had to work with. However, you later did add the !latch and that gives you all the control you need to move away from interrupts, while keeping (I'd argue more reliably) the don't call f after cancel semantics that users care about.

Anyway, enough waffling...i think it's pretty clear how much I want to use chime... :+1:

jimpil avatar Dec 28 '20 23:12 jimpil

Hey @jimpil, thanks for the analysis - you've clearly gone quite deep into this :smile:

If I may, to summarise, it seems there's a couple of points here:

  1. "Users may want both shutdown and shutdownNow behaviour." Indeed, I recall reading that this was the reason that ExecutorService doesn't implement Closeable - because there's no good default here. Maybe we should expose both? I do think that interrupts are the right thing to use here, in the shutdownNow case, given this is how ExecutorServices work - but we could of course be clearer about it.
  2. "The default error handler knows to handle InterruptedException specifically, which is arguably too much knowledge" - agreed. I think you've pretty much arrived at the correct cause here - the error handler has been extracted out of what Chime did internally before, and has hung on to the previous implementation detail. I wouldn't be averse to a new handler here (deprecating the old one) with clearer semantics - maybe something closer to the underlying thread's 'uncaught exception handler' would be good prior art to lean on.

Have I understood it correctly? :)

James

jarohen avatar Dec 29 '20 17:12 jarohen

Hi @jarohen,

See my comments inline:

Users may want both shutdown and shutdownNow behaviour

We may arrive at this conclusion in the end, but it is kind of premature to consider this at this stage. We first need to understand the following:

  1. What is the public promise of chime?
  2. How is that achieved (i.e. what is shutdownNow contributing)?
  3. Is it an absolute guarantee, or is it more like a best-effort approach?
  4. Are there any alternatives that don't compromise the promise?

Assuming that we agree on chime's singular promise (i.e. I shall call your function at the specified times, until you manually cancel), it's an interesting exercise to go through the rest.

  1. How is that achieved (i.e. what is shutdownNow contributing)?

As I've already explained, when there is no global notion of done?, the only way you can implement the until you manually cancel part is via interrupts. Once you accept/embrace this compromise, then of course InterruptedException has to be promoted (so to speak), as that has effectively become the done? signal. In other words, the implementation makes perfect sense, assuming that interrupts are the only viable approach.

  1. Is it an absolute guarantee, or is it more like a best-effort approach?

I think you'll agree that relying on interrupts is inherently racy, and therefore by definition a best-effort approach

  1. Are there any alternatives that don't compromise the promise?

There hasn't always been one, but the moment you added the !latch you've essentially introduced a global notion of done? (a great name for the !latch btw). Guess what - you no longer need interrupts - you have your own, fully reliable way of communicating when things should stop. This alone, enables you to move to shutdown w/o lying (in the sense of not upholding your promise) to anyone.

Now, if after doing all that, you still feel that there are masochists out there who would actually prefer their core scheduling primitive to be racy/not-deterministic and centered around interrupts, despite chime never having made such a promise, and despite virtually no reason to go down that route, then OK sure perhaps you can consider exposing a racy shutdown option, and make it an official feature (or bug depending on your perspective). This is up to your discretion and taste. However, having a non-racy default (which remember, upholds the core promise) is a no-brainer.

I do think that interrupts are the right thing to use here, in the shutdownNow case, given this is how ExecutorServices work - but we could of course be clearer about it.

Yes indeed, if you go down the shutdownNow route you literally have no choice. Being clearer, and instructing users to always keep an eye out for InterruptedException in their error-handlers is super important, given the current implementation. I bet there are plenty of error-handlers out there that don't pay any attention to InterruptedException, but frankly I don't blame them. You can't expect everyone to have the time to go through the source code of every single lib they're using.

"The default error handler knows to handle InterruptedException specifically, which is arguably too much knowledge" - agreed.

Yes but this is kind of necessary. The real problem, given the current implementation is that it's not communicated to the users (e.g via the doc-string - it just mentions truthy VS falsy).

I wouldn't be averse to a new handler here (deprecating the old one) with clearer semantics - maybe something closer to the underlying thread's 'uncaught exception handler' would be good prior art to lean on.

There is no better/clearer error-handler given the current implementation. It will ultimately be an InterruptedException that is thrown (via shutdownNow). Whether you handle that via a global exception-handler (or not), doesn't change the fact that it is a racy interrupt. I would advise taking a step back, and asking yourself why does it need to be this way? (i.e. centered around interrupts)? And if there is indeed a reason that it has to be this way, how come it's not reflected in the tests (i.e. all tests pass when I switch from shutdownNow to shutdown)?

Hope that helps in your thinking process :+1:

jimpil avatar Dec 29 '20 18:12 jimpil

Thanks

An important meta point, first of all:

Now, if after doing all that, you still feel that there are masochists out there who would actually prefer their core scheduling primitive to be racy/not-deterministic and centered around interrupts, despite chime never having made such a promise, and despite virtually no reason to go down that route, then OK sure perhaps you can consider exposing a racy shutdown option, and make it an official feature (or bug depending on your perspective). This is up to your discretion and taste. However, having a non-racy default (which remember, upholds the core promise) is a no-brainer.

Please don't dismiss people with alternative opinions in this way on my repos. You have very valid points and suggestions that I'm more than willing to consider - there is no need to detract from them by using such language and tone.

jarohen avatar Dec 29 '20 20:12 jarohen

Please don't dismiss people with alternative opinions in this way on my repos. You have very valid points and suggestions that I'm more than willing to consider - there is no need to detract from them by using such language and tone.

You're right - I apologise for my poor choice of words, but I'm not actually dismissing anyone. I'm basically concluding that there may ultimately be value in accommodating those people (regardless of their opinions, or how they might look to me). By the way, I'm not a native speaker, but I totally get your point...I re-read my comment and i could have easily expressed the same point using different words. Point taken :+1:

jimpil avatar Dec 29 '20 20:12 jimpil

I think you'll agree that relying on interrupts is inherently racy

I disagree, I'm afraid. Could you elaborate?

There hasn't always been one, but the moment you added the !latch you've essentially introduced a global notion of done?

Guess what - you no longer need interrupts - you have your own, fully reliable way of communicating when things should stop.

I disagree with these too - that we have a 'global notion of done?', that we have a 'fully reliable means of communication without interrupts', and hence that we don't need them. What happens if the user wants to shut down the schedule while the job is currently running?

Openly, I'm a little discouraged by your outright dismissal of interrupts without an acknowledgement of under what circumstances they might be useful. It doesn't fill me with confidence that you've fully considered the benefits of the opposite position before rejecting it. Similarly, I am aware of their pitfalls - so please don't assume I introduced them without due thought and consideration, nor that I am not sufficiently experienced that I don't know of alternatives.

We may arrive at this conclusion in the end, but it is kind of premature to consider this at this stage. We first need to understand the following: ... Assuming that we agree on chime's singular promise (i.e. I shall call your function at the specified times, until you manually cancel), it's an interesting exercise to go through the rest.

Likewise. I have been working on/with Chime for ~7 years now. While I don't pretend to have gotten everything right, please do consider that I may have put some thought into it over that time! :)

jarohen avatar Dec 29 '20 21:12 jarohen

I disagree, I'm afraid. Could you elaborate?

The perils of the interrupt flag, what/how affects it, and the general pains of using it correctly have been discussed extensively, and I'd like to avoid focusing on that, as it's rather secondary to my main point. Their most useful feature is the fact that blocking operations like network IO, .sleep(), .wait() etc (things out of our control) attempt to abort/cleanup when they see one. But how does that relate to chime? Does chime aspire to attempt to abort a task that is happening 'right-now' (i.e. has already started)? Which brings me to my next point...

I disagree with these too - that we have a 'global notion of done?', that we have a 'fully reliable means of communication without interrupts', and hence that we don't need them. What happens if the user wants to shut down the schedule while the job is currently running?

A user who wants to stop a side-effecting task that has already started (e.g sending an email, writing a file etc), is by definition a user who doesn't care about how much of the side-effect actually happened. I know it seems like the opposite because he/she (seems to) wants for things to stop asap, but there is no way for them to actually know, unless they are the ones driving it. Moreover, what does it mean to stop the schedule for consumers? Are we not talking about the schedule as a whole (as opposed to a single scheduled task)? If so, then what happens in the real world? What happens when you try to cancel a recurring direct-debit on the same day (let alone same instant) it's supposed to go out? I don't know about your bank, but with mine what gets cancelled is the next instalment. What happens with Amazon when you try to cancel a subscription that has already been processed (a few days before it actually arrives), or after a parcel has been dispatched? If i remember correctly, in the latter case the option to cancel is not even there anymore - you have to go the refund route. Generally speaking, there is a cutoff point with these things, after which it simply doesn't make sense to cancel. In chime's case, that cutoff moment can/should(?) be considered to be the moment something starts to run. Anything else opens the door to potential leaks, corrupted-files, half-sent or abandoned http requests, emails that reached half their usual recipients, or god knows what else - things that should generally be avoided. If you want for user-code to be able to bail-out (say from a loop) when the thread is interrupted, that is kind of a different discussion, to which I'd say that the notions of closing VS cancelling should probably be distinct.

Openly, I'm a little discouraged by your outright dismissal of interrupts without an acknowledgement of under what circumstances they might be useful. It doesn't fill me with confidence that you've fully considered the benefits of the opposite position before rejecting it.

It's rather sad to hear this, as it proves that I didn't quite get the message across. I'm not opposing, nor dismissing interrupts in the general sense. I just don't think they provide a sound foundation to carry the unscheduling semantics of close. They are a great candidate for the cancelling/interrupting notion i mentioned above, but firstly that should be driven ideally only by user-code, and secondly it should be a distinct thing (feature if you will). As far as the benefits of the opposite position go, I was the first to admit/point-out that up until the (recent) addition of !latch, relying on shutdownNow was not just desirable, but also necessary as w/o it you would always have one more side-effect produced after un-scheduling (and that's clearly a problem).

Likewise. I have been working on/with Chime for ~7 years now. While I don't pretend to have gotten everything right, please do consider that I may have put some thought into it over that time! :)

I don't know what idea you got, but I consider the work you've done with chime simply excellent - from the core idea down to the actual implementation, and ultimately I am utterly grateful for the hours you've put in. The discussion we're having here is not to have a go at you, or your work. Quite the contrary I would say - I'm itching to use chime for serious work, and as a consequence I want to understand it and improve it (if possible of course). I am more than happy to be enlightened on anything you feel is important.

Finally, I was already wondering in the back of my head, but given how much more you've re-enforced your position around interrupts and shutdownNow, I just have to ask with a straight face, otherwise I won't sleep tonight :(. How is it possible that such a design center-piece can be swapped out w/o any (real) tests breaking?

Thanks again...

jimpil avatar Dec 29 '20 23:12 jimpil

It just occurred to me that the reify returned could implement Future :

Future
 (cancel [_ interrupt?]
   (when-let [^Future fut @current] ;; `current` is an atom which holds the current-job's future (as returned by `.schecule`)
      (and (not (.isCancelled fut)) 
           (.cancel fut interrupt?))))

Again, just a thought - I'll come back to this with a clean head tomorrow and re-evaluate...

jimpil avatar Dec 29 '20 23:12 jimpil

This is kind of neat and it also kind of addresses #30, if I'm understanding the request correctly:

ScheduledFuture
(cancel [_ interrupt?] ;; expose interrupt facilities (opt-in)
  (when-let [^ScheduledFuture fut @current]
    (and (not (.isCancelled fut))
         (.cancel fut interrupt?))))
(getDelay [_ time-unit] ;; expose remaining time until next chime (partly addresses #30)
  (when-let [^ScheduledFuture fut @current]
    (.getDelay fut time-unit)))

jimpil avatar Dec 30 '20 00:12 jimpil

I was just reading this this morning, after several years of reading the actual book. I am not going to copy-paste any sections in particular, as there is a lot of good information in there, but the key takeaway (for me at least), is that the right reaction to an interrupt, will ultimately depend on the actual method that is being interrupted. In other words, there is no one size fits all solution, and chime is certainly in no position to know what the semantics of the underlying chiming-fn are. Only the consumer knows that, hence only the consumer can make that decision.

jimpil avatar Dec 30 '20 09:12 jimpil

Hey, thanks for the clarifications.

I think you'll agree that relying on interrupts is inherently racy

I disagree, I'm afraid. Could you elaborate?

The perils of the interrupt flag, what/how affects it, ...

Ah, ok - apologies, I thought you were suggesting that the interrupt flag had/caused race conditions (the original subject of this issue). I definitely agree that interrupts have their perils - if we can get closer to the textbooks this can only be a good thing.

jarohen avatar Dec 30 '20 12:12 jarohen

If you want for user-code to be able to bail-out (say from a loop) when the thread is interrupted, that is kind of a different discussion, to which I'd say that the notions of closing VS cancelling should probably be distinct.

This is the essence of my original proposal - I recognise that there are two distinct use cases, and Chime currently only handles the shutdownNow case - you'd already convinced me that it should support both.

Generally speaking, there is a cutoff point with these things, after which it simply doesn't make sense to cancel. In chime's case, that cutoff moment can/should(?) be considered to be the moment something starts to run. Anything else opens the door to potential leaks, corrupted-files, half-sent or abandoned http requests, emails that reached half their usual recipients, or god knows what else - things that should generally be avoided.

IMO it's not really for Chime to decide this. In terms of opinions, it's a scheduler that takes a seq of times and a function, that's it. Beyond that, it's for Chime's users to determine how to handle errors, how to shutdown safely, etc. If you're supplying a long-running function (say, an overnight batch - the original use-case that led to Chime) that is interrupt-aware, you should be able to interrupt it.

Where I think we're short atm is the more graceful stop - the user should (but doesn't currently) have the option to stop the schedule after the currently running job. On balance, I'd say it'd be less intuitive to run the next schedule, even if you stopped it beforehand? In your examples with banks and Amazon, their systems may well have constraints which means they need to operate like this - they might have daily batch processing of DDs, or the parcel may well already be making its way through the logistics chain - but we don't need to impose such constraints on all Chime users, who may just be using the scheduler to send a daily email? If I stopped the 2am daily schedule at midday on the 30th, I wouldn't expect it to send an email at 2am on the 31st, say.

Finally, I was already wondering in the back of my head, but given how much more you've re-enforced your position around interrupts and shutdownNow, I just have to ask with a straight face, otherwise I won't sleep tonight :(. How is it possible that such a design center-piece can be swapped out w/o any (real) tests breaking?

Hopefully the above clarifies my thoughts here? If it helps, I wouldn't go so far as to call it a 'design center-piece' - it's not the core value of Chime, and I'm certainly open to iterating on it. We do have a test for the current behaviour, which I think you've run into - test-cancelling-overrunning-task - changing the semantics to shutdown breaks this test.

jarohen avatar Dec 30 '20 16:12 jarohen

I was just reading this this morning, after several years of reading the actual book.

I wasn't aware of this checklist - thanks :)

jarohen avatar Dec 30 '20 16:12 jarohen

Re ScheduledFuture. I like the idea in theory - in practice it seems to have a couple of tradeoffs?

  • getDelay - would it be preferable to have an API that returns the next time, and let the user calculate the delay, if need be? By the time we've returned the delay, it's already out of date.
  • cancel - this introduces the question of whether I'm cancelling just the current job or not, potentially? I mean, this is potentially a use-case - 'I'd like to cancel the current job but not the overall schedule' (indeed, a question you mention further up). I'm also not mad keen on boolean params in cases like these, I always have to look them up - which is why I'm drawn to something like shutdown/shutdownNow. The word 'shutdown' (to me, at least) is a clearer indication that we're shutting down the whole thing.

Maybe this could be a chime-specific protocol, but with semantics/naming taken from ExecutorService and friends?

jarohen avatar Dec 30 '20 16:12 jarohen

It's rather sad to hear this, as it proves that I didn't quite get the message across. I'm not opposing, nor dismissing interrupts in the general sense.

I don't know what idea you got, ...

FWIW, it was about a handful of sentences in your original message that gave an overall impression of being a little dismissive and patronising. I appreciate the clarifications, thanks - won't dwell on them any further. Happy to file under 'the perils of text-only communication' and move on with improving Chime :)

jarohen avatar Dec 30 '20 16:12 jarohen

This is the essence of my original proposal - I recognise that there are two distinct use cases, and Chime currently only handles the shutdownNow case - you'd already convinced me that it should support both.

First of all, it's great to hear that there is some form of convergence. Secondly, both this comment of yours, and a later one make it sound like chime aimed/aspired to accommodate interrupt-aware loops from day one. If that is indeed the case, should this be mentioned somewhere in the README, or even better the doc-string?

IMO it's not really for Chime to decide this. In terms of opinions, it's a scheduler that takes a seq of times and a function, that's it. Beyond that, it's for Chime's users to determine how to handle errors, how to shutdown safely, etc

It is great that you have this opinion, but unfortunately chime doesn't... As we've established, currently chime hardcodes abrupt shutdowns, and its default error-handler correctness comes down to the fact that it knows more than your average error-handler. Pretty much none of the decisions you mentioned are left for the user to make :(.

If you're supplying a long-running function (say, an overnight batch - the original use-case that led to Chime) that is interrupt-aware, you should be able to interrupt it.

I couldn't agree more! You should definitely be able to interrupt it, but you should be explicit about what you're trying to do. The semantics of close have to be macroscopic (i.e. if chime-at means 'schedule', close on the return value must mean 'unschedule'). Trying to interrupt something that is (maybe) happening 'right-now' has little to do with the overall schedule (in my head). As i think you've correctly identified yourself, conflating the two in a single notion of cancel leads to compromises.

Where I think we're short atm is the more graceful stop - the user should (but doesn't currently) have the option to stop the schedule after the currently running job. On balance, I'd say it'd be less intuitive to run the next schedule, even if you stopped it beforehand?

I'm not following 100% what you mean here so let's try to use a minimal example using 3 successive times (t1, t2, t3).

  • User gracefully shutdowns the pool (somehow) and it happens to fall exactly on t1 => job at t1 finishes and pool shuts down
  • User gracefully shutdowns the pool sometime between t1 and t2 => job remains scheduled to run at t2 (but won't actually execute when the time comes as it will be protected by the latch)

Do you see any problems with the above? What do you find less intuitive?

In your examples with banks and Amazon, their systems may well have constraints which means they need to operate like this - they might have daily batch processing of DDs, or the parcel may well already be making its way through the logistics chain - but we don't need to impose such constraints on all Chime users, who may just be using the scheduler to send a daily email?

The only purpose of these examples was to showcase how tricky/leaky it is to express things like i want to cancel something that is already happening. A parcel having started to make its way through the logistics chain is not too different than having opened a Socket.

If I stopped the 2am daily schedule at midday on the 30th, I wouldn't expect it to send an email at 2am on the 31st, say.

Again, I couldn't agree more, but it kinda worries me that you have to state this at all, so could you elaborate if you don't mind? Have I said, or proposed something that would cause this?

We do have a test for the current behaviour, which I think you've run into - test-cancelling-overrunning-task - changing the semantics to shutdown breaks this test.

The only aspect of that test that fails is the (instance? InterruptedException @!error), which is simply the result of the error-handler never being called (because there was no InterruptedException). How is that meaningful breakage?

I wasn't aware of this checklist - thanks :)

This book is a goldmine :+1:

getDelay - would it be preferable to have an API that returns the next time, and let the user calculate the delay, if need be? By the time we've returned the delay, it's already out of date.

My understanding from the Java docs is that this is calculated on-the-fly every time you call it, so it's always correct. That said, I would need to look at the actual source to be 100% sure. To be clear, I'm not opposing what you suggested - I'm just saying that there is a good chance this does exactly what a consumer might want (to know how long until the next chime). After all, he/she does have access to the times (he/she provided them), and there is a utility fn (without-past-times) to use against them if someone wants to find out all the upcoming chime times.

cancel - this introduces the question of whether I'm cancelling just the current job or not, potentially? I mean, this is potentially a use-case - 'I'd like to cancel the current job but not the overall schedule' (indeed, a question you mention further up). I'm also not mad keen on boolean params in cases like these, I always have to look them up - which is why I'm drawn to something like shutdown/shutdownNow. The word 'shutdown' (to me, at least) is a clearer indication that we're shutting down the whole thing.

No such dilemma introduced. If you can call both close and cancel on something, then they have to mean different things. If you accept/agree that close means 'unschedule the whole thing', then cancel can only ever mean abort the current thing (potentially with interrupts). Doing both means calling both (in which order is again a different question). The matter of boolean params is kind of superficial I find. One could simply call clojure.core/future-cancel (or any other custom wrapper) and forget about the boolean param.

Maybe this could be a chime-specific protocol, but with semantics/naming taken from ExecutorService and friends?

I haven't explored this idea, simply because I didn't need to. You have already made the decision to implement AutoCloseable, and ScheduledFuture not only has the abstractions i needed, but it also happens to be the type .schedule returns! Staying within the java.util.concurrent abstractions seems like a good idea to me, but as I said i haven't explored your idea so take everything with a grain of salt.

jimpil avatar Dec 30 '20 19:12 jimpil

Re getDelay - w/o having looked at the source yet, this SO post kind of fills me with confidence regarding what I said earlier.

jimpil avatar Dec 30 '20 20:12 jimpil

First of all, it's great to hear that there is some form of convergence. Secondly, both this comment of yours, and a later one make it sound like chime aimed/aspired to accommodate interrupt-aware loops from day one. If that is indeed the case, should this be mentioned somewhere in the README, or even better the doc-string?

Yep, let's add it to the doc-string of the abrupt shutdown :+1:

It is great that you have this opinion, but unfortunately chime doesn't... As we've established, currently chime hardcodes abrupt shutdowns, and its default error-handler correctness comes down to the fact that it knows more than your average error-handler. Pretty much none of the decisions you mentioned are left for the user to make :(.

Right - propose that we fix this by exposing a separate graceful shutdown, and take the knowledge of the InterruptedException away from the error handler.

I couldn't agree more! You should definitely be able to interrupt it, but you should be explicit about what you're trying to do. The semantics of close have to be macroscopic (i.e. if chime-at means 'schedule', close on the return value must mean 'unschedule'). Trying to interrupt something that is (maybe) happening 'right-now' has little to do with the overall schedule (in my head). As i think you've correctly identified yourself, conflating the two in a single notion of cancel leads to compromises.

Agreed, close should close the whole schedule. Think it'll be clearer to expose two new shutdown methods (à la ExecutorService) - if we then alias close to shutdown-now (and make it clear that that's what it does/has done all along) we can keep it backwards compatible.

While I remember, I haven't yet considered the impact of this on the core.async wrapper - TODO :thinking:

I'm not following 100% what you mean here so let's try to use a minimal example using 3 successive times (t1, t2, t3).

* User gracefully shutdowns the pool (somehow) and it happens to fall exactly on `t1` => job at `t1` finishes and pool shuts down
* User gracefully shutdowns the pool sometime between `t1` and `t2` => job remains scheduled to run at `t2` (but won't actually execute when the time comes as it will be protected by the latch)

Do you see any problems with the above? What do you find less intuitive?

If I stopped the 2am daily schedule at midday on the 30th, I wouldn't expect it to send an email at 2am on the 31st, say.

Again, I couldn't agree more, but it kinda worries me that you have to state this at all, so could you elaborate if you don't mind? Have I said, or proposed something that would cause this?

Right - it's the 'job remains scheduled to run at t2' that got me here but I see you're covering that with the latch - fair. Might be worth shutting down the scheduler eagerly though, could be a while til the next run.

We do have a test for the current behaviour, which I think you've run into - test-cancelling-overrunning-task - changing the semantics to shutdown breaks this test.

The only aspect of that test that fails is the (instance? InterruptedException @!error), which is simply the result of the error-handler never being called (because there was no InterruptedException). How is that meaningful breakage?

Ah, ok - yep. As a first pass, this test could check that the sleep is interrupted, rather than relying on the error handling behaviour. After that, there's probably better ways of expressing the desired behaviour too.

My understanding from the Java docs is that [getDelay] is calculated on-the-fly every time you call it, so it's always correct. That said, I would need to look at the actual source to be 100% sure. To be clear, I'm not opposing what you suggested - I'm just saying that there is a good chance this does exactly what a consumer might want (to know how long until the next chime).

It's always correct, but only at the exact moment ~it returns~ it's calculated - after that, it's out of date, and the user doesn't know the 'now' basis that was used in the calculation. If we pass the raw date back to the user instead, they're in charge of choosing a fixed 'now' to make that calculation.

After all, he/she does have access to the times (he/she provided them), and there is a utility fn (without-past-times) to use against them if someone wants to find out all the upcoming chime times.

We've fixed an issue previously (#10) around holding on to the head of the lazy seq - I wouldn't want to ask the user to do the same, especially if it's infinite. We have the tail of the sequence available, and could expose it - this would be particularly suitable if the user only wanted the next time, as it wouldn't require evaluating any more of the sequence.

Let's take that one to #30 :)

cancel - this introduces the question of whether I'm cancelling just the current job or not, potentially? I mean, this is potentially a use-case - 'I'd like to cancel the current job but not the overall schedule' (indeed, a question you mention further up). I'm also not mad keen on boolean params in cases like these, I always have to look them up - which is why I'm drawn to something like shutdown/shutdownNow. The word 'shutdown' (to me, at least) is a clearer indication that we're shutting down the whole thing.

No such dilemma introduced. If you can call both close and cancel on something, then they have to mean different things. If you accept/agree that close means 'unschedule the whole thing', then cancel can only ever mean abort the current thing (potentially with interrupts). Doing both means calling both (in which order is again a different question). The matter of boolean params is kind of superficial I find. One could simply call clojure.core/future-cancel (or any other custom wrapper) and forget about the boolean param.

Your point about calling .close on the schedule object also applies to calling (.cancel schedule), I think? As in, both could legitimately mean 'unschedule the whole thing'. There's probably a name here that removes all doubt. Tbh, we don't have to introduce this in the first increment - it's an additive change, can always come after.

jarohen avatar Dec 30 '20 21:12 jarohen

Hi there,

Don't be alarmed by the out-of-order responses - I just think architectural decisions will greatly inform the implementation, so I'd like to start with those.

Your point about calling .close on the schedule object also applies to calling (.cancel schedule), I think? As in, both could legitimately mean 'unschedule the whole thing'. There's probably a name here that removes all doubt. Tbh, we don't have to introduce this in the first increment - it's an additive change, can always come after.

From an architectural stsandpoint, what we are trying to do here is to separate the (macroscopic) notion of a schedule from the (microscopic) notion of an individual task, and ultimately provide the end-user sane semantics for managing both (depending on his needs which only he knows). Looking at chime from the outside, it only exposes something Closeable. Exposing something also cancellable which is assigned clear semantics (and those communicated in the doc-string) is the cleanest way to achieve this separation, and in this case particularly convenient both for you (the lib author) and the end-user, simply because this is exactly what .schedule() returns (you can't possibly give any more control to the end user). To me (a non-native speaker), the naming also falls into place kind of perfectly. (close schedule) means 'gracefully unschedule', (cancel schedule) means 'cancel the next job (potentially abruptly as it may have started)', and (doto (close schedule) (cancel true)) means do both (kind of what chime tries to do today by default). But even if my sense of English is too liberal, you can always have two little wrappers named differently. The fact of the matter is that exposing the underlying ScheduledFuture firstly empowers end-users, and secondly maps perfectly to the this new notion we need - the next (potentially already started) job ).

We've fixed an issue previously (#10) around holding on to the head of the lazy seq - I wouldn't want to ask the user to do the same, especially if it's infinite. We have the tail of the sequence available, and could expose it - this would be particularly suitable if the user only wanted the next time, as it wouldn't require evaluating any more of the sequence.

While the issue with lazy-seqs and holding their head is definitely there, and chime users should always be vigilante and aware, I don't think it's worth fixating on it, as holding the head of an infinite sequence is a bad idea in general - not just in chime. It's up to the user to re-generate the sequence of times, rather than re-using it (hence keeping a reference to it), if he wanted this feature I described. This is common practice/thinking amongst Clojure developers - you simply don't hold the head of a lazy-seq. I can kind of see the danger, but at the same time I can also see a perfectly valid way of achieving maximum flexibility w/o adding/exposing anything new.

It's always correct, but only at the exact moment it's calculated - after that, it's out of date, and the user doesn't know the 'now' basis that was used in the calculation. If we pass the raw date back to the user instead, they're in charge of choosing a fixed 'now' to make that calculation.

You've lost me here, but that could be down to me claiming that .getDelay partly addresses issue 30, which in-turn may have led you down the wrong thinking path. Let me try again...the only thing that .getDelay offers is a way to reliably ask the question how long until the next chime?. It is reliable because it is asking the source (the actual ScheduledFuture). In that sense the only now basis that makes sense is when the method ends up being called. Similarly, it cannot be out-of-date. It is returned to whoever asked for it, presumably for him/her to use it immediately. Why, for instance, don't you worry that (.between ChronoUnit/MILLIS (now clock) time) is 'out-of-date' by the time .schedule() sees it?

Right - it's the 'job remains scheduled to run at t2' that got me here but I see you're covering that with the latch - fair. Might be worth shutting down the scheduler eagerly though, could be a while til the next run.

It may be a long time, but it may not be...what does 'long-time' mean in this context anyway? And how informative this ultimately is? From a pure scheduling perspective, I'd argue that the thing that matters is that there are no more chimes. From a pure resource-management perspective, in theory you're right - the sooner we can get rid of resources that are not ultimately going to be used, the better (especially when talking about current heavy-weight Thread objejcts). However, I refuse to let that 'nice-to-have' pollute my thoughts of correctness. In fact, having seen the latest talks on Loom, I strongly believe that we will eventually all move away from this obsession to reuse as few threads as we possibly need etc etc. In such a world, where Threads are so cheap that pooling them becomes an anti-pattern, this whole resource-management concern you alluded to becomes literaly a non-issue.

I'll come back to you with implementation ideas tomorrow...As to how this all interacts with core.async I'll admit that i haven't looked at that at all. I tend to run all tests after every single change i make, trusting that there is good (or at least reasonable) test coverage. So far so good...

jimpil avatar Dec 30 '20 23:12 jimpil

(close schedule) means 'gracefully unschedule', (cancel schedule) means 'cancel the next job (potentially abruptly as it may have started)', and (doto (close schedule) (cancel true)) means do both (kind of what chime tries to do today by default).

While that is one valid interpretation, there is enough ambiguity here that I don't want to use these names. Let's go with shutdown and shutdown-now, given their prior art in ExecutorService. Let's also maintain close as before, now as an alias to shutdown-now and document that it follows the same semantics.

You've lost me here, but that could be down to me claiming that .getDelay partly addresses issue 30, which in-turn may have led you down the wrong thinking path. Let me try again...

This again comes across dismissive and patronising, as if the only way I'm able to think is when I'm being guided by you. I've given you the benefit of the doubt on a number of occasions, but seriously, please make this the last time.

Similarly, it cannot be out-of-date. It is returned to whoever asked for it, presumably for him/her to use it immediately.

I'm suggesting to give the user the raw data, and letting them make calculations based on it should they wish. If you give the users the raw data, they're free to make the calculation immediately, or they're free to make it later.

In any event, this isn't a blocker for this issue - let's either move it to #30 as requested or drop it for now.

Why, for instance, don't you worry that (.between ChronoUnit/MILLIS (now clock) time) is 'out-of-date' by the time .schedule() sees it?

In this case, I am the user - and I am choosing to make this calculation as close as possible to when it's used. I'd prefer to pass an Instant or equivalent, but unfortunately that's not part of the ScheduledExecutorService API.

From a pure scheduling perspective, I'd argue that the thing that matters is that there are no more chimes.

Agreed.

From a pure resource-management perspective, in theory you're right - the sooner we can get rid of resources that are not ultimately going to be used, the better (especially when talking about current heavy-weight Thread objects). However, I refuse to let that 'nice-to-have' pollute my thoughts of correctness.

Are you suggesting that closing the thread pool when we know the schedule's shut down is somehow 'incorrect'? If not, assuming it doesn't introduce excessive complexity, why would we not clean up when we can?

In fact, having seen the latest talks on Loom, I strongly believe that we will eventually all move away from this obsession to reuse as few threads as we possibly need etc etc. In such a world, where Threads are so cheap that pooling them becomes an anti-pattern, this whole resource-management concern you alluded to becomes literally a non-issue.

I'm also looking forward to Loom, and agree that it'll change the way we all think. However, on the understanding that we're not precluding ourselves from moving that way in the future as and when it becomes widely available, I don't think it applies in our current context. Even when Loom does become widely available, we still have to consider that there'll be users of Chime who aren't in a position to make that migration for a while.

jarohen avatar Dec 31 '20 10:12 jarohen

While that is one valid interpretation, there is enough ambiguity here that I don't want to use these names. Let's go with shutdown and shutdown-now, given their prior art in ExecutorService. Let's also maintain close as before, now as an alias to shutdown-now and document that it follows the same semantics.

Imagine for a second that chime-at included something like the following in its doc-string:

Returns an object which:
1. Fully implements `AutoCloseable` reflecting the enduring/macroscopic aspect of the *entire schedule* 
2. Partly implements `ScheduledFuture` reflecting the ephemeral/microscopic aspect of the *next-task*
Closing the entire schedule leads towards a graceful shutdown of the underlying `ExecutorService` (i.e. unschedule), 
whereas cancelling the next-task allows for (potentially abrupt) interruption. 
Performing both is a perfectly valid option. 

Out of curiosity, would you still consider the returned object semantically ambiguous, after reading this? If so, that's fair enough - I don't have any more arguments on this.

This again comes across dismissive and patronising, as if the only way I'm able to think is when I'm being guided by you. I've given you the benefit of the doubt on a number of occasions, but please make this the last time.

I honestly don't see how this comes off as dismissive and/or patronising. I'm admitting that I didn't quite follow your comment, but also that it could be my own fault. I have no problem apologising, but not for something I don't even understand. In fact, I'm starting to feel that you're the one being overly defensive and somewhat dismissive. Communicating such deep/complex matters is not easy, and I'm making a big effort to to be both polite and informative here. If you aren't finding this useful or informative we can stop at any time - just let me know. After all we're both spending valuable time from our lives...

I'm suggesting to give the user the raw data, and letting them make calculations based on it should they wish. If you give the users the raw data, they're free to make the calculation immediately, or they're free to make it later.

Again, fair-enough if you think this is helpful. My point is that that raw-data the user provided in the first place. To say that you're going to expose what the user gave you, just sounds weird (to me).

In this case, I am the user - and I am choosing to make this calculation as close as possible to when it's used. I'd prefer to pass an Instant or equivalent, but unfortunately that's not part of the ScheduledExecutorService API.

Again out of curiosity, what makes you think that an end-user will have a dramatically different-use case (when calling .getDelay)?

Are you suggesting that closing the thread pool when we know the schedule's shut down is somehow 'incorrect'? If not, assuming it doesn't introduce excessive complexity, why would we not clean up when we can?

If there was a way to have a graceful shutdown AND immediate cleanup, that would be just great! However, that is part of the conundrum with shutdown VS shutdownNow, isn't it? What I'm suggesting is simply that I wouldn't give up the graceful shutdown, in order to have immediate-cleanup, as it is kind of a secondary concern for a scheduling construct. If you can come up with a reliable way of achieving both, I would be very excited to hear/review.

I'm also looking forward to Loom, and agree that it'll change the way we all think. However, on the understanding that we're not precluding ourselves from moving that way in the future as and when it becomes widely available, I don't think it applies in our current context. Even when Loom does become widely available, we still have to consider that there'll be users of Chime who aren't in a position to make that migration for a while.

Sure, but it does highlight that this notion of resource-management (for threads), is only ever going to decrease in importance. People who haven't made the transition, should of course be accommodated, but we're not talking about breaking anyone's program here. We're talking about leaving a couple of objects in memory. As I mentioned, if you can come with a solution for that, I'm looking forward to hear about it.

jimpil avatar Dec 31 '20 11:12 jimpil

Out of curiosity, would you still consider the returned object semantically ambiguous, after reading this? If so, that's fair enough - I don't have any more arguments on this.

Am I understanding correctly that you propose that we change close to be the graceful shutdown, so that cancel becomes the abrupt shutdown? If so, this would be a breaking change.

I agree that we need to make the doc-string absolutely semantically clear - I additionally think that unambiguous names would be preferable. close (cancel to a lesser degree) isn't unambiguous, and I don't want to leave the user needing to reason that 'there's a cancel, so close must behave like this' - I'd rather they were independently clear.

jarohen avatar Dec 31 '20 16:12 jarohen

Again, fair-enough if you think this is helpful. My point is that that raw-data the user provided in the first place. To say that you're going to expose what the user gave you, just sounds weird (to me).

This has the tradeoff of the user keeping hold of the either the original seq (or the seq generator, passing it around with the schedule.

Again out of curiosity, what makes you think that an end-user will have a dramatically different-use case (when calling .getDelay)?

I don't - I just think that passing back the delay means that users pretty much have to use it immediately; passing them the raw time gives them the option of using it immediately or later.

Maybe I'm missing some value here regarding re-purposing the ScheduledFuture interface. It seems that we're introducing a little mud by saying 'the returned instance reflects both the overall schedule and the next invocation'. Do we get benefits from coupling Chime to it, returning a value that implements it? Are there things that a user could do with the Chime schedule if only it implemented ScheduledFuture, say?

Similarly, if ScheduledFuture didn't have getDelay, would we be considering exposing that in that form?

jarohen avatar Dec 31 '20 16:12 jarohen