`resume`/`callback` in `cont`, `async`, etc. should return a `Boolean`
This Boolean should be similar to the Boolean returned when completing a Deferred i.e. it should indicate that this invocation of the resume/callback is the one and only one that actually completed the callback. Everyone else should get false.
This information about whether the callback succeeded would help avoid leaks in various concurrent state machines. Here are two that I encountered recently:
- https://github.com/typelevel/cats-effect/pull/3465#discussion_r1133340355
- https://github.com/typelevel/cats-effect/pull/3549#issuecomment-1519335094
I think this is a good idea. I suspect it would('ve) make fixing #3603 almost trivial. (I assume the semantics would be that if a cancel won, the callback would return false.)
Something worth thinking about is how would this affect Deferred#complete: it already returns a bool, but that means a different thing.
To answer my own question: I don't think this should affect Deferred#complete. That is a similar, but different thing. With a Deferred, there is 0 or more subscribers, and it is "normal" not to wake up some of them. With async/cont/etc., there is one "subscriber": the fiber itself; and it is important whether we woke it up.
I've started working on this (#3917), and the desired semantics are not entirely clear.
What should resume return (true or false), if resume wins the race? get did not run (probably yet), and resume sets the result. This seems like it should be true, but I think nothing stops the fiber from being cancelled right after this. So it's not really true or false, but "we don't know (yet)". Waiting in resume is obviously not an option. (Also note, that get may not even be sequenced, so it may never run.)
Any ideas?
but I think nothing stops the fiber from being cancelled right after this
Really? I would say that's a bug. If resume completes with a value (and returns true), then get must resume with that value. If you can cancel between these steps, then you can lose values.
But it depends what you mean by "right after this". If you mean steps after the get, that is fine.
Well, how about something like this:
new Cont[IO, Int, String] {
def apply[F[_]](implicit F: Cancelable[F]) = { (resume, get, lift) =>
lift(IO(resume(Right(42)))) >> F.canceled >> get.map(_.toString)
}
}
With the current implementation this self-cancels (I think this is expected). But I think it could be cancelled at any of the >>s also. Which also seems "fine" (if a bit strange).
I think the point is, that currently when resume executes, it may have no way of knowing what will happen with the value it writes. E.g., because get did not run yet, or in fact it may never will run.
Ok, thanks, I see your point now.
I think the point is, that currently when
resumeexecutes, it may have no way of knowing what will happen with the value it writes.
I think semantically the true means that if get runs, then this will be the value it returns.
I'm not sure that helps with the "lost value" problem. For example, let's see asyncCheckAttempt (async is of course similar):
def apply[G[_]](implicit G: MonadCancel[G, Throwable]) = { (resume, get, lift) =>
G.uncancelable { poll =>
lift(k(resume)) flatMap {
case Right(a) => G.pure(a)
case Left(Some(fin)) => G.onCancel(poll(/*HERE*/get), lift(fin))
case Left(None) => get
}
}
}
Let's say, that resume runs first, writes the result, gets a true (if get runs, it will return that result). But then the fiber gets cancelled right before get is sequenced (marked with HERE in the code above)... and the result is lost. (But whoever called resume thinks it isn't lost.)
Arguably, this is a "bug" in asyncCheckAttempt. (Well, not in the current version, but in the version where cb returns a Boolean.) I'm still thinking how to fix this bug, but I'm also not sure the bug is really in asyncCheckAttempt...
Hmm, thanks for writing that out. Actually if I understand correctly, this seems to be a very general problem (not specific to asyncCheckAttempt).
- to avoid leaking values, we must be able to guarantee that
getis sequenced - to support cancelation,
getmust be cancelable
Currently there is no way to have a cancelable get that is guaranteed to be sequenced 🤔 I'm not even sure if such a thing is possible, without a tryGet and a combinator like onCancelMasked from https://github.com/typelevel/cats-effect/issues/3474#issuecomment-1509872654 to "defer" a cancelation signal to avoid leaking values.
dumb-idea-of-the-day
poll(/*HERE*/get)
what if we just disallowed cancelation there (i.e. change the implementation of uncancelable/poll)?
this seems to be a very general problem (not specific to asyncCheckAttempt).
Yeah. Although, I suspect that if we can solve it for asyncCheckAttempt, we can probably solve it for others too. The typical usages of cont seem to be quite similar, and asyncCheckAttempt seems to be the most general one.
Also worth noting, that this whole problem only occurs, if resume wins the race. If get runs first and suspends the fiber, that's fine. If the fiber is cancelled before resume runs, I think that's also fine (there is a cancellation check in resume).
what if we just disallowed cancelation there (i.e. change the implementation of uncancelable/poll)?
Yeah, I thought about that, but that has... consequences. If we look at this:
poll(/*1*/ x /*2*/)
The point 2 is already not a cancellation point. If we also make 1 not a cancellation point, that makes poll a no-op in certain cases (e.g., if x is pure, or a single delay, or something like that). This might not be a deal breaker, but I'm not sure.
that makes
polla no-op in certain cases (e.g., ifxispure, or a singledelay, or something like that). This might not be a deal breaker, but I'm not sure.
Yeah, honestly that seems fine to me 🤷 but I won't claim to foresee all the consequences :)
Let's say, that resume runs first, writes the result, gets a true (if get runs, it will return that result). But then the fiber gets cancelled right before get is sequenced (marked with HERE in the code above)... and the result is lost. (But whoever called resume thinks it isn't lost.)
Bolded the relevant bit here.
So in this scenario, whoever called resume definitely knows the value is lost, because their finalizer is getting called. Consider this from the user side. I'll use async instead since it's a simpler signature:
IO.async(cb => IO(/* things that call cb */ Some(IO(/* finalizer */))))
Okay so if the code at /* finalizer */ runs in the above, we know definitively that any value (or error) which had been passed to cb is lost. At this point, we're responsible for sorting that out, either by recovering it somehow or getting an upstream to retry, etc. Whatever it takes really.
In some cases, this isn't reliably possible without sacrificing other system properties, so instead what often happens is people treat this as a nondeterminism (even though it isn't one) and dodge the recovery by instantiating async to Unit and solely using it as a latching mechanism (see: the new Queue implementations, for example). This avoids having to create weird recovery protocols.
But overall, this is something that users can cleanly respond to, unlike the cancelable gap at the end of poll regions, to which they have no response.
whoever called resume definitely knows the value is lost, because their finalizer is getting called
That is a very good point. (It would be "nicer", if the return value could give this information directly, but I'm beginning to think that might not be possible.)
Now the question is: could we build something (useful) with both the Boolean and the finalizer, which we couldn't already build with just the finalizer? (Because if not, then this issue is not really needed.)
So, about never cancelling (observing cancellation?) right inside poll, i.e.,
poll(/* we never cancel HERE */ x)
It is possible. In my draft PR (https://github.com/typelevel/cats-effect/pull/3917), I did it (not in a very elegant way, as it's just an experiment for now). It seems to work. A few things needed fixing (notably in Resource), but nothing too terrible.
Edit: of course, this is yet another violation of the functor laws I think. The existing one is for poll(x /* HERE */), so it's not surprising.