orleans icon indicating copy to clipboard operation
orleans copied to clipboard

[Proposal] Support long-running grain calls using a heartbeating mechanism

Open ReubenBond opened this issue 2 years ago • 16 comments

It's common for developers to want long-running grain calls, eg calls which routines run for 10s of seconds or an arbitrarily long time.

Currently, there are two ways to accomplish this, and both have significant detractors:

  1. Configure SiloMessagingOptions.ResponseTimeout and ClientMessagingOptions.ResponseTimeout to some a value large enough to cover the longest call. This has the downside that all calls have the same timeout and therefore they suffer the consequences (eg, deadlocks take longer to unwind, client calls to a failed silo may take much longer to fault, leaving the client waiting on a response which will never arrive).
  2. Use a polling/callbacks pattern: work is started in the background using one method call and the client either continuously checks for progress using an additional call (maybe a single method is used and invoked repeatedly) or registers a callback. These techniques add complexity to both the grain code and the client code.

Proposal

We introduce support for arbitrarily long-running calls by leveraging a heart-beating mechanism.

Mechanism

#6672 added functionality to send updates for long-running requests to callers. The original purpose was to aid in diagnostics for blocked calls (eg, synchronous blocking, slow database connection, request deadlocks). This mechanism can be used to extend the timeout for a request. It could be limited to messages which are allowed to be long-running (i.e., dependent on a policy)

Policy

We can consider the policy for long-running calls separately. Here are some options, which aren't all mutually exclusive:

  1. All calls are always allowed to be long-running
  2. A global configuration allows all calls to be long-running
  3. Interleaved/reentrant calls are allowed to be long-running
  4. Grains can request additional time to process a call programmatically. It would seem that RequestContext is the appropriate place to interact with such functionality, but that's TBD
  5. Grain interface methods marked with a [LongRunning] attribute are allowed to be long-running. In terms of implementation, this could result in an InvokeMethodOptions value being set which tells the caller to extend the timeout when a status update is received. This is likely the easiest conservative design

Considerations

  • How should this interact with message expiration? One approach would be to prevent expiry if a message has reached the invocation stage, i.e, do not drop responses for [LongRunning] messages which have expired. We could extend that to all messages (not just [LongRunning]) with little/no negative effect.
  • We do not currently have a general-purpose cancellation mechanism for calls. We have support for GrainCancellationToken, but that is not ideal: it is not pay-for-what-you use and currently requires scanning all arguments, which therefore boxes every argument in order to perform the type check. It was implemented with what was available at the time, but we have completely rewritten our RPC system since then and could add support for grain cancellation tokens. It seems that this proposal touches on the same area, since it's about timeouts (cancellation, essentially - would it be useful to cancel abandoned calls automatically?). Using RequestContext to flow a CancellationToken down the call chain seems appropriate. We could also support having a single CancellationToken argument in grain interface method signatures. That could then be flowed to the RequestContext, but it's obvious that there is room for gotchas, like when both an CancellationToken argument and a RequestContext.Canceled token is present, but the former isn't flowed by the developer down the chain and the latter is implicitly flowed. We should expand in another issue/proposal.
  • How should this work for [OneWay] calls? Currently, we do not register [OneWay] calls with the local callback collection, so they are entirely untracked by the caller (since timeouts are not possible). If there is a programmatic way to extend call duration, that information would be ignored by the caller. A similar issue exists, for cancellation: if cancellation is lazy (eg, waits until the next heartbeat) then there would be no rendezvous point.

Related issues

  • #4328

ReubenBond avatar Mar 25 '22 15:03 ReubenBond

I don't see any reference to stream message recipients execution duration limits. Currently we are using SMS, so obviously the grain call semantics apply. However we clearly need to be able to control the execution time limits in this use case as well.

Having control over the stream provider time execution would be too coarse, but that might be good enough.

amccool avatar Mar 25 '22 17:03 amccool

Great point, @amccool. Streams are conceptually separate from grain method calls, but they use calls under the hood. It seems to me that the grain extension interface which Orleans uses under the covers to make stream message delivery work should be marked as long-running.

This is likely true for Reminders, too. I can see an argument against marking Reminder (and perhaps Streams) methods as [LongRunning]: there is a chance that deadlocks can occur (depending on what the application code does), or that one long-running reminder would block another (perhaps short-running) reminder because the method/grain is not interleaving/reentrant. It might not be an issue, and it might be solvable (eg, using cycle detection for long-running calls, and allowing 'interleaving'/'non-interleaving' sections of app code within a method, eg via a using block)

ReubenBond avatar Mar 25 '22 17:03 ReubenBond

Regarding stream producers I am not sure I think that a consumer error (timeout or not) should be propagated back to the producer. The producer should only get delivery exceptions to the provider. AFAIR in SMS unless the fire/forget is marked the producer does get the timeoutexception.

amccool avatar Mar 25 '22 18:03 amccool

IMHO - let reminders be big and phat, nearly unbounded, unless defined by the user. The down side is the usual case of scheduling too many big operations and each one can not complete before the next one is scheduled.

amccool avatar Mar 25 '22 22:03 amccool

As a user, I vote for option 5), requiring the [LongRunning] attribute and keeping the default timeout for everything else.

I always sell the 30s timeout as Orleans enforcing a responsiveness guarantee for us, and I've seen benefits in the way it prods developers to think about how code should be structured so it's responsive - meaning either it completes within the timeout or it fails fast. So I think that behaviour should stay by default, as it's highly educational and the timeouts are useful to detect problems in the design.

When some methods must indeed be long running, I think we should be explicit about it, so we don't get things hanging unexpectedly when they should timeout as usual. It would also be nice to specify a custom timeout, maybe a static one on the attribute itself, or an expectation set by the caller somehow, e.g. via the RequestContext as stated. This would only work if the implementation is marked as [LongRunning].

It would be extra nice to have some other way to cancel long-running distributed workloads. The GrainCancellationTokens work just fine, but they look odd in user code as we often need to tie them with regular cancellation tokens.

JorgeCandeias avatar Mar 31 '22 17:03 JorgeCandeias

Is there a clear solution for this issue yet?

canitez avatar Jul 05 '22 15:07 canitez

We could also support having a single CancellationToken argument in grain interface method signatures.

I think this makes sense - i.e., to use the standard .NET async method cancellation pattern. Developers should understand that async methods need to flow cancellation tokens down to all invoked async methods.

I think a global configuration that specifies all calls to be long-running makes sense. Ultimately, that is setting the timeout for all grain method calls to be infinite. I think a [Timeout] attribute applied to methods also makes sense (to override the global timeout setting), where a parameter specifies the timeout and also allows an infinite timeout to be specified.

How are grains notified when the silo containing the grains is being stopped (i.e., Silo.StopAsync has been invoked)? Long-running in-flight grain operations arguably need to be informed when their containing silo is being stopped so those long-running operations can be stopped gracefully. But there also needs to be to immediately cancel all in-flight grain operations when the cancellation token passed to Silo.StopAsync(CancellationToken) is cancelled.

As such, there would arguably be benefit in making two cancellation tokens available to grain methods - one triggered when Silo.StopAsync is invoked, and the other triggered when the cancellation token passed to Silo.StopAsync(CancellationToken) is triggered (or if/when the Silo.StopAsync operation times out).

Perhaps the cancellation token passed as a grain method parameter should cover the latter case (i.e., immediately cancels the method) and the former case (when Silo.StopAsync is invoked) be covered with a cancellation token made available via dependency injection? This cancellation token triggered when Silo.StopAsync is invoked would only be needed for long-running grain operations.

bill-poole avatar Jul 06 '22 15:07 bill-poole

I think a global configuration that specifies all calls to be long-running makes sense. Ultimately, that is setting the timeout for all grain method calls to be infinite.

Infinite timeouts across the board could lead to unexpected behavior, eg calls which hang without any indication.

Opting in to arbitrarily long-running calls only when the method has a CancellationToken parameter is the right approach, I believe. It's intuitive enough and pushes developers into a good practice. It can be verbalized as "The default cancellation for every call is the global ResponseTimeout value, but you can override that by providing your own CancellationToken"

How are grains notified when the silo containing the grains is being stopped (i.e., Silo.StopAsync has been invoked)?

DeactivateAsync, which now (in 4.x) includes a reason code. For example, DeactivationReasonCode.ShuttingDown in the case of app shutdown.

ReubenBond avatar Jul 06 '22 17:07 ReubenBond

Infinite timeouts across the board could lead to unexpected behavior, eg calls which hang without any indication.

Yes good point. It would be bad to eliminate the timeout safety net for resolving deadlocks caused by cycles on non-interleaved/re-entrant grains on a global basis.

Opting in to arbitrarily long-running calls only when the method has a CancellationToken parameter is the right approach, I believe. It's intuitive enough and pushes developers into a good practice. It can be verbalized as "The default cancellation for every call is the global ResponseTimeout value, but you can override that by providing your own CancellationToken".

Presumably you mean you can override that by providing your own CancellationToken and by decorating the method with an attribute that overrides the configured default/global timeout behaviour? I would imagine that the default behaviour for a method with a CancellationToken parameter would be for that token to be triggered if the timeout expires. i.e., you'd need to provide a CancellationToken parameter and override the default timeout behaviour for a method to be "long-running".

Perhaps a [Timeout] attribute could be applied to a grain method to specify the desired timeout for that method? i.e., an infinite timeout could be specified on a per-method basis, not as a global default. i.e., [Timeout(TimeSpan.InfiniteTimeSpan)] would have the same semantics as the [LongRunning] attribute proposed above.

DeactivateAsync, which now (in 4.x) includes a reason code. For example, DeactivationReasonCode.ShuttingDown in the case of app shutdown.

Excellent. I assume you mean OnDeactivateAsync? Should an override of this method be provided that accepts a CancellationToken? i.e., should a timeout be configurable for deactivating a grain? If Silo.StopAsync(CancellationToken) is invoked, then should the CancellationToken not flow to the OnDeactivateAsync method?

bill-poole avatar Jul 06 '22 18:07 bill-poole

Presumably you mean you can override that by providing your own CancellationToken and by decorating the method with an attribute that overrides the configured default/global timeout behaviour? I would imagine that the default behaviour for a method with a CancellationToken parameter would be for that token to be triggered if the timeout expires. i.e., you'd need to provide a CancellationToken parameter and override the default timeout behaviour for a method to be "long-running".

I mean just having a cancellation token parameter. I think [LongRunning] (etc) shouldn't be necessary in that case.

Excellent. I assume you mean OnDeactivateAsync? Should an override of this method be provided that accepts a CancellationToken?

Yes, it does accept a CancellationToken. It cancels after a configurable timeout, but we could also cancel it once the StopAsync cancellation token is canceled. Feel free to open an issue to discuss/track that, since it's not related to this issue.

Similarly, for [Timeout(TimeSpan)], that should be discussed on a new issue, linking back to https://github.com/dotnet/orleans/issues/4328

ReubenBond avatar Jul 06 '22 18:07 ReubenBond

I mean just having a cancellation token parameter. I think [LongRunning] (etc) shouldn't be necessary in that case.

Personally, I would expect the CancellationToken passed as a parameter to be triggered if/when the timeout expires. I would find it counterintuitive for the presence of a CancellationToken parameter to override the default timeout behaviour to be infinite. Apologies if I have misunderstood your proposal.

bill-poole avatar Jul 06 '22 18:07 bill-poole

What you're suggesting sounds fine to me, too.

ReubenBond avatar Jul 06 '22 18:07 ReubenBond

We've moved this issue to the Backlog. This means that it is not going to be worked on for the coming release. We review items in the backlog at the end of each milestone/release and depending on the team's priority we may reconsider this issue for the following milestone.

ghost avatar Jul 28 '22 23:07 ghost

Note that gRPC provides a heartbeating mechanism. This would be another benefit of Orleans using gRPC for all grain-to-grain communication (using whatever serializer is configured - it doesn't have to be Protobuf) as I proposed here.

bill-poole avatar Jul 29 '22 15:07 bill-poole

I like [LongRunning] and leaving everything the same.

I'd like to see an overload for [LongRunning(Timeout="00:02:00"] to allow for setting timeout on individual grain method.

Also, a mechanism to set how often a heartbeat should happen before it's killed: [LongRunning(Timeout="01:00:00", HeartbeatInterval="00:00:30" ].

ElanHasson avatar Aug 12 '22 20:08 ElanHasson

Manually implementing polling is tricky and error-prone. Polling introduces an (arbitrarily chosen) heartbeat delay, which reduces responsiveness. Depending on the chosen delay and the actual elapsed time of the method, polling also introduces overhead which may be non-negligible. In high traffic scenarios, it might turn detrimental.

Extending the timeout for individual methods (e.g. [LongRunning]) is a small improvement, but the overridden timeout is still an arbitrary choice, which I don't like.

I used to prefer the idea of a grain callback, and had this implemented straight-forwardly in a grain-to-grain scenario. However, I now find I need to support client-to-grain calls as well, where a callback seems impossible (?).

My counter-proposal is to add support for IAsyncEnumerable<T> as the grain method result type, and leverage this for long-running methods. (#6504 #7830)

A long-running grain method may then periodically (at, say, half the timeout span) yield return a T which indicates that processing is still running, and, importantly, resets the timeout timer. The final T is the end result. An exception must be propagated naturally. (The interpretation of T is part of the grain contract - it can even include progress information or partial results.)

Supporting IAsyncEnumerable<T> obviously also enables a host of other lightweight streaming scenarios (like GRPC, and without the need for an expensive streaming backend).

ugumba avatar Jun 23 '23 08:06 ugumba