tonic icon indicating copy to clipboard operation
tonic copied to clipboard

Deadline propagation

Open dfawley opened this issue 6 months ago • 5 comments

One important feature of gRPC is automatic (or hard-to-forget) propagation of deadlines between incoming calls and outgoing calls. All core gRPC languages have something in place to satisfy this requirement. As an example, in Go, a stdlib context with a timeout is provided to the server's method handler, which is then passed as a required parameter when making any outgoing calls.

Tonic's API, while ergonomic for basic use cases, unfortunately makes it easy to forget to set a deadline. This happens for two reasons:

  1. The Request::new constructor doesn't require or even accept a deadline. It must be set as a separate operation.
  2. In the proto generated code, RPC operations accept an IntoRequest (or IntoStreamingRequest) which are blanked implemented for any message type. This allows calls to accept messages directly, so even if the Request::new constructor accepted a deadline, it could be missed because of this shortcut.

I would recommend something like the following:

  1. Request::new is replaced by two new constructors:
  • One constructor accepts a tonic::Request<_> from which it will retrieve the deadline and propagate it to the new request.
  • One constructor is named in a way to indicate that the outgoing request is not done in service of an incoming request. This should potentially include a deadline: Option<time::Instant> or timeout: Option<time::Duration> parameter. (Aside: we should make sure naming is consistent everywhere: a "deadline" is a point in time, and a "timeout" is a duration in the future.)
  1. IntoRequest<T> is changed back to Request<T> in the proto generated code.

This would be a breaking change that is likely to be unpopular with some (many?) users. The latter change might justify maintaining a new major version of codegen unless we are willing to break users for this change. Looking at the tonic repo, however, I don't see any examples that take advantage of the IntoRequest functionality, even though most could do so -- they all manually call Request::new().

dfawley avatar Jun 27 '25 18:06 dfawley

quick drive by thought:

We could use something like https://docs.rs/tokio/latest/tokio/macro.task_local.html to implicitly set a deadline. We can even have a free function that would allow the task local to be unset to not propagate it to make it opt-out. How this would work is like so:

  1. the server sets the task local before calling the handler
  2. inside the handler any tonic client can pick up this task local and copy the value
  3. to disable the timeout would have a type like Option<Timeout> and the free function would just unset this option.

@dfawley Thoughts?

LucioFranco avatar Jun 27 '25 21:06 LucioFranco

We've considered task local storage before for this kind of thing in the past. That makes propagation implicit, which not quite as good as being explicit IMO. It does match how Java works, so there is precedent, but the gRPC-Java TL has mentioned to me in the past that many users are unhappy with this behavior, and go to large lengths to work around it.

This particular implementation probably doesn't work for us, since it requires Tokio rt feature, which we ultimately would like to avoid. So it would either need to be in our runtime trait, or we'd need to find something more standalone / roll our own.

Can a task local like this be automatically propagated to child tasks somehow? I assume that's impossible, which would be another downside: users would have to manually transfer it if they are spawning new tasks, which is error-prone.

I'm not strictly opposed to the idea, but it does seem there are some downsides. My initial reaction is that it's probably preferable to fork the codegen than to use task locals.

dfawley avatar Jun 30 '25 16:06 dfawley

After discussing this in some detail with several people, it does seem like a task local is the preferred approach.

  1. If propagation is explicit, that requires plumbing the request type (or at least its deadline) through multiple layers if a library you are using wishes to make an outgoing call.

  2. It is probably more likely to call functions that call functions that call functions that make RPCs than it is to spawn new tasks that make RPCs.

  3. We can mitigate the difficulty of propagating the deadline to child tasks by providing a trivial wrapper that users can call on their futures when spawning new tasks. There is no corollary for helping users propagate the deadline explicitly.

Optionally, we could make our runtime trait do the propagation in (3) automatically by wrapping the user's runtime in one that simply adds that behavior when spawn is called. And we could pass that runtime to server handlers.

@LucioFranco does all of this SGTY?

dfawley avatar Jul 10 '25 20:07 dfawley

The task local approach might suffer from context leak between requests in some cases, like for example if first request creates a common task to create downstream request (because the task has access to some open resource) and then all the requests use messages between that task to create downstream requests. In this case the context from first requests leaks into all the requests.

As the context passing is implicit, it would be hard to debug as there can be multiple level of tasks

ankurmittal avatar Sep 16 '25 18:09 ankurmittal

like for example if first request creates a common task to create downstream request (because the task has access to some open resource) and then all the requests use messages between that task to create downstream requests

Outbound requests do not create deadlines in contexts. Only inbound requests do that. There is no danger of confusing multiple inbound requests, as those are all separate function calls in new tasks.

Implicit is harder to debug, because it's magic, but I think it's better than the alternatives in this case.

dfawley avatar Sep 16 '25 18:09 dfawley