grpc-rs

Client Stream of a streaming server can't be Boxed

Open illicitonion opened this issue 7 years ago • 9 comments

I've put a repro at https://github.com/illicitonion/grpc-rs-repro-0 which uses the proto from https://github.com/googleapis/googleapis/blob/master/google/bytestream/bytestream.proto

The call to call_sync, which doesn't move the stream out of the function, reliably works.

The call to call_future, which Boxes the stream, either gives me:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: "RpcFailure(RpcStatus { status: Unknown, details: Some(\"Failed to create subchannel\") })"', src/libcore/result.rs:860:4
stack backtrace:
   0: std::sys::imp::backtrace::tracing::imp::unwind_backtrace
   1: std::panicking::default_hook::{{closure}}
   2: std::panicking::default_hook
   3: std::panicking::rust_panic_with_hook
   4: std::panicking::begin_panic_new
   5: std::panicking::begin_panic_fmt
   6: rust_begin_unwind
   7: core::panicking::panic_fmt
   8: core::result::unwrap_failed
   9: <core::result::Result<T, E>>::unwrap
  10: grpcrepro::main
  11: __rust_maybe_catch_panic
  12: std::rt::lang_start
  13: main

or

E1201 22:37:55.116403000 140735833875264 call.c:1731]                  assertion failed: grpc_cq_begin_op(call->cq, notify_tag)
Abort trap: 6

I believe both should be safe, and it's possible that the Stream being produced by the client isn't properly owning all of the state it needs.
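
For reference, the two call paths in the repro look roughly like this (a hypothetical sketch; ByteStreamClient, ReadRequest and ReadResponse stand for the stubs generated from bytestream.proto, and the exact generated names may differ):

use std::sync::Arc;
use futures::{Future, Stream};
use grpcio::{ChannelBuilder, EnvBuilder};

// Works: the client (and the Channel/Environment behind it) stays alive for
// as long as the stream is being consumed.
fn call_sync(addr: &str) {
    let env = Arc::new(EnvBuilder::new().build());
    let client = ByteStreamClient::new(ChannelBuilder::new(env).connect(addr));
    let responses = client
        .read(&ReadRequest::default())
        .expect("start call")
        .collect()
        .wait()
        .expect("consume stream");
    println!("got {} responses", responses.len());
}

// Fails: the boxed stream outlives the client, the channel and the environment
// created inside this function, so polling it later hits a dead channel.
fn call_future(addr: &str) -> Box<dyn Stream<Item = ReadResponse, Error = grpcio::Error>> {
    let env = Arc::new(EnvBuilder::new().build());
    let client = ByteStreamClient::new(ChannelBuilder::new(env).connect(addr));
    Box::new(client.read(&ReadRequest::default()).expect("start call"))
}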

illicitonion avatar Dec 01 '17 22:12 illicitonion

That's because the channel gets dropped early. But it should return an error instead of dumping core.

BusyJay avatar Dec 02 '17 06:12 BusyJay

If the Call output by Channel::create_call relies on the underlying Channel not being dropped, shouldn't Call either take ownership of the underlying Channel (or an Arc pointing to it), or keep a reference to it (forcing you to manage the lifetime of Channel properly)?
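
Something along these lines, as a hypothetical sketch with placeholder types (not grpcio's actual internals):

// Placeholder stand-in for grpcio's real Channel.
#[derive(Clone)]
struct Channel;

struct Call {
    // Owning a clone of the Channel means the channel, and anything it keeps
    // alive in turn, cannot be dropped while the call is still in flight.
    _channel: Channel,
    // ... plus whatever raw call state the real type wraps
}

impl Channel {
    fn create_call(&self) -> Call {
        Call { _channel: self.clone() }
    }
}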

illicitonion avatar Dec 02 '17 09:12 illicitonion

Thanks for the suggestion. But the channel may become invalid even if a reference is kept; a reference can only ensure it isn't dropped by the wrapper. And there are actually two kinds of calls here: client calls and server calls. For a client call we can get the channel easily, but for a server call there is no API to obtain the underlying channel yet. So I think it's better to just return an error stating what happened.

Besides, when the env is dropped, all the completion queues are shut down too, so keeping a reference to the channel still doesn't help.

We may need to explain the whole mechanism in our documentation. /cc #114

And I believe the core dump should be fixed by #125.
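
Concretely, once the abort is turned into an error, polling a stream whose channel or env is already gone would surface it roughly like this (a sketch; receiver stands for a streaming-call receiver such as the one in the repro):

use futures::Stream;

// Blocking iteration over the stream. If the channel or env behind it has
// already been dropped, the item is an Err rather than a process abort.
for item in receiver.wait() {
    match item {
        Ok(resp) => println!("response: {:?}", resp),
        Err(grpcio::Error::RpcFailure(status)) => {
            eprintln!("call failed, likely because its channel is gone: {:?}", status);
            break;
        }
        Err(e) => {
            eprintln!("other grpc error: {:?}", e);
            break;
        }
    }
}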

BusyJay avatar Dec 05 '17 14:12 BusyJay

Besides, when the env is dropped, all the completion queues are shut down too, so keeping a reference to the channel still doesn't help.

Then it sounds like a clone of the env (or at least an Arc reference to it) should be held in the Stream implementation?

While it's not a memory safety issue, it's definitely a deviation from the implicit contract of Streams, and represents something that should probably be fixed rather than documented.

stuhood avatar Dec 05 '17 19:12 stuhood

But the channel may become invalid even if a reference is kept; a reference can only ensure it isn't dropped by the wrapper. For a client call we can get the channel easily, but for a server call there is no API to obtain the underlying channel yet.

On the server side, this can happen in a wide variety of contexts. However, it is unlikely that the server/environment/whatever gets dropped in the middle of a call on the server side, so covering the client side would cover the large majority of cases where this is a nuisance.

Currently, a simple call such as:

let answers_client = answers_grpc::AnswersClient::open_with_defaults();
let answers_stream = answers_client
        .as_publish_opt(
            &AnswersRequest::default(),
            call_config(),
        )
        .expect("Could not get answers");

succeeds, while this:

let answers_stream = answers_grpc::AnswersClient::open_with_defaults()
        .as_publish_opt(
            &AnswersRequest::default(),
            call_config(),
        )
        .expect("Could not get answers");

doesn't, and since no lifetimes are involved, this is a terrible trap.

I also feel like Call should hold the Arc on the ChannelInner when built through Channel::create_call().

Another alternative that avoids the issue with the Call being different on the server side and on the client side would be to hold the Arc in the ClientXXXReceivers.
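
Roughly, as another hypothetical sketch with placeholder types (ClientSStreamReceiver below is a stand-in for the generated client-side receiver types, not grpcio's actual definition):

#[derive(Clone)]
struct Channel; // placeholder for grpcio::Channel

struct ClientSStreamReceiver<S> {
    inner: S,          // the underlying stream of decoded messages
    _channel: Channel, // kept only so the channel outlives anyone polling this
}

impl<S> ClientSStreamReceiver<S> {
    fn new(inner: S, channel: &Channel) -> Self {
        ClientSStreamReceiver { inner, _channel: channel.clone() }
    }
}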

Ten0 avatar Oct 28 '19 18:10 Ten0

This PR implements the latter by holding the Channel.

Ten0 avatar Oct 29 '19 08:10 Ten0

Holding the Channel inside requests can lead to a resource leak. For example, a long-running combinator of futures can prevent the Receiver or Sender from being dropped, which keeps the channel from being released.
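
For instance, something like this (a sketch; receiver stands for a client-side receiver that would own the Channel under the proposed change):

use futures::{Future, Stream};

// Race the receiver's next item against a future that never completes. As long
// as this combinator is held anywhere, receiver is not dropped, and a Channel
// owned by it could not be released either.
let long_lived = receiver
    .into_future()
    .select2(futures::future::empty::<(), ()>());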

The provided example is not a trap if the contract is fully understood. Not every lifetime dependency should be expressed as a Rust lifetime. For example, when using a thread pool to spawn a task, no one expects the task to still succeed after the pool is dropped; it's considered a valid case for the future to resolve as a failure.

use futures::{future::lazy, Future};
// The pool is a temporary dropped right after spawn_handle returns, so the task may never get to run.
let f = tokio_threadpool::ThreadPool::new().spawn_handle(lazy(|| Ok::<_, ()>(42)));
f.wait().expect("could not finished.");

BusyJay avatar Oct 31 '19 05:10 BusyJay

To be honest, I'm not sure I would necessarily notice there's a problem if I saw that tokio example somewhere. ^^' And that's even though the need to call expect("could not finished.") exists for the single reason that futures may end up in a state where they will never be executed, which undeniably helps in suspecting that behaviour.

In our case, it's pretty clear that the call can fail if there's a networking issue, if it is cancelled by the server, etc., which already justifies the presence of grpcio::Error values on the stream, so the presence of this error type does not force you to think "hey, you need to make sure by yourself that you don't drop the channel" the way wait() returning a Result does in tokio.

Also, in the case of tokio, I could imagine a scenario where one would think "I'm done with this whole set of futures, let's close its thread pool". I can't think of a similar scenario where you're in the middle of a bunch of gRPC calls and would want to kill the channel, and even less so one where having something like channel.close() wouldn't solve it.

Of course, "if the contract is fully understood", well, you know. The point is that several of us failed to understand this contract at first sight, so maybe we could make it easier to understand, whether by changing it or not.

Finally, I don't understand the resource-leak case you are talking about in the first paragraph. If the combinator of futures isn't meant to ever resolve, why is it spawned or kept around in the first place? Can you give a slightly more detailed example?

Ten0 avatar Oct 31 '19 20:10 Ten0

Users can close the channel for a quick shutdown or for whatever else is suitable. The documentation needs to explain the mechanism clearly, as I suggested.

BusyJay avatar Nov 01 '19 10:11 BusyJay