grpc icon indicating copy to clipboard operation
grpc copied to clipboard

RPC timeout set via service config is off by ~1s

Open justincely opened this issue 2 years ago • 16 comments

What version of gRPC and what language are you using?

grpcio==1.54.2

What operating system (Linux, Windows,...) and version?

OSX Ventura 13.6

What runtime / compiler are you using (e.g. python version or version of gcc)

Python 3.11.4

What did you do?

tl;dr I find that setting an RPC via service config needs ~1 full second higher than the client/server will actually take. Happens across a variety of timings, and ONLY applies to the service config and NOT to setting a timeout at the actual RPC call.

Started a simple gRPC server that sleeps for a small time. Then perform a client call with a timeout in the service config for 1 second more than the configured sleep, which fails unexpectedly and in a shorter time than the timeout.

The below example should sleep for .2s, should timeout in 1.2s, but the client returns a DEADLINE_EXCEEDED in less than .2s.

Server

from concurrent import futures
import time

import grpc
import greeting_pb2
import greeting_pb2_grpc


class Greeter(greeting_pb2_grpc.GreeterServicer):
    def greet(self, request, context):
        print("Got request " + str(request))
        time.sleep(0.2)
        return greeting_pb2.ServerOutput(message="{0} {1}!".format(request.greeting, request.name))


def server():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=2))
    greeting_pb2_grpc.add_GreeterServicer_to_server(Greeter(), server)
    server.add_insecure_port("[::]:50051")
    print("gRPC starting")
    server.start()
    server.wait_for_termination()


server()

Client

import time
import grpc
import greeting_pb2
import greeting_pb2_grpc

options = [("grpc.service_config", '{"methodConfig": [{"timeout": "1.2s", "name": [{}]}] }')]

channel = grpc.insecure_channel(
    target="dns:///localhost:50051",
    options=options,
)

stub = greeting_pb2_grpc.GreeterStub(channel)
start = time.perf_counter()
try:
    response = stub.greet(greeting_pb2.ClientInput(name="Me", greeting="Yo"))
except Exception as e:
    response = e
print(response, " in ", time.perf_counter() - start)

What did you expect to see?

I expect this RPC to succeed.

What did you see instead?

This RPC fails in a shorter time than the timeout. For the example above it fails at 0.14ish seconds rather than 1.2ish seconds.

<_InactiveRpcError of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "UNKNOWN:Deadline Exceeded {grpc_status:4, created_time:"2023-09-29T11:00:52.966757-05:00"}"
>  in  0.14907191699603572

This will then succeed if I change the timeout to 1.3 seconds, and this pattern is repeatable for higher values. e.g. a server that sleeps for 3 seconds needs a serviceConfig timeout of ~4.1 seconds to succeed.

However, using the timeout in the RPC call instead seems fine. It doesn't seem to have the same off-by-one issue that I find in the service config.

Anything else we should know about your project / environment?

justincely avatar Sep 29 '23 16:09 justincely

Bump!

justincely avatar Oct 04 '23 15:10 justincely

This looks interesting, Protobuf serialization and deserialization might add some overhead on server handler but 1s sounds too much, can you turn on additional logging using those env vars: GRPC_VERBOSITY=DEBUG GRPC_TRACE=all,-timer,-timer_check?

XuanWang-Amos avatar Oct 04 '23 20:10 XuanWang-Amos

@XuanWang-Amos sure thing - that's pretty verbose even from this little call though - any particular way I can distribute it easy here?

And do you want the server side, client side, or both?

justincely avatar Oct 05 '23 17:10 justincely

Of note though from my reading: from the client side the very first log line is

D1005 13:33:05.331750000 8062492800 config.cc:175]                     gRPC EXPERIMENT tcp_frame_size_tuning               OFF (default:OFF)

and the first indication of CANCELLED is

I1005 13:33:05.486479000 6131757056 dynamic_filters.cc:81]             OP[retry_filter:0x140830860]:  CANCEL:UNKNOWN:Deadline Exceeded {created_time:"2023-10-05T13:33:05.484865-05:00", grpc_status:4}

So it gets to CANCELLED within 0.15s of the very first log line even when the service config timeout is 1.2s

justincely avatar Oct 05 '23 18:10 justincely

@justincely Just the client side should be sufficient.

that's pretty verbose even from this little call though - any particular way I can distribute it easy here?

I believe Github will let you do file uploads if you drag-and-drop the text file into a comment textbox. Otherwise, pastebin should work just fine. What we're really after is whether the gRPC library truly believes that it's been configured with 1.2s or not.

gnossen avatar Oct 05 '23 18:10 gnossen

Ahh, looks like you're right! LMK if you need anything else:

client.txt

justincely avatar Oct 05 '23 19:10 justincely

@yashykt Can you please take a look? Is this expected to work? In particular, is an empty list of methods considered wildcard and matches all methods? Or is the issue here simply that the retry isn't applied to any methods?

gnossen avatar Oct 11 '23 20:10 gnossen

Thanks for the eyes. FWIW the retry policy does work configured like this (as in, i can see retries happening and timeouts increasing or decreasing), and the empty list of methods is supposed to work this way according to the proto: https://github.com/grpc/grpc-proto/blob/master/grpc/service_config/service_config.proto#L58.

justincely avatar Oct 13 '23 16:10 justincely

@yashykt any updates? any more information I can provide?

justincely avatar Oct 19 '23 17:10 justincely

Following for updates, have the same issue as well.

svorcan-rho avatar Oct 26 '23 09:10 svorcan-rho

@yashykt PTAL.

XuanWang-Amos avatar Jan 10 '24 18:01 XuanWang-Amos

This finally rose up in my priority list. Sorry for the late reply! This needs some debugging on my part.

In particular, is an empty list of methods considered wildcard and matches all methods?

"name": [{}] is not an empty list. It has exactly one element with an empty service and method name. Since this is the only element, this will apply to everything.

yashykt avatar Mar 12 '24 23:03 yashykt

Thanks for reporting this! Much appreciated!

I've got a fix out in #36136

yashykt avatar Mar 16 '24 00:03 yashykt

Any updates?

makhalin avatar Oct 09 '24 15:10 makhalin

Bump

LuckySting avatar Oct 09 '24 15:10 LuckySting

@yashykt PTAL

sourabhsinghs avatar Oct 22 '24 13:10 sourabhsinghs

Sorry this fell off my radar. That PR needs additional work. If someone has some time, please feel free to pick it up.

yashykt avatar Jan 30 '25 22:01 yashykt

Any updates on this issue? Have been experienceing this too.

sarthak77 avatar Mar 18 '25 04:03 sarthak77