RPC timeout set via service config is off by ~1s
What version of gRPC and what language are you using?
grpcio==1.54.2
What operating system (Linux, Windows,...) and version?
OSX Ventura 13.6
What runtime / compiler are you using (e.g. python version or version of gcc)
Python 3.11.4
What did you do?
tl;dr I find that setting an RPC via service config needs ~1 full second higher than the client/server will actually take. Happens across a variety of timings, and ONLY applies to the service config and NOT to setting a timeout at the actual RPC call.
Started a simple gRPC server that sleeps for a small time. Then perform a client call with a timeout in the service config for 1 second more than the configured sleep, which fails unexpectedly and in a shorter time than the timeout.
The below example should sleep for .2s, should timeout in 1.2s, but the client returns a DEADLINE_EXCEEDED in less than .2s.
Server
from concurrent import futures
import time
import grpc
import greeting_pb2
import greeting_pb2_grpc
class Greeter(greeting_pb2_grpc.GreeterServicer):
def greet(self, request, context):
print("Got request " + str(request))
time.sleep(0.2)
return greeting_pb2.ServerOutput(message="{0} {1}!".format(request.greeting, request.name))
def server():
server = grpc.server(futures.ThreadPoolExecutor(max_workers=2))
greeting_pb2_grpc.add_GreeterServicer_to_server(Greeter(), server)
server.add_insecure_port("[::]:50051")
print("gRPC starting")
server.start()
server.wait_for_termination()
server()
Client
import time
import grpc
import greeting_pb2
import greeting_pb2_grpc
options = [("grpc.service_config", '{"methodConfig": [{"timeout": "1.2s", "name": [{}]}] }')]
channel = grpc.insecure_channel(
target="dns:///localhost:50051",
options=options,
)
stub = greeting_pb2_grpc.GreeterStub(channel)
start = time.perf_counter()
try:
response = stub.greet(greeting_pb2.ClientInput(name="Me", greeting="Yo"))
except Exception as e:
response = e
print(response, " in ", time.perf_counter() - start)
What did you expect to see?
I expect this RPC to succeed.
What did you see instead?
This RPC fails in a shorter time than the timeout. For the example above it fails at 0.14ish seconds rather than 1.2ish seconds.
<_InactiveRpcError of RPC that terminated with:
status = StatusCode.DEADLINE_EXCEEDED
details = "Deadline Exceeded"
debug_error_string = "UNKNOWN:Deadline Exceeded {grpc_status:4, created_time:"2023-09-29T11:00:52.966757-05:00"}"
> in 0.14907191699603572
This will then succeed if I change the timeout to 1.3 seconds, and this pattern is repeatable for higher values. e.g. a server that sleeps for 3 seconds needs a serviceConfig timeout of ~4.1 seconds to succeed.
However, using the timeout in the RPC call instead seems fine. It doesn't seem to have the same off-by-one issue that I find in the service config.
Anything else we should know about your project / environment?
Bump!
This looks interesting, Protobuf serialization and deserialization might add some overhead on server handler but 1s sounds too much, can you turn on additional logging using those env vars: GRPC_VERBOSITY=DEBUG GRPC_TRACE=all,-timer,-timer_check?
@XuanWang-Amos sure thing - that's pretty verbose even from this little call though - any particular way I can distribute it easy here?
And do you want the server side, client side, or both?
Of note though from my reading: from the client side the very first log line is
D1005 13:33:05.331750000 8062492800 config.cc:175] gRPC EXPERIMENT tcp_frame_size_tuning OFF (default:OFF)
and the first indication of CANCELLED is
I1005 13:33:05.486479000 6131757056 dynamic_filters.cc:81] OP[retry_filter:0x140830860]: CANCEL:UNKNOWN:Deadline Exceeded {created_time:"2023-10-05T13:33:05.484865-05:00", grpc_status:4}
So it gets to CANCELLED within 0.15s of the very first log line even when the service config timeout is 1.2s
@justincely Just the client side should be sufficient.
that's pretty verbose even from this little call though - any particular way I can distribute it easy here?
I believe Github will let you do file uploads if you drag-and-drop the text file into a comment textbox. Otherwise, pastebin should work just fine. What we're really after is whether the gRPC library truly believes that it's been configured with 1.2s or not.
@yashykt Can you please take a look? Is this expected to work? In particular, is an empty list of methods considered wildcard and matches all methods? Or is the issue here simply that the retry isn't applied to any methods?
Thanks for the eyes. FWIW the retry policy does work configured like this (as in, i can see retries happening and timeouts increasing or decreasing), and the empty list of methods is supposed to work this way according to the proto: https://github.com/grpc/grpc-proto/blob/master/grpc/service_config/service_config.proto#L58.
@yashykt any updates? any more information I can provide?
Following for updates, have the same issue as well.
@yashykt PTAL.
This finally rose up in my priority list. Sorry for the late reply! This needs some debugging on my part.
In particular, is an empty list of methods considered wildcard and matches all methods?
"name": [{}] is not an empty list. It has exactly one element with an empty service and method name. Since this is the only element, this will apply to everything.
Thanks for reporting this! Much appreciated!
I've got a fix out in #36136
Any updates?
Bump
@yashykt PTAL
Sorry this fell off my radar. That PR needs additional work. If someone has some time, please feel free to pick it up.
Any updates on this issue? Have been experienceing this too.