gax-java

feat: Enforce RPC deadlines even when GRPC does not

Open dpcollins-google opened this issue 3 years ago • 9 comments

Environment details

  • Programming language: Java
  • OS: Linux

Steps to reproduce

  1. Create a singleton RPC call for the Pub/Sub Lite cursor service using the blocking generated GAX surface, or using the futures surface followed by a call to get()
  2. In some edge case the call blocks indefinitely (for at least 100 hours), despite a timeout of 300 seconds in the service configuration.

dpcollins-google avatar Dec 07 '21 20:12 dpcollins-google

Is this a regression, only manifested in new library or dependency versions? And although it sounds like this is not something easily reproducible, any small sample or snippet that demonstrates this?

chanseokoh avatar Dec 07 '21 20:12 chanseokoh

This is unclear, but I have not experienced this in the past, so it is likely a recent regression (though "recent" here means on the order of months).

An example code snippet from the Apache Beam repo that triggered this is:

CursorServiceClient newCursorServiceClient() { ... }

newCursorServiceClient()
    .commitCursor(
        CommitCursorRequest.newBuilder()
            .setSubscription(options.subscriptionPath().toString())
            .setPartition(partition.value())
            .setCursor(Cursor.newBuilder().setOffset(offset.value()))
            .build());

dpcollins-google avatar Dec 07 '21 20:12 dpcollins-google

I see the transport of pubsublite v1 is gRPC. @vam-google any thoughts?

chanseokoh avatar Dec 07 '21 22:12 chanseokoh

@chanseokoh There are no other clients besides java-compute depending on the REST transport right now, so it is safe to assume that all reported issues that are not Compute-related involve gRPC.

vam-google avatar Dec 08 '21 22:12 vam-google

I just created my own pipeline, and I'm able to reproduce this fairly frequently: the future takes over a minute to finish. It has the following (truncated) stacktrace:

java.util.concurrent.TimeoutException: Waited 1 minutes (plus 834188 nanoseconds delay) for com.google.api.gax.retrying.CallbackChainRetryingFuture@32bd4ca3[status=PENDING]
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:527)
	at org.apache.beam.sdk.io.gcp.pubsublite.internal.SubscriberAssembler.lambda$getCommitter$0(SubscriberAssembler.java:106)
	at org.apache.beam.sdk.io.gcp.pubsublite.internal.PerSubscriptionPartitionSdf.lambda$processElement$0(PerSubscriptionPartitionSdf.java:88)
	at java.base/java.util.Optional.ifPresent(Optional.java:183)
	at org.apache.beam.sdk.io.gcp.pubsublite.internal.PerSubscriptionPartitionSdf.processElement(PerSubscriptionPartitionSdf.java:84)
	at org.apache.beam.sdk.io.gcp.pubsublite.internal.PerSubscriptionPartitionSdf$DoFnInvoker.invokeProcessElement(Unknown Source)
	...
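
For context, the TimeoutException above comes from a bounded wait on the caller's side, which limits how long the caller blocks but leaves the underlying future pending. A minimal sketch of that pattern using only java.util.concurrent follows; the class and method names (BoundedWait, finishedWithin) are hypothetical and are not taken from Beam or gax-java:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper for illustration only; not part of Beam or gax-java. */
public final class BoundedWait {
  private BoundedWait() {}

  /**
   * Waits at most the given time for the future.
   * Returns true if it finished (successfully or with a failure) within the bound,
   * and false if the wait timed out or the waiting thread was interrupted.
   */
  public static boolean finishedWithin(Future<?> future, long timeout, TimeUnit unit) {
    try {
      future.get(timeout, unit);
      return true;
    } catch (TimeoutException e) {
      // The caller is unblocked, but the future itself is still pending.
      return false;
    } catch (ExecutionException e) {
      // The future finished, although with a failure.
      return true;
    } catch (InterruptedException e) {
      // Restore the interrupt flag so callers can observe the interruption.
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

This only bounds the wait of the calling thread; it does not complete or cancel the pending future.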

dpcollins-google avatar Dec 09 '21 20:12 dpcollins-google

P1 out of SLO, please take a look & triage

AlanGasperini avatar Jan 04 '22 18:01 AlanGasperini

To provide more information: in this case the issue appears to be executor exhaustion at the gRPC layer, which prevents the gRPC future from ever returning. However, it would be useful to enforce deadlines on the gax future (i.e. complete it early) even if gRPC never completes the request.
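
A minimal sketch of the idea, using plain java.util.concurrent rather than any gax-java or gRPC API: a wrapper future is failed by a dedicated timer thread if the inner future has not completed by the deadline. The names DeadlineEnforcer and withDeadline are hypothetical, and the use of a separate timer thread is an assumption made so that the deadline can still fire when the RPC executor is exhausted.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper for illustration only; not part of gax-java or gRPC. */
public final class DeadlineEnforcer {
  // Dedicated daemon timer thread (an assumption of this sketch), kept separate
  // from whatever executor runs the RPC so the deadline can fire independently.
  private static final ScheduledExecutorService TIMER =
      Executors.newSingleThreadScheduledExecutor(
          r -> {
            Thread t = new Thread(r, "deadline-enforcer");
            t.setDaemon(true);
            return t;
          });

  private DeadlineEnforcer() {}

  /** Returns a future that fails with TimeoutException if {@code inner} misses the deadline. */
  public static <T> CompletableFuture<T> withDeadline(
      CompletableFuture<T> inner, long timeout, TimeUnit unit) {
    CompletableFuture<T> outer = new CompletableFuture<>();
    ScheduledFuture<?> timer =
        TIMER.schedule(
            () -> {
              // Fail the outer future first; if that wins the race, cancel the inner one.
              if (outer.completeExceptionally(
                  new TimeoutException("Deadline exceeded after " + timeout + " " + unit))) {
                inner.cancel(true);
              }
            },
            timeout,
            unit);
    inner.whenComplete(
        (value, error) -> {
          // The inner future finished (or was cancelled): stop the timer and propagate.
          timer.cancel(false);
          if (error != null) {
            outer.completeExceptionally(error);
          } else {
            outer.complete(value);
          }
        });
    return outer;
  }
}
```

With a wrapper like this, a caller blocking on the returned future is released at the deadline with a TimeoutException even if the inner future never completes.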

dpcollins-google avatar Jan 04 '22 20:01 dpcollins-google

@dpcollins-google Is there a corresponding issue filed against gRPC? Also, can we change this to a feature request and downgrade the priority? Thanks!

meltsufin avatar Jan 04 '22 20:01 meltsufin

I checked with @dpcollins-google offline, and we agreed to change this to a feature request and downgrade it to P2.

meltsufin avatar Jan 04 '22 21:01 meltsufin