gax-java
feat: Enforce RPC deadlines even when gRPC does not
Environment details
- Programming language: Java
- OS: Linux
Steps to reproduce
- Create a singleton RPC call for the Pub/Sub Lite cursor service using the blocking generated GAX surface, or using the futures surface followed by a call to get()
- In some edge cases it blocks forever (or at least 100 hours), despite a timeout of 300 seconds in the service configuration.
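As a caller-side stopgap for the unbounded get() described above, the wait itself can be bounded. This is a minimal sketch using only java.util.concurrent; the class and method names (BoundedWait, getWithin) are hypothetical and not part of gax-java. It should apply to the futures surface because ApiFuture extends java.util.concurrent.Future, but a never-completing CompletableFuture stands in for the stuck RPC here.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedWait {
  /**
   * Waits at most the given time for the result. On timeout, makes a
   * best-effort attempt to cancel the future and rethrows the TimeoutException.
   * (Hypothetical helper, not a gax-java API.)
   */
  public static <T> T getWithin(Future<T> future, long timeout, TimeUnit unit)
      throws InterruptedException, ExecutionException, TimeoutException {
    try {
      return future.get(timeout, unit);
    } catch (TimeoutException e) {
      future.cancel(true); // The underlying transport may ignore this.
      throw e;
    }
  }

  public static void main(String[] args) throws Exception {
    // Stand-in for an RPC future that never completes.
    CompletableFuture<String> stuck = new CompletableFuture<>();
    try {
      getWithin(stuck, 200, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      System.out.println("Gave up waiting; cancelled=" + stuck.isCancelled());
    }
  }
}
```

This only unblocks the calling thread; it does not make the future itself complete, which is what the feature request is about.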
Is this a regression, only manifested in new library or dependency versions? And although it sounds like this is not something easily reproducible, any small sample or snippet that demonstrates this?
This is unclear, but I had not experienced this in the past, so it is likely a fairly recent regression (within the last few months).
An example code snippet which triggered this from the apache beam repo is:
CursorServiceClient newCursorServiceClient() { ... }

newCursorServiceClient()
    .commitCursor(
        CommitCursorRequest.newBuilder()
            .setSubscription(options.subscriptionPath().toString())
            .setPartition(partition.value())
            .setCursor(Cursor.newBuilder().setOffset(offset.value()))
            .build());
I see the transport of pubsublite v1 is gRPC. @vam-google any thoughts?
@chanseokoh There are no other clients besides java-compute depending on the REST transport right now, so it is safe to assume that all reported issues that are not Compute-related involve gRPC.
I just created my own pipeline, and I'm able to reproduce this fairly frequently: the future takes over a minute to finish. It has the following (truncated) stack trace:
java.util.concurrent.TimeoutException: Waited 1 minutes (plus 834188 nanoseconds delay) for com.google.api.gax.retrying.CallbackChainRetryingFuture@32bd4ca3[status=PENDING]
    at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:527)
    at org.apache.beam.sdk.io.gcp.pubsublite.internal.SubscriberAssembler.lambda$getCommitter$0(SubscriberAssembler.java:106)
    at org.apache.beam.sdk.io.gcp.pubsublite.internal.PerSubscriptionPartitionSdf.lambda$processElement$0(PerSubscriptionPartitionSdf.java:88)
    at java.base/java.util.Optional.ifPresent(Optional.java:183)
    at org.apache.beam.sdk.io.gcp.pubsublite.internal.PerSubscriptionPartitionSdf.processElement(PerSubscriptionPartitionSdf.java:84)
    at org.apache.beam.sdk.io.gcp.pubsublite.internal.PerSubscriptionPartitionSdf$DoFnInvoker.invokeProcessElement(Unknown Source)
    ...
P1 out of SLO, please take a look & triage
To provide more information: in this case the issue appears to be executor exhaustion at the gRPC layer, which prevents the gRPC future from ever returning. However, it would be useful to enforce deadlines on the GAX future (i.e. complete it early) even if gRPC never completes the request.
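To illustrate the requested behavior, here is a minimal sketch of a future that is completed early by an independent timer when the underlying call never returns. It is not gax-java code: DeadlineEnforcer and withDeadline are hypothetical names, and CompletableFuture stands in for the ApiFuture that GAX actually returns. The timer runs on its own daemon thread so that it does not depend on the (possibly exhausted) executor that runs the RPC.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class DeadlineEnforcer {
  // Dedicated daemon timer thread, independent of the RPC executor.
  private static final ScheduledExecutorService TIMER =
      Executors.newSingleThreadScheduledExecutor(
          r -> {
            Thread t = new Thread(r, "deadline-enforcer");
            t.setDaemon(true);
            return t;
          });

  /** Returns a future that mirrors rpc but fails with TimeoutException once the timeout elapses. */
  public static <T> CompletableFuture<T> withDeadline(
      CompletableFuture<T> rpc, long timeout, TimeUnit unit) {
    CompletableFuture<T> guarded = new CompletableFuture<>();
    ScheduledFuture<?> timer =
        TIMER.schedule(
            () -> {
              boolean timedOut =
                  guarded.completeExceptionally(
                      new TimeoutException("Deadline of " + timeout + " " + unit + " exceeded"));
              if (timedOut) {
                rpc.cancel(true); // Best effort; the transport may ignore it.
              }
            },
            timeout,
            unit);
    // Forward the real outcome if it arrives first, and stop the timer.
    rpc.whenComplete(
        (value, error) -> {
          timer.cancel(false);
          if (error != null) {
            guarded.completeExceptionally(error);
          } else {
            guarded.complete(value);
          }
        });
    return guarded;
  }

  public static void main(String[] args) throws Exception {
    // Stand-in for an RPC future that the transport never completes.
    CompletableFuture<String> stuck = new CompletableFuture<>();
    try {
      withDeadline(stuck, 200, TimeUnit.MILLISECONDS).get(); // unbounded get() still returns
    } catch (ExecutionException e) {
      System.out.println("Failed with: " + e.getCause().getClass().getSimpleName());
      // prints "Failed with: TimeoutException"
    }
  }
}
```

On Java 9 and later, CompletableFuture.orTimeout gives similar behavior for plain CompletableFutures; the explicit timer above is shown only to make the mechanism visible.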
@dpcollins-google Is there a corresponding issue filed against gRPC? Also, can we change this to a feature request and downgrade the priority? Thanks!
I checked with @dpcollins-google offline and we agreed to change this to a feature request and downgrade it to P2.