
structured-decoder-loader latency event

Open · whd opened this issue 3 years ago • 1 comment

[attached image: event]

I presently have no idea what caused this. Ingestion traffic around the time looked normal, and the recent edge deployment / schema updates completed more than 24h before this event. No other job types (raw, decoded) or pipeline families (telemetry) were affected. I deleted the structured loader pods, but I didn't see any errors in their logs before deleting them except this single one:

Jul 10, 2021 3:37:25 AM com.google.cloud.pubsub.v1.StreamingSubscriberConnection$2 onFailure
WARNING: failed to send operations
com.google.api.gax.rpc.UnavailableException: io.grpc.StatusRuntimeException: UNAVAILABLE: 502:Bad Gateway
        at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:69)
        at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72)
        at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:60)
        at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
        at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:68)
        at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1074)
        at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1213)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:983)
        at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:771)
        at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:563)
        at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:533)
        at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:553)
        at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:68)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:739)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:718)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: 502:Bad Gateway
        at io.grpc.Status.asRuntimeException(Status.java:535)
        ... 14 more

which, if I'm doing my timestamp calculations right, is too early to be likely related. Some 10-30 minutes after I deleted and recreated the pods, latency returned to nominal, and I'm unsure whether deleting them did anything or whether the system would have recovered gracefully anyway.

Apart from not being able to explain this, I would consider detailed investigation low priority unless it happens again.
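
For reference, the loader consumes from Pub/Sub via the standard client library subscriber, presumably along the lines of the sketch below (the project/subscription names and the receiver body are illustrative, not the actual ingestion-sink wiring). The StreamingSubscriberConnection from the warning lives inside that Subscriber, and transient UNAVAILABLE errors on the streaming pull are retried by the library, which is part of why a single 502 that far before the latency spike doesn't look like a smoking gun.

import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;

public class LoaderSubscriberSketch {

  public static void main(String[] args) {
    // Illustrative names only; the real ingestion-sink wiring differs.
    ProjectSubscriptionName subscription =
        ProjectSubscriptionName.of("my-project", "structured-decoded-loader");

    MessageReceiver receiver =
        (PubsubMessage message, AckReplyConsumer consumer) -> {
          // The loader batches messages and only acks once a batch is durably
          // written, so messages stay unacked while a batch is stuck.
          consumer.ack();
        };

    // The StreamingSubscriberConnection from the warning above lives inside
    // this Subscriber; transient UNAVAILABLE errors on the streaming pull are
    // retried by the client library and surface only as warnings like that one.
    Subscriber subscriber = Subscriber.newBuilder(subscription, receiver).build();
    subscriber.startAsync().awaitRunning();
  }
}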

whd · Jul 10 '21 08:07

This occurred on the telemetry subscription as well, some six hours later.

The difference is that the logging is much clearer in this case: GCS objects the loader is expecting to load are missing. Another, perhaps related, difference is that the overall number of unacked messages didn't grow during this event, only the age of the oldest unacked message did (usually that means a single message in a batch is unrouteable, but the error logs don't indicate that here).

extendedStackTrace: "com.google.cloud.bigquery.BigQueryException: Not found: Files /bigstore/moz-fx-data-prod-ingestion-sink/decoded.telemetry.bq-sink-payload/OUTPUT_TABLE=moz-fx-data-shared-prod.telemetry_live.modules_v4/dc58477c-3405-4075-b2e9-60a46c9628f4.ndjson
	at com.google.cloud.bigquery.Job.reload(Job.java:411) ~[google-cloud-bigquery-1.128.1.jar:1.128.1]
	at com.google.cloud.bigquery.Job.waitFor(Job.java:248) ~[google-cloud-bigquery-1.128.1.jar:1.128.1]
	at com.mozilla.telemetry.ingestion.sink.io.BigQuery$Load$Batch.close(BigQuery.java:220) ~[ingestion-sink-0.1-SNAPSHOT.jar:?]
	at com.mozilla.telemetry.ingestion.sink.util.BatchWrite$Batch.synchronousClose(BatchWrite.java:196) ~[ingestion-sink-0.1-SNAPSHOT.jar:?]
	at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:966) [?:1.8.0_292]
	at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:940) [?:1.8.0_292]
	at java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:457) [?:1.8.0_292]
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) [?:1.8.0_292]
	at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) [?:1.8.0_292]
	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) [?:1.8.0_292]
	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175) [?:1.8.0_292]
"

A pod delete/recreate appears to have been effective in this case too.
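
If it recurs, one cheap check would be whether the staged object actually exists in GCS at the time of the failure, e.g. via the storage client; /bigstore/<bucket>/<object> in the error should correspond to gs://<bucket>/<object>:

import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class CheckStagedObject {

  public static void main(String[] args) {
    Storage storage = StorageOptions.getDefaultInstance().getService();

    // Object name copied from the error above.
    BlobId staged = BlobId.of(
        "moz-fx-data-prod-ingestion-sink",
        "decoded.telemetry.bq-sink-payload/"
            + "OUTPUT_TABLE=moz-fx-data-shared-prod.telemetry_live.modules_v4/"
            + "dc58477c-3405-4075-b2e9-60a46c9628f4.ndjson");

    Blob blob = storage.get(staged);
    if (blob == null) {
      System.out.println("staged batch object is missing (or was never written)");
    } else {
      System.out.println(
          "object exists: " + blob.getSize() + " bytes, created " + blob.getCreateTime());
    }
  }
}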

whd · Jul 10 '21 14:07