Allow Presto Coordinator to ignore (not throw) negative runtime metrics.

Open spershin opened this issue 1 year ago • 1 comments

We recently had this situation: Coordinator's log was full of exception call stacks: java.lang.IllegalArgumentException: size is negative at io.airlift.units.Preconditions.checkArgument(Preconditions.java:26) at io.airlift.units.DataSize.(DataSize.java:58) at io.airlift.units.DataSize.succinctDataSize(DataSize.java:48) at io.airlift.units.DataSize.succinctBytes(DataSize.java:43) ... at com.facebook.presto.execution.scheduler.SqlQueryScheduler.getBasicStageStats(SqlQueryScheduler.java:853) ... at com.facebook.presto.dispatcher.DispatchManager.getQueries(DispatchManager.java:392) ...

Clients report that 'query is gone' or something similar. Coordinator UI is unresponsive. CPU is unused.

All this was caused by a negative metric (rawInputDataSize) returned by native worker, which has been fixed already.

Expected Behavior or Use Case

It might make sense not to be so strict about runtime metrics and stats and still allow the query and endpoints requesting query stats/state to go through the usual process.

Presto Component, Service, or Connector

Not sure

Possible Implementation

Remove the checks (they are, unfortunately in the AirLift (io/airlift/units/DataSize.java) repo.
Catch the exception and fallback to something (probably too expensive and will break the work flow).

Example Screenshots (if appropriate):

Context

A simple mistake in a single metric can put Presto cluster into some weird state where queries fail with 'query is gone' error and Coordinator UI is unresponsive. Would be nice to make Presto more resilient to such problems.

Jun 25 '24 22:06 spershin

I propose option 3: catch and handle the exception, or maybe just fail the query, but definitely catch it. Don't remove the checks.

Jun 29 '24 12:06 elharo