Allow Presto Coordinator to ignore (not throw) negative runtime metrics.
We recently had this situation:
The Coordinator's log was full of exception stack traces:
java.lang.IllegalArgumentException: size is negative
at io.airlift.units.Preconditions.checkArgument(Preconditions.java:26)
at io.airlift.units.DataSize.
Clients reported 'query is gone' or a similar error. The Coordinator UI was unresponsive, while CPU was idle.
All of this was caused by a negative metric (rawInputDataSize) returned by a native worker; that bug has already been fixed.
Expected Behavior or Use Case
It might make sense not to be so strict about runtime metrics and stats and still allow the query and endpoints requesting query stats/state to go through the usual process.
Presto Component, Service, or Connector
Not sure
Possible Implementation
- Remove the checks (unfortunately, they live in the Airlift repo, in io/airlift/units/DataSize.java).
- Catch the exception and fall back to something (probably too expensive, and it would break the workflow).
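
A third option, sketched here only as an illustration, would be to sanitize the raw value before it ever reaches the DataSize constructor, so that nothing throws and the bad metric is still visible in the logs. The class and method names below (MetricSanitizer, clampNonNegative, clampedCount) are hypothetical and do not exist in Presto or Airlift; where such a call would belong in the Coordinator is also an assumption.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Logger;

// Hypothetical helper: not part of Presto or Airlift.
final class MetricSanitizer {
    private static final Logger LOG = Logger.getLogger(MetricSanitizer.class.getName());

    // Counts how many negative values were replaced, so the problem stays observable.
    private static final AtomicLong CLAMPED = new AtomicLong();

    private MetricSanitizer() {}

    // Returns rawValue unchanged when it is non-negative; otherwise logs a
    // warning and returns 0 so that downstream size checks cannot fail.
    static long clampNonNegative(String metricName, long rawValue) {
        if (rawValue >= 0) {
            return rawValue;
        }
        CLAMPED.incrementAndGet();
        LOG.warning(() -> "Ignoring negative value " + rawValue + " for metric " + metricName);
        return 0L;
    }

    static long clampedCount() {
        return CLAMPED.get();
    }
}
```

A caller would pass the worker-reported byte count through clampNonNegative("rawInputDataSize", bytes) before building the size object. Clamping to zero trades accuracy of the affected stat for availability of the query and its status endpoints.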
Example Screenshots (if appropriate):
Context
A simple mistake in a single metric can put a Presto cluster into a state where queries fail with a 'query is gone' error and the Coordinator UI is unresponsive. It would be nice to make Presto more resilient to such problems.