heroic icon indicating copy to clipboard operation
heroic copied to clipboard

Heroic appears to mis-handle `EsRejectedExecutionException` errors from ElasticSearch in batched queries

Open rf opened this issue 4 years ago • 2 comments

I'm seeing an issue in production where a EsRejectedExecutionException result from ElasticSearch will cause Heroic to return a correct-looking response but with empty results and empty errors. The responses look like this:

    "redactedkey:1": {
      "queryId": "21ee7a67-1023-495b-b2dd-55609d914f7a",
      "range": {
        "start": 1593316800000,
        "end": 1595736600000
      },
      "trace": {
        "what": {
          "name": "com.spotify.heroic.CoreQueryManager#query"
        },
        "elapsed": 537117,
        "children": [
          {
            "what": {
              "name": "com.spotify.heroic.CoreQueryManager#query_shard[{site=us-east1}]"
            },
            "elapsed": 537157,
            "children": [
              {
                "what": {
                  "name": "grpc://heroic-2.heroic.default.svc.cluster.local.:9698#query"
                },
                "elapsed": 537078,
                "children": [
                  {
                    "what": {
                      "name": "com.spotify.heroic.metric.LocalMetricManager#query"
                    },
                    "elapsed": 327049,
                    "children": []
                  }
                ]
              }
            ]
          }
        ]
      },
      "limits": [],
      "cached": false,
      "cache": null,
      "commonTags": {},
      "commonResource": {},
      "result": [],
      "preAggregationSampleSize": 0,
      "errors": []
    },

And in our heroic logs:

Failed to execute phase [fetch],
21:47:18.350 [elasticsearch[_client_][listener][T#3]] ERROR com.spotify.heroic.rpc.grpc.GrpcRpcProtocolServer - 29c3cfe3-ca6c-4c16-bc16-b755c8867848: Request failed
        ... 48 more
        at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.doExecute(EsThreadPoolExecutor.java:94)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
        at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
        at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:50)
Caused by: EsRejectedExecutionException[rejected execution of org.elasticsearch.action.search.FetchSearchPhase$1@4606797a on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@433876a3[Running, pool size = 4, active threads = 4, queued tasks = 1000, completed tasks = 3259109]]]
        at java.lang.Thread.run(Thread.java:748)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)

Heroic should pass this response along to the client in the errors array.

rf avatar Aug 03 '20 22:08 rf

Thanks for reporting this! Just to confirm, are you running version 2.2.0 or an earlier one?

lmuhlha avatar Aug 03 '20 22:08 lmuhlha

We are running 0.10.5 at the moment.

memory avatar Aug 04 '20 13:08 memory