elasticsearch icon indicating copy to clipboard operation
elasticsearch copied to clipboard

Named query performance degradation after upgrading to 8.13.2 (from 8.7.1)

Open shimpeko opened this issue 1 year ago • 5 comments

Elasticsearch Version

8.13.2

Installed Plugins

No response

Java Version

bundled

OS Version

official docker image

Problem Description

After upgrading from 8.7.1 to 8.13.2, some of our queries are more than 5x times slower (longer response time). After digging, I've identified that the named query is causing the problem. Queries with "_name" get about 5x times slower, while the performance of queries without "_name" stays the same.

We are running ES on K8S using ECK but it can be reproducible with docker environments. The same query takes longer in 8.13.2.

8.7.1

--- 1st RUN ---
{
  "took" : 59,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}
--- 2nd RUN ---
{
  "took" : 13,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}
--- 3rd RUN ---
{
  "took" : 13,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

8.13.2:

--- 1st RUN ---
{
  "took" : 296,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}
--- 2nd RUN ---
{
  "took" : 200,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}
--- 3rd RUN ---
{
  "took" : 201,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

Steps to Reproduce

Please see https://github.com/shimpeko/es_named_query_perf/tree/main

Logs (if relevant)

No response

shimpeko avatar May 13 '24 11:05 shimpeko

This was reported at: https://discuss.elastic.co/t/359189

dadoonet avatar May 13 '24 11:05 dadoonet

Pinging @elastic/es-search (Team:Search)

elasticsearchmachine avatar May 13 '24 20:05 elasticsearchmachine

Looking at the reproduction (thanks for providing one) the issue seems to be around a single query with 4k named term queries. First of all, the reproduction query matches no document hence named queries, which are executed during the fetch phase, are not the culprit. From the number of term queries the main culprit would be https://github.com/apache/lucene/pull/12183 which creates term states concurrently using the searcher executor. Each term in the query creates one task per segment and executes in a different thread. The overhead in this scenario is tens of milliseconds due to the number of terms. It is significative in this setup because none of the terms are present in the dictionary so the work done by the thread is minimal. The Lucene change was made to parallelise the IOs during a single query, in this case they are no IO involved so it ends up hurting the performance. Another strategy is investigated for Lucene 10 where the goal is to rely on system calls to parallelise the IOs (rather than real Java threads). This might limit the impact when no IO is required like in this case. @shimpeko is the scenario exposed here representative of your use case? I expect that the difference in performance should be much smaller when the query terms are actually present in the dictionary.

jimczi avatar May 15 '24 15:05 jimczi

@jimczi Thank you for taking a look at this.

I've already removed named queries from our production query and confirmed it improved response time to the same level as 8.7 with named queries. So I'm still suspecting the named query at the moment. I'll try to reproduce it with a query that matches documents.


Edit: Regarding performance https://github.com/elastic/elasticsearch/issues/108659 is more important for us as we haven't figured out a workaround. For this (possible) named query issue, removing named queries worked for us as a workaround.

shimpeko avatar May 16 '24 07:05 shimpeko

@jimczi I've updated the reproduction query to match documents as https://github.com/shimpeko/es_named_query_perf/commit/b897f9199c3bd405b1165efd6adcd4aef5407ac3 and I still see a notable performance difference between 8.7.1 and 8.13.2

8.7.1

....
--- 3rd RUN ---
{
  "took" : 55,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4000,
      "relation" : "eq"
    },
    "max_score" : 11.1074705,
    "hits" : [
      {
        "_index" : "test_index",
        "_id" : "emGOgI8BQpKfxT3bI0cz",
        "_score" : 11.1074705,
        "_source" : {
          "test_text" : "atrjmoueyc"
        },
        "matched_queries" : [
          "query = atrjmoueyc"
        ]
      },
      {
        "_index" : "test_index",
        "_id" : "e2GOgI8BQpKfxT3bI0c0",
        "_score" : 11.1074705,
        "_source" : {
          "test_text" : "fmjkwdxgnb"
        },
        "matched_queries" : [
          "query = fmjkwdxgnb"
        ]
      },
      {
        "_index" : "test_index",
        "_id" : "fGGOgI8BQpKfxT3bI0c0",
        "_score" : 11.1074705,
        "_source" : {
          "test_text" : "slaqatgrtw"
        },
        "matched_queries" : [
          "query = slaqatgrtw"
        ]
      },
      {
        "_index" : "test_index",
        "_id" : "fWGOgI8BQpKfxT3bI0c0",
        "_score" : 11.1074705,
        "_source" : {
          "test_text" : "mdvyrihjqq"
        },
        "matched_queries" : [
          "query = mdvyrihjqq"
        ]
      },
      {
        "_index" : "test_index",
        "_id" : "fmGOgI8BQpKfxT3bI0c0",
        "_score" : 11.1074705,
        "_source" : {
          "test_text" : "cvvbunsbyo"
        },
        "matched_queries" : [
          "query = cvvbunsbyo"
        ]
      },
      {
        "_index" : "test_index",
        "_id" : "f2GOgI8BQpKfxT3bI0c0",
        "_score" : 11.1074705,
        "_source" : {
          "test_text" : "aihmsruxby"
        },
        "matched_queries" : [
          "query = aihmsruxby"
        ]
      },
      {
        "_index" : "test_index",
        "_id" : "gGGOgI8BQpKfxT3bI0c0",
        "_score" : 11.1074705,
        "_source" : {
          "test_text" : "lmgsfemmca"
        },
        "matched_queries" : [
          "query = lmgsfemmca"
        ]
      },
      {
        "_index" : "test_index",
        "_id" : "gWGOgI8BQpKfxT3bI0c0",
        "_score" : 11.1074705,
        "_source" : {
          "test_text" : "isatduxwmn"
        },
        "matched_queries" : [
          "query = isatduxwmn"
        ]
      },
      {
        "_index" : "test_index",
        "_id" : "gmGOgI8BQpKfxT3bI0c0",
        "_score" : 11.1074705,
        "_source" : {
          "test_text" : "lvrmulxqyp"
        },
        "matched_queries" : [
          "query = lvrmulxqyp"
        ]
      },
      {
        "_index" : "test_index",
        "_id" : "g2GOgI8BQpKfxT3bI0c0",
        "_score" : 11.1074705,
        "_source" : {
          "test_text" : "bzwcblsdpi"
        },
        "matched_queries" : [
          "query = bzwcblsdpi"
        ]
      }
    ]
  }
}

8.13.2

...
--- 3rd RUN ---
{
  "took" : 448,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4000,
      "relation" : "eq"
    },
    "max_score" : 11.1074705,
    "hits" : [
      {
        "_index" : "test_index",
        "_id" : "Ma2PgI8BD4MPqc5jWoVf",
        "_score" : 11.1074705,
        "_source" : {
          "test_text" : "atrjmoueyc"
        },
        "matched_queries" : [
          "query = atrjmoueyc"
        ]
      },
      {
        "_index" : "test_index",
        "_id" : "Mq2PgI8BD4MPqc5jWoVg",
        "_score" : 11.1074705,
        "_source" : {
          "test_text" : "fmjkwdxgnb"
        },
        "matched_queries" : [
          "query = fmjkwdxgnb"
        ]
      },
      {
        "_index" : "test_index",
        "_id" : "M62PgI8BD4MPqc5jWoVg",
        "_score" : 11.1074705,
        "_source" : {
          "test_text" : "slaqatgrtw"
        },
        "matched_queries" : [
          "query = slaqatgrtw"
        ]
      },
      {
        "_index" : "test_index",
        "_id" : "NK2PgI8BD4MPqc5jWoVg",
        "_score" : 11.1074705,
        "_source" : {
          "test_text" : "mdvyrihjqq"
        },
        "matched_queries" : [
          "query = mdvyrihjqq"
        ]
      },
      {
        "_index" : "test_index",
        "_id" : "Na2PgI8BD4MPqc5jWoVg",
        "_score" : 11.1074705,
        "_source" : {
          "test_text" : "cvvbunsbyo"
        },
        "matched_queries" : [
          "query = cvvbunsbyo"
        ]
      },
      {
        "_index" : "test_index",
        "_id" : "Nq2PgI8BD4MPqc5jWoVg",
        "_score" : 11.1074705,
        "_source" : {
          "test_text" : "aihmsruxby"
        },
        "matched_queries" : [
          "query = aihmsruxby"
        ]
      },
      {
        "_index" : "test_index",
        "_id" : "N62PgI8BD4MPqc5jWoVg",
        "_score" : 11.1074705,
        "_source" : {
          "test_text" : "lmgsfemmca"
        },
        "matched_queries" : [
          "query = lmgsfemmca"
        ]
      },
      {
        "_index" : "test_index",
        "_id" : "OK2PgI8BD4MPqc5jWoVg",
        "_score" : 11.1074705,
        "_source" : {
          "test_text" : "isatduxwmn"
        },
        "matched_queries" : [
          "query = isatduxwmn"
        ]
      },
      {
        "_index" : "test_index",
        "_id" : "Oa2PgI8BD4MPqc5jWoVg",
        "_score" : 11.1074705,
        "_source" : {
          "test_text" : "lvrmulxqyp"
        },
        "matched_queries" : [
          "query = lvrmulxqyp"
        ]
      },
      {
        "_index" : "test_index",
        "_id" : "Oq2PgI8BD4MPqc5jWoVg",
        "_score" : 11.1074705,
        "_source" : {
          "test_text" : "bzwcblsdpi"
        },
        "matched_queries" : [
          "query = bzwcblsdpi"
        ]
      }
    ]
  }
}

shimpeko avatar May 16 '24 08:05 shimpeko

I can reproduce the same regression without the named queries @shimpeko . The issue as explained above is related to the number of term queries on a single request. The change in https://github.com/apache/lucene/pull/12183 introduces an overhead for large boolean queries composed of multiple term queries. It is emphasised when using named queries since they execute the query a second time during the fetch phase.

Regarding performance https://github.com/elastic/elasticsearch/issues/108659 is more important for us as we haven't figured out a workaround. For this (possible) named query issue, removing named queries worked for us as a workaround.

I suspect that https://github.com/elastic/elasticsearch/issues/108659 is a duplicate of this problem. Are you running lots of term queries (similar to this example) in a single boolean query? @javanna I wonder if we should allow to opt-out from https://github.com/apache/lucene/pull/12183? Using multiple threads to load terms can add a significant overhead when the number of terms is large as demonstrated in this issue.

jimczi avatar May 20 '24 23:05 jimczi

Thank you so much for the investigation. I appreciate it.

I suspect that https://github.com/elastic/elasticsearch/issues/108659 is a duplicate of this problem. Are you running lots of term queries (similar to this example) in a single boolean query?

I maybe misunderstood something but this example, the query on this issue, has multiple match queries in a single boolean query, not term queries.

Regarding https://github.com/elastic/elasticsearch/issues/108659, again they are match queries (not term queries) but yes, the programmatic (slow) production queries have 100+ multi_match queries in a boolean query. Just FYI, we can still observe a significant difference in create_weight value with a single match query in a boolean query between 8.7 and 8.13 as shared on https://github.com/elastic/elasticsearch/issues/108659.

I can reproduce the same regression without the named queries @shimpeko .

Thank you again for confirming the issue.

I now think that my previous comment "I've already removed named queries from our production query and confirmed it improved response time to the same level as 8.7 with named queries." is not correct. What might have actually happened was that I removed named queries from our production environment, which improved the 99th percentile response time; however, a small number of queries with many match queries remained slow. I thought this was a separate problem and opened another GitHub issue as https://github.com/elastic/elasticsearch/issues/108659.

opt-out from https://github.com/apache/lucene/pull/12183

This would really help us if it fixes this issue and https://github.com/elastic/elasticsearch/issues/108659. We are considering downgrading to 8.7 but it is a task as ES doesn't support downgrade.

shimpeko avatar May 21 '24 09:05 shimpeko

@jimczi ^

shimpeko avatar May 24 '24 08:05 shimpeko

@shimpeko

I maybe misunderstood something but this example, the query on this issue, has multiple match queries in a single boolean query, not term queries.

Those match queries will be converted to term queries (unless a prefix query is used) https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/index/search/MatchQueryParser.java#L523

andreidan avatar May 31 '24 15:05 andreidan

This looks to be a duplicate of #108659. It isn't specific to named queries, although its impact may be made more visible by named queries in the response time, due to their additional overhead.

javanna avatar Jun 13 '24 14:06 javanna

Pinging @elastic/es-search-relevance (Team:Search Relevance)

elasticsearchmachine avatar Jul 12 '24 15:07 elasticsearchmachine

This issue has been addressed in Lucene 10, which Elasticsearch 9.0 will depend on. createWeight is parallelized differently, no longer by creating one task per segment per field, which ended creating way more tasks than threads available. See #115932 for a workaround to be included in Elasticsearch 8.16 , which limits the number of tasks a single search call can create.

javanna avatar Oct 30 '24 19:10 javanna