elasticsearch icon indicating copy to clipboard operation
elasticsearch copied to clipboard

Add queries stats to _cluster/stats API

Open mayya-sharipova opened this issue 3 years ago • 3 comments
trafficstars

  • Query types are counted by a coordinating node
  • Query types are used in the original form as submitted by a user.
  • Query types are deduplicated per search request (e.g. if a search request has 1 bool query and 5 term queries, this is counted as 1 bool query and 1 term query).

mayya-sharipova avatar Sep 21 '22 01:09 mayya-sharipova

Documentation preview:

github-actions[bot] avatar Sep 21 '22 01:09 github-actions[bot]

Pinging @elastic/es-search (Team:Search)

elasticsearchmachine avatar Sep 21 '22 01:09 elasticsearchmachine

Hi @mayya-sharipova, I've created a changelog YAML for you.

elasticsearchmachine avatar Sep 21 '22 01:09 elasticsearchmachine

Like I said above, I've been looking into tracking queries while parsing them. I see the following options:

  1. add a context argument to all the *QueryBuilder#fromXContent methods . This is the ideal way to share context, where we could add the query tracking logic similar to what we are doing in this PR with the rewrite context. This change would touch quite some classes but is mostly mechanical, the main disadvantage I see is that it is a breaking change for plugin developers, as the context argument would become required for queries that are provided by plugins. We can still go ahead and make this change if we think it make sense, but we should look at alternatives too.

  2. wrap the xcontent parser and specialize the named object parsing, which is where we can track queries that get parsed: I opened a draft PR to show what I mean: https://github.com/elastic/elasticsearch/pull/90425 .

While on one hand this is additional effort, I think we should try to add the query tracking where it fits best. One consequence of doing the tracking at parsing is also around counting what comes at REST, so not counting CCS calls twice, and counting incoming queries as they are, before any rewrite.

What do you think @jtibshirani @mayya-sharipova ?

javanna avatar Sep 27 '22 19:09 javanna

I thought about it more, and noticed a surprising behavior when using rewrite to track. If the query rewrites on the coordinating node, then we record both the original and rewritten forms. This is because we call AbstractQueryBuilder#rewrite again and again until the query doesn't change. I tried the current code with a terms lookup query that doesn't match any documents. In the stats you see both the original query and the rewritten form: {terms=1, match_none=1}.

To me using parsing makes sense and feels low overhead. I don't have an idea for the cleanest way to do it though.

Another option would be to add a method QueryBuilder#visit that lets you visit the whole query tree and collect types. We have the same thing for Lucene queries Query#visit, and use it to enforce the total clauses limit. To me this seems like the clearest design, and might not be too much work because we could have a default implementation on AbstractQueryBuilder (?)

jtibshirani avatar Sep 28 '22 21:09 jtibshirani

I wonder if this'd make more sense in the node stats API (perhaps also adding these summary stats to the cluster stats API). I can imagine cases where it would be useful to see this broken down by node.

Hey @DaveCTurner we have discussed this in length with the team and decided that there is no value in having these stats part of nodes stats, because what we want to track is incoming queries at the cluster level, and not which queries are received by which coordinating node. It could be different if we were tracking queries on the data nodes, which we did in a previous iteration but we decided to move to the coordinating node based on product requirements.

javanna avatar Sep 29 '22 08:09 javanna

@jtibshirani thanks for digging, I do see how counting on rewrite can lead to surprising behaviors.

The visit idea is something I considered too. I am not sure that we should introduce a separate internal API to visit queries, given that we can do what we need at parse time: in practice we would parse incoming queries, and immediately after traverse the tree of what we just parsed to track queries usage? Could the visit mechanism be used for other purposes in the future? It could be useful if we wanted to analyze the query tree after rewrite, but we could then introduce it later?

@romseygeek you may have thoughts on this.

javanna avatar Sep 29 '22 08:09 javanna

I've thought in the past that a visit API might be helpful, but actually I think the parser-wrapping PR that Luca linked here is a nicer way to implement it. I'd be interested to see how the depth validation in the PR combines with query type tracking - possibly we can abstract it further and have something like a query visiting parser that takes a collection of callbacks to be triggered each time we move to a subquery.

romseygeek avatar Sep 29 '22 08:09 romseygeek

I thought about it more, and noticed a surprising behavior when using rewrite to track. If the query rewrites on the coordinating node, then we record both the original and rewritten forms. This is because we call AbstractQueryBuilder#rewrite again and again until the query doesn't change. I tried the current code with a terms lookup query that doesn't match any documents. In the stats you see both the original query and the rewritten form: {terms=1, match_none=1}.

@jtibshirani That's a very good point. I was under impression that all queries we need to have a shard context for a query to be rewritten; while on a coordinating node they never get rewritten. But as an example with terms query demonstrated this is not true.

I agree this is a major downside of counting queries on rewrite, and we should explore an alternative approach.

mayya-sharipova avatar Sep 29 '22 20:09 mayya-sharipova

I've synchronized with @javanna offline, and we thought the best way forward would be to proceed with his parsing approach https://github.com/elastic/elasticsearch/pull/90425, but there is a downside with it is that not all queries are going though XContent parsing:

  • queries submitted by internal applications (EQL, ML, Security) may not go through parsing. Is it important to count them? We are inclined to think that it is not important, as a goal of this stats would be to count user submitted queries.
  • in tests we usually don't use parsing. This is not a blocker, as we can design tests with parsing.

Thus, I am closing this PR in favour of https://github.com/elastic/elasticsearch/pull/90425 which we agreed that @javanna will enhance to add stats.

mayya-sharipova avatar Sep 29 '22 20:09 mayya-sharipova