Slow Query History / Log
Is your feature request related to a problem? Please describe.
Quickwit should be able to identify slow queries and store a record of their execution, potentially identifying why they are slow.
Describe the solution you'd like
I think storing records in the metastore is perfectly reasonable:
- ability to define a time threshold at which a query is considered "slow"
- a record of the original input query (Elasticsearch DSL or Quickwit native)
- a record of execution time
- a record of the index list targeted (wildcards expanded)
- a record of the generated query AST that was used to execute the query
- a record of whether the query targeted the configured timestamp field
- a record of the constraints around the timestamp field
- number of leaf nodes targeted (or even the list of leaf nodes at the time)
- approximate number of documents returned by the query
- approximate number of documents matched by the query
- configurable ability to auto-explain slow queries
- number of splits downloaded during the query and/or bytes downloaded
- amount of time spent downloading splits
- configurable amount of time after which records can be purged (the data loses relevance over time)
- surfaced via an API would be 👌
```sql
CREATE TABLE IF NOT EXISTS slow_search_query (
  query_timestamp TIMESTAMP NOT NULL DEFAULT NOW()
  , exec_time_ms BIGINT NOT NULL DEFAULT 0
  , split_download_count BIGINT NOT NULL DEFAULT 0
  , split_download_time_ms BIGINT NOT NULL DEFAULT 0
  , split_download_bytes BIGINT NOT NULL DEFAULT 0
  , documents_matched BIGINT NOT NULL DEFAULT 0
  , documents_returned BIGINT NOT NULL DEFAULT 0
  , timestamp_field_used BOOLEAN NOT NULL DEFAULT FALSE
  , target_index TEXT[] NOT NULL
  , leaf_targets TEXT[] NOT NULL DEFAULT '{}'::TEXT[] -- (or BIGINT NOT NULL DEFAULT 0)
  , query_input TEXT NOT NULL
  , query_ast TEXT NOT NULL
  , query_explain TEXT NULL DEFAULT NULL
);
```
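To make the "surfaced via an API" and purge bullets concrete, here are sketches of the kinds of statements this would enable. These run against the proposed table above (nothing Quickwit ships today), with a placeholder 30-day retention window:

```sql
-- Surface the ten slowest queries from the last 24 hours, with the
-- download stats needed to judge whether the time went to fetching
-- splits or to executing the query itself.
SELECT query_timestamp
     , exec_time_ms
     , split_download_count
     , split_download_time_ms
     , split_download_bytes
     , target_index
     , query_input
FROM slow_search_query
WHERE query_timestamp > NOW() - INTERVAL '24 hours'
ORDER BY exec_time_ms DESC
LIMIT 10;

-- Purge records older than the configured retention window
-- (30 days is a placeholder; per the retention bullet above).
DELETE FROM slow_search_query
WHERE query_timestamp < NOW() - INTERVAL '30 days';
```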
Describe alternatives you've considered
Attempting to log and/or store slow-query records in our application DB.
Additional context
The metrics reported by the searcher leave some gaps in understanding why queries are performing poorly.
The problem with having applications implement query logging is that it likely means modifying a large number of applications to capture every search query being executed and to record as much information as can be determined from the response. Error responses tend to be hard to understand, and most of the reasons for a query being slow aren't known to client applications.
It additionally means any system running a decent search workload needs to stand up additional infrastructure to track slow queries, and is still missing important information needed to debug a query. It becomes a retroactive snipe hunt of trying to correlate information from multiple systems (logs, metrics, DB records, etc.) to understand what is happening.
One thing to consider is whether the search nodes are interfacing with a metastore that cannot issue writes (e.g., a PostgreSQL read replica). That might make things complicated.
It doesn't necessarily need to be in Postgres / the metastore, but it would be very beneficial if one were able to query it. It's a data set and use case that lends itself really well to SQL.
It could be a system-maintained index with a known schema, for example.
Just something other than stdout logging.
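For what it's worth, the aggregation below is the kind of thing that makes a SQL-queryable store attractive here. It's a sketch assuming the hypothetical slow_search_query table from above: which indexes produce slow queries, and how many of those queries skipped the timestamp field.

```sql
-- Which indexes produce slow queries, and how often did those
-- queries fail to constrain the configured timestamp field?
-- Sketch only; assumes the hypothetical slow_search_query table.
SELECT idx AS index_name
     , COUNT(*) AS slow_query_count
     , ROUND(AVG(exec_time_ms)) AS avg_exec_time_ms
     , COUNT(*) FILTER (WHERE NOT timestamp_field_used) AS missing_timestamp_filter
FROM slow_search_query
   , UNNEST(target_index) AS idx  -- one row per targeted index
GROUP BY idx
ORDER BY slow_query_count DESC;
```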