phrase_prefix match on random value causes slow queries and memory usage spikes
Describe the bug: A multi_match query of type phrase_prefix on a random value (a UUID) is extremely slow and drives searcher memory usage well above the configured limits.
Steps to reproduce (if applicable):
Example Request
[!CAUTION] The index pattern in the request expands to 4 indexes and contains 3,869,958,639 documents in total.
/api/v1/_elastic/logline.996226df4b.2024-06-03*,logline.996226df4b.2024-06-04*/_search
{
"size": 0,
"query": {
"bool": {
"must": [
{
"multi_match": {
"type": "phrase_prefix",
"query": "6c59f652-f1f5-11ee-86b2-562c83a610e2",
"fields": [
"_line"
],
"lenient": true
}
}
]
}
}
}
[!NOTE] It seems like it's the length of the query we're trying to match here: if I shorten the search query it finishes, but the time increases the longer the query gets.
- 6c59f652: 19556 ms
- 6c59f652-f1f5-11ee: 32743 ms
- 6c59f652-f1f5-11ee-86b2-562c83a610e2: did not finish
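For reference, a minimal timing sketch for the three prefix lengths above. It is only illustrative: the searcher address (http://localhost:7280) and the use of the Python `requests` library are assumptions; the endpoint path, index pattern, and query body are copied from the example request.

```python
# Minimal sketch: time the phrase_prefix query for prefixes of increasing length.
# The base URL below is an assumption; adjust it to your searcher service.
import time

import requests

BASE = "http://localhost:7280/api/v1/_elastic"
INDEX = "logline.996226df4b.2024-06-03*,logline.996226df4b.2024-06-04*"

prefixes = [
    "6c59f652",
    "6c59f652-f1f5-11ee",
    "6c59f652-f1f5-11ee-86b2-562c83a610e2",
]

for prefix in prefixes:
    body = {
        "size": 0,
        "query": {
            "bool": {
                "must": [
                    {
                        "multi_match": {
                            "type": "phrase_prefix",
                            "query": prefix,
                            "fields": ["_line"],
                            "lenient": True,
                        }
                    }
                ]
            }
        },
    }
    start = time.monotonic()
    resp = requests.post(f"{BASE}/{INDEX}/_search", json=body, timeout=300)
    elapsed_ms = (time.monotonic() - start) * 1000
    print(f"{prefix}: {elapsed_ms:.0f} ms (HTTP {resp.status_code})")
```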
The memory usage spikes above the defined memory limits (4 GB), and the longer-running queries take long enough that the Kubernetes OOM killer terminates the pod running the query.
For comparison, the timings from Elasticsearch on a near-identical data set:
- 6c59f652: 5277 ms
- 6c59f652-f1f5-11ee: 3378 ms
- 6c59f652-f1f5-11ee-86b2-562c83a610e2: 5365 ms
Example Document
{
"__key": "logline:996226df4b:webapp:ldw-5c4b45dd7d-w94fv:ld11",
"_account": "996226df4b",
"_app": "webapp",
"_bid": "34c1c7cb-e8c4-4190-afcc-5c2591384b7e:31899:ld11",
"_cluster": "ld11",
"_file": "/var/log/containers/ldw-5c4b45dd7d-w94fv_default_webapp-0fe267d84862d915a960d4159a47f52d560b4dfd072a1b6d4691c3979a5c3369.log",
"_host": "ldw-5c4b45dd7d-w94fv",
"_id": "1751243621446717490",
"_ingester": "logdna-agent/3.10.0-beta.1 (Amazon,Linux/2)",
"_ip": "***REDACTED***",
"_label": {
"app": "ldw",
"pod-template-hash": "5c4b45dd7d"
},
"_lid": "1751243621446717490",
"_line": "[240604 19:23:32] [info] {\"message\":\"Proxy request from 2a6204e8c7 to /v3/pipeline/6c59f652-f1f5-11ee-86b2-562c83a610e2\",\"meta\":{\"stack\":[\"at ProxyServer.<anonymous> (/opt/app/server/config/proxies/proxy.js:15:12)\",\"at ProxyServer.emit (/opt/app/node_modules/eventemitter3/index.js:210:27)\",\"at ClientRequest.<anonymous> (/opt/app/node_modules/http-proxy/lib/http-proxy/passes/web-incoming.js:133:16)\",\"at ClientRequest.emit (node:events:519:28)\",\"at ClientRequest.emit (node:domain:488:12)\",\"at tickOnSocket (node:_http_client:839:7)\",\"at onSocketNT (node:_http_client:902:5)\",\"at process.processTicksAndRejections (node:internal/process/task_queues:83:21)\"],\"version\":\"8.5.108\",\"commit\":\"32c4bf268d\"}}",
"_logtype": "json",
"_meta": {
"Image Name": "us.gcr.io/logdna-k8s/logdna-web",
"Tag": "8.5.108"
},
"_mezmo_line_size": "1305",
"_originating_user_agent": null,
"_tag": [
"agent-v2",
"prod",
"us-east-1",
"aws"
],
"_ts": 1717529012835,
"container": "webapp",
"containerid": "0fe267d84862d915a960d4159a47f52d560b4dfd072a1b6d4691c3979a5c3369",
"level": "info",
"message": "Proxy request from 2a6204e8c7 to /v3/pipeline/6c59f652-f1f5-11ee-86b2-562c83a610e2",
"namespace": "default",
"node": "***REDACTED***",
"o_meta": {
"commit": "32c4bf268d",
"stack": [
"at ProxyServer.<anonymous> (/opt/app/server/config/proxies/proxy.js:15:12)",
"at ProxyServer.emit (/opt/app/node_modules/eventemitter3/index.js:210:27)",
"at ClientRequest.<anonymous> (/opt/app/node_modules/http-proxy/lib/http-proxy/passes/web-incoming.js:133:16)",
"at ClientRequest.emit (node:events:519:28)",
"at ClientRequest.emit (node:domain:488:12)",
"at tickOnSocket (node:_http_client:839:7)",
"at onSocketNT (node:_http_client:902:5)",
"at process.processTicksAndRejections (node:internal/process/task_queues:83:21)"
],
"version": "8.5.108"
},
"pod": "ldw-5c4b45dd7d-w94fv"
}
Expected behavior: I would generally expect memory to remain under control and/or be released after the query completes. There don't seem to be any settings to help control memory. We have strict memory limits on Quickwit running on Kubernetes, but Quickwit itself seems to have no internal bound on memory usage (or a leak).
Every time this query is run, memory makes a significant jump and never seems to come down.
Additionally, this kind of query is very slow and fails to complete in a reasonable amount of time (10-30 seconds).
Configuration:
- Output of
quickwit --version
Quickwit 0.8.0 (aarch64-unknown-linux-gnu 2024-06-04T12:57:54Z 67a798d)
- The node configuration (the relevant settings live in the node config rather than index_config.yaml)
version: '0.8'
# metastore_uri: postgres connection string generated by postgres operator
default_index_root_uri: s3://***REDACTED***/indexes
searcher:
  fast_field_cache_capacity: 2G # default 1G
  aggregation_memory_limit: 500M # default 500M
  partial_request_cache_capacity: 64M # default 64M
  max_num_concurrent_split_searches: 100 # default 100
  split_cache:
    max_num_bytes: 2000G
    max_num_splits: 1000000
    num_concurrent_downloads: 4
indexer:
  enable_cooperative_indexing: true
  enable_otlp_endpoint: true
  cpu_capacity: 1000m
ingest_api:
  max_queue_disk_usage: 4GiB
storage:
  s3:
    region: us-east-1
    endpoint: https://s3.amazonaws.com