Poor performance with query reports
Fleet version: v4.76.0 and below
Web browser and operating system: N/A
💥 Actual behavior
More and more customers want to leverage query reports in one way or another as part of their scheduled queries. However, our eviction test method is very inefficient: even 1-2 queries matching all hosts on an instance can consume 50% or more of a database's capacity, depending on scaling. The hot query looks to be:
SELECT COUNT(*) FROM `query_results` WHERE `query_id` = ? AND `data` IS NOT NULL
which as far as I can tell is run once for each row that is inserted in order to see if eviction needs to happen. This makes the eviction test significantly more impactful than the INSERT itself.
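For reference, here is a rough sketch of that per-row pattern as I understand it from the slow query log; this is not the actual Fleet code, and the INSERT column list is an assumption:

```sql
-- Sketch of the per-row eviction check as observed (not the actual Fleet code).
-- For every incoming result row:

-- 1. Count every stored, non-empty result for the query across ALL hosts.
--    At tens of thousands of hosts this aggregates a very large number of rows
--    on every single insert.
SELECT COUNT(*) FROM query_results WHERE query_id = ? AND data IS NOT NULL;

-- 2. Only if the count is still under the global per-query cap, insert the new
--    row (column list assumed for illustration).
INSERT INTO query_results (query_id, host_id, last_fetched, data)
VALUES (?, ?, NOW(), ?);
```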
Here are some examples of real-world loads when this feature is turned on:
Reader: (screenshot of reader instance load)
Writer: (screenshot of writer instance load)
The actual INSERTs themselves and the SELECTs of the data never show up as hot items at all.
🛠️ To fix
Find another method of evicting old items even if it is done lazily.
Alternatively, always accept 1 row per host per query and UPSERT the entry for that host. Even with a large number of queries/hosts, this is likely less impactful. If we do this, we'd also want to make sure to clean up rows for host_ids removed on host deletion. The assumption would be that this isn't meant to hold long-term data and that log destinations would still be the target for that, but one result per host would at least be usable (see the SQL sketch below).
Or do both. Lazy delete on a cron for older entries and UPSERT as the preferred method of not over-growing the table.
If we need more performance here, partitioning strategies could be leveraged as well.
I am also open to other options.
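A rough SQL sketch of the UPSERT plus lazy-cleanup idea (the unique key, column names, and retention window are all illustrative assumptions, not the current schema):

```sql
-- Sketch only: a (query_id, host_id) unique key does not exist today and would
-- need a migration (plus a dedupe of existing rows) before it could be added.
ALTER TABLE query_results
  ADD UNIQUE KEY idx_query_results_query_host (query_id, host_id);

-- One row per host per query: the newest result replaces the previous one, so
-- no COUNT(*) eviction check is needed on the insert path.
INSERT INTO query_results (query_id, host_id, last_fetched, data)
VALUES (?, ?, NOW(), ?)
ON DUPLICATE KEY UPDATE
  last_fetched = VALUES(last_fetched),
  data         = VALUES(data);

-- Lazy cleanup on a cron: drop rows for hosts that no longer exist and anything
-- older than some retention window (30 days here is just a placeholder).
DELETE qr FROM query_results qr
LEFT JOIN hosts h ON h.id = qr.host_id
WHERE h.id IS NULL
   OR qr.last_fetched < NOW() - INTERVAL 30 DAY;
```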
🧑‍💻 Steps to reproduce
- Schedule a query.
- Have query reports turned on and enabled on the query.
- Target a host count somewhere in the thousands to tens of thousands.
🕯️ More info (optional)
- @noahtalerman: The queries numa is running are in drive (Google doc).
We know the existing implementation is not performant, and the first thing we recommend is turning it off for installs of any meaningful size. But customers really want to use it, and it might not take much to make it at least more usable, even if not perfect.
@lukeheath I think this bug deserves a P2. It's preventing customers from using the query reports feature.
@noahtalerman @rfairburn How much overlap does this have with #35484? We should probably tweak the specs of that ticket so it solves a different problem than this one.
How much overlap does this have with https://github.com/fleetdm/fleet/issues/35484?
I'm not sure...
@sgress454 I think let's chat about this during the next estimation.
We estimated 13 for discovery + fix, kicking back to @noahtalerman for more research.
@zayhanlon are you able to get a list of queries that customer-numa is running?
@rfairburn are you able to pull this without me asking numa to do more stuff for us? if not, no worries. i'll send them a note
please don't include the list publicly in this issue
I can pull and share in slack or in a linked private ticket.
@sgress454 @iansltx I know one thought might be, instead of doing limits per query, to do them per query-host combo (with a limit of 1 by default). With that approach, the SELECTs to determine counts are much faster:
MySQL [fleet]> select count(*) from query_results where query_id = AAAA and host_id = BBBB;
+----------+
| count(*) |
+----------+
|        1 |
+----------+
1 row in set (0.000 sec)

MySQL [fleet]> select count(*) from query_results where query_id = AAAA;
+----------+
| count(*) |
+----------+
|    37270 |
+----------+
1 row in set (0.011 sec)

MySQL [fleet]> select count(*) from query_results;
+----------+
| count(*) |
+----------+
|   918557 |
+----------+
1 row in set (0.218 sec)
My thought here is that, even in the case of deleting an old row before inserting, deletes from other responders wouldn't cause the plans for these limit-check SELECTs to change while querying, and therefore wouldn't add lock time. This might allow us to operate in a similar fashion without as much overhead or other contention.
There is an open question about what we would do when a host is over the per-host limit. Do we just stop forever for that host, per the existing per-query behavior, or do we clear the oldest items over the count so we make room for our newest result?
I'm not sure what the best path forward is, but this would give some runway.
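If we go with the "clear the oldest items" option, one possible shape is below. It's a sketch only: it assumes the auto-increment id column and an index covering (query_id, host_id), and it keeps just the single newest row per query/host pair:

```sql
-- Sketch only: remove everything except the newest row for this query/host
-- pair before inserting the fresh result. The derived table is materialized,
-- which avoids MySQL's restriction on deleting from a table that is also
-- referenced in a subquery.
DELETE qr
FROM query_results qr
JOIN (
  SELECT MAX(id) AS keep_id
  FROM query_results
  WHERE query_id = ? AND host_id = ?
) newest ON qr.id < newest.keep_id
WHERE qr.query_id = ? AND qr.host_id = ?;
```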
Do we have any requirements / expectations / documentation showing that we will hold > 1 row per host per query? If not, and we're OK limiting to 1, then upserting seems like the simplest, most performant solution. Otherwise there are different options depending on what scale we need to support.
My bet is that some customer queries are one row per host per query (or a set number), while others are an arbitrary number of rows. My other bet is that the former cases are the cases that slow things down now.
If my bets pan out, allowing folks to opt, on a per-query basis, into eviction based on per-host result counts instead of per-query counts would allow us to run everything else with smaller result count limits. But we need to confirm workload information first to see if this is viable.
I discovered that there were queries with very frequent intervals turned on in the environment here, targeting over 50k hosts. One had an interval as short as 600s and was causing 10+ inserts per second. I'm not sure if these were turned on intentionally, but they are likely significant contributors to the load we were seeing, and I would assume they also end up outside the scope of anything we could realistically want to support.
Supporting a large number of hosts and/or queries in reports would still need to be heavily caveated: a frequent schedule can still break things. The DB usage is multiplicative (queries * hosts * frequency) when multiple frequent queries target most or all hosts, especially when the SELECT COUNT that happens before the delete and re-insert can lock the surrounding requests until it completes.
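For a rough sense of the multiplication (illustrative numbers only, not measurements from this environment): 5 report-enabled queries targeting 50,000 hosts on a 600-second interval works out to 5 * 50,000 / 600 ≈ 417 result rows written per second, each of which currently triggers its own COUNT(*) check first.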
I can pull and share in slack or in a linked private ticket.
@rfairburn thanks!
deferring to Ian on this
Hey @ddribeiro I assigned this bug to you. On the next call w/ numa can you please help us figure out which of their queries they want to have query reports for and which they don't? We're assuming some of these queries don't need reports in the Fleet UI. Their results are just sent to their data lake.
Knowing this will help us provide them a best practice and understand what product changes we want to make.
More context in this Slack thread.
cc @zayhanlon
@sharon-fdm Priority issue approval up to you.
@noahtalerman @rachaelshaw from numa: Unable to specify exact queries as the recurring use cases will be different. However, they would expect:
- At max, 10 rows, 10 columns resulting in output per query * per device
- On average, 1 row, 5 columns resulting in output per query * per device
- @noahtalerman: How many queries can I add?
hey @rachaelshaw and @noahtalerman - can i get a status update on spec for this? trying to understand when we may be able to get this into a sprint?
FYI @rachaelshaw added this to our 1:1 today to get this bug moving.
cc @zayhanlon
Moving @rfairburn's original fix here:
🛠️ To fix
Find another method of evicting old items even if it is done lazily.
Alternatively, always accept 1 row per host per query and UPSERT the entry for that host. Even with a large number of queries/hosts, this is likely less impactful. If we do this, we'd also want to make sure to clean up rows for host_ids removed on host deletion. The assumption would be that this isn't meant to hold long-term data and that log destinations would still be the target for that, but one result per host would at least be usable.
Or do both. Lazy delete on a cron for older entries and UPSERT as the preferred method of not over-growing the table.
If we need more performance here, partitioning strategies could be leveraged as well.
I am also open to other options.
@zayhanlon updated this issue's "To fix" with a plan Noah and I discussed the other day.
If the specified plan is estimated too high to justify bringing into the upcoming sprint, we will instead load test w/ no code changes.