Poor performance with query reports

Open · rfairburn opened this issue 2 months ago · 20 comments

Fleet version: v4.76.0 and below

Web browser and operating system: N/A


💥  Actual behavior

More and more customers want to use query reports in one way or another as part of their scheduled queries. However, our eviction check is very inefficient: even 1-2 report-enabled queries targeting all hosts on an instance can consume 50% or more of a database's capacity, depending on scaling. The hot query appears to be:

SELECT COUNT(*) FROM `query_results` WHERE `query_id` = ? AND `data` IS NOT NULL

which, as far as I can tell, runs once for each inserted row to determine whether eviction is needed. That makes the eviction check significantly more expensive than the INSERT itself.
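
One lighter-weight alternative (a sketch, not the current implementation) would be to ask only whether the per-query cap has been reached, rather than counting every row, so the scan can stop early; the 999/1000 below is a placeholder for whatever the configured cap is:

-- Sketch: returns a row only if at least 1,000 matching rows exist,
-- so MySQL stops after reading ~1,000 rows instead of counting them all.
SELECT 1 FROM `query_results`
WHERE `query_id` = ? AND `data` IS NOT NULL
LIMIT 1 OFFSET 999;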

Here are some examples of real-world loads when this feature is turned on:

Reader:

[Image: reader instance load]

Writer:

[Image: writer instance load]

The INSERTs themselves and the SELECTs that read the data never show up as hot items at all.

🛠️ To fix

Find another method of evicting old items even if it is done lazily.

Alternatively, always accept 1 row per host per query and UPSERT the entry for that host. Even with a large number of queries/hosts, this is likely less impactful. If we do this, we'd also want to clean up rows for host_ids that are removed on host deletion. The assumption is that this table isn't meant to hold long-term data and that log destinations remain the target for that, but one result per host would at least be usable.
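
As a rough sketch of the UPSERT variant (the column names and the unique key on (query_id, host_id) are assumptions about the schema, not a description of what exists today):

-- Hypothetical: assumes a UNIQUE KEY on (query_id, host_id).
INSERT INTO query_results (query_id, host_id, data, last_fetched)
VALUES (?, ?, ?, NOW())
ON DUPLICATE KEY UPDATE
  data = VALUES(data),
  last_fetched = VALUES(last_fetched);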

Or do both: lazy-delete older entries on a cron and use UPSERT as the preferred method of keeping the table from over-growing.
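
And a sketch of what the lazy cron cleanup could look like (the last_fetched column, the 7-day window, and the batch size are placeholders; deleting in batches keeps any single pass from holding locks for long):

-- Hypothetical cleanup run on a schedule; repeat until 0 rows are affected.
DELETE FROM query_results
WHERE last_fetched < NOW() - INTERVAL 7 DAY
LIMIT 10000;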

If we need more performance here, partitioning strategies could be leveraged as well.

I am also open to other options.

🧑‍💻  Steps to reproduce

  1. Schedule a query.
  2. Enable query reports globally and on the query.
  3. Target a host count in the thousands to tens of thousands.

🕯️ More info (optional)

  • @noahtalerman: The queries numa is running are in drive (Google doc).

rfairburn avatar Nov 12 '25 08:11 rfairburn

We know the existing implementation is not performant, and the first thing we recommend is turning it off for installs of any real size. But customers really want to use it, and it might not take much to make it at least more usable, even if not perfect.

@lukeheath I think this bug deserves a P2. It's preventing customers from using the query reports feature.

noahtalerman avatar Nov 12 '25 15:11 noahtalerman

@noahtalerman @rfairburn How much overlap does this have with #35484? We should probably tweak the specs of that ticket so it solves a different problem than this one.

iansltx avatar Nov 12 '25 15:11 iansltx

How much overlap does this have with https://github.com/fleetdm/fleet/issues/35484?

I'm not sure...

@sgress454 I think let's chat about this during the next estimation.

noahtalerman avatar Nov 12 '25 17:11 noahtalerman

We estimated 13 for discovery + fix, kicking back to @noahtalerman for more research.

sgress454 avatar Nov 13 '25 17:11 sgress454

@zayhanlon are you able to get a list of queries that customer-numa is running?

sgress454 avatar Nov 13 '25 17:11 sgress454

@rfairburn are you able to pull this without me asking numa to do more stuff for us? if not, no worries. i'll send them a note

zayhanlon avatar Nov 13 '25 17:11 zayhanlon

please don't include the list publicly in this issue

zayhanlon avatar Nov 13 '25 17:11 zayhanlon

I can pull and share in slack or in a linked private ticket.

rfairburn avatar Nov 13 '25 18:11 rfairburn

@sgress454 @iansltx One thought is to enforce limits per query-host combination (with a default limit of 1) instead of per query. The SELECTs to check counts are much faster that way:

MySQL [fleet]> select count(*) from query_results where query_id = AAAA and host_id = BBBB;
+----------+
| count(*) |
+----------+
|        1 |
+----------+
1 row in set (0.000 sec)

MySQL [fleet]> select count(*) from query_results where query_id = AAAA;
+----------+
| count(*) |
+----------+
|    37270 |
+----------+
1 row in set (0.011 sec)

MySQL [fleet]> select count(*) from query_results;
+----------+
| count(*) |
+----------+
|   918557 |
+----------+
1 row in set (0.218 sec)
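
The per-host count being essentially free suggests it can be served from an index covering both columns; if the table only has an index on query_id today, a composite index along these lines would keep the per-host check an index-only lookup (hypothetical DDL; the actual schema may already have an equivalent):

-- Hypothetical; skip if an equivalent index already exists.
CREATE INDEX idx_query_results_query_host ON query_results (query_id, host_id);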

My thought is that, even if we delete an old row before inserting, deletes triggered by other responding hosts wouldn't change the plans for these limit-check SELECTs while they run, and therefore wouldn't add lock time. This might let us operate in a similar fashion with much less overhead and contention.

There is an open question about what to do when a host is over the per-host limit. Do we just stop storing results for that host (as the existing per-query behavior does), or do we clear the oldest items over the count to make room for the newest result?
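
For the second option (clearing the oldest items over the count), one way to express it in MySQL 8, assuming an id primary key and a last_fetched timestamp, would be something like:

-- Hypothetical: keep only the newest N rows per host for a given query.
DELETE qr FROM query_results qr
JOIN (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY host_id ORDER BY last_fetched DESC) AS rn
    FROM query_results
    WHERE query_id = ?
) ranked ON ranked.id = qr.id
WHERE ranked.rn > ?;  -- second ? is the per-host row limit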

I'm not sure what the best path forward is, but this would give some runway.

rfairburn avatar Nov 13 '25 18:11 rfairburn

Do we have any requirements/expectations/documentation showing that we will hold more than 1 row per host per query? If not, and we're OK limiting to 1, then upserting seems like the simplest, most performant solution. Otherwise there are different options depending on what scale we need to support.

sgress454 avatar Nov 13 '25 19:11 sgress454

My bet is that some customer queries produce one row per host per query (or a set number of rows), while others produce an arbitrary number of rows. My other bet is that the former are the cases slowing things down now.

If my bets pan out, allowing folks to opt into eviction based on per-host result counts, configured per query, instead of a single per-query total would let us run everything else with smaller result-count limits. But we need to confirm workload information first to see if this is viable.

iansltx avatar Nov 13 '25 19:11 iansltx

I discovered that there were queries with very frequent intervals turned on in this environment, targeting more than 50k hosts. One had an interval as short as 600s and was causing 10+ inserts per second. I'm not sure whether these were enabled intentionally, but they are likely significant contributors to the load we were seeing, and I'd assume they fall outside the scope of anything we could realistically want to support.

Supporting a large number of hosts and/or queries in reports would still need the heavy caveat that a frequent schedule can break things. DB usage is multiplicative, roughly queries × hosts × frequency, when multiple frequent queries target most or all hosts; for example, five report-enabled queries over 50k hosts on a 600-second interval works out to roughly 400 inserts per second in the worst case. That's especially painful when the SELECT COUNT that happens before each delete and re-insert can lock surrounding requests until it completes.

rfairburn avatar Nov 13 '25 20:11 rfairburn

I can pull and share in slack or in a linked private ticket.

@rfairburn thanks!

noahtalerman avatar Nov 13 '25 20:11 noahtalerman

Linked to Unthread ticket:

User request to add the cap back in #10053

Sampfluger88 avatar Nov 13 '25 21:11 Sampfluger88

[Image]

deferring to Ian on this

sgress454 avatar Nov 19 '25 20:11 sgress454

Hey @ddribeiro I assigned this bug to you. On the next call w/ numa can you please help us figure out which of their queries they want to have query reports for and which they don't? We're assuming some of these queries don't need reports in the Fleet UI. Their results are just sent to their data lake.

Knowing this will help us provide them a best practice and understand what product changes we want to make.

More context in this Slack thread.

cc @zayhanlon

noahtalerman avatar Nov 20 '25 15:11 noahtalerman

@sharon-fdm Priority issue approval up to you.

lukeheath avatar Nov 24 '25 17:11 lukeheath

@noahtalerman @rachaelshaw from numa: They can't specify exact queries, since the recurring use cases will differ. However, they would expect:

  • At max: 10 rows × 10 columns of output per query per device

  • On average: 1 row × 5 columns of output per query per device

  • @noahtalerman: How many queries can I add?

zayhanlon avatar Dec 09 '25 14:12 zayhanlon

hey @rachaelshaw and @noahtalerman - can i get a status update on spec for this? trying to understand when we may be able to get this into a sprint?

zayhanlon avatar Dec 10 '25 20:12 zayhanlon

FYI @rachaelshaw added this to our 1:1 today to get this bug moving.

cc @zayhanlon

noahtalerman avatar Dec 12 '25 14:12 noahtalerman

Moving @rfairburn's original fix here:

🛠️ To fix

Find another method of evicting old items even if it is done lazily.

Alternatively, always accept 1 row per host per query and UPSERT the entry for that host. Even with a large number of queries/hosts, this is likely less impactful. If we do this, we'd also want to clean up rows for host_ids that are removed on host deletion. The assumption is that this table isn't meant to hold long-term data and that log destinations remain the target for that, but one result per host would at least be usable.

Or do both: lazy-delete older entries on a cron and use UPSERT as the preferred method of keeping the table from over-growing.

If we need more performance here, partitioning strategies could be leveraged as well.

I am also open to other options.

rachaelshaw avatar Dec 17 '25 21:12 rachaelshaw

@zayhanlon updated this issue's "To fix" with a plan Noah and I discussed the other day.

If the specified plan is estimated too high to justify bringing into the upcoming sprint, we will instead load test w/ no code changes.

rachaelshaw avatar Dec 17 '25 21:12 rachaelshaw