fleet icon indicating copy to clipboard operation
fleet copied to clipboard

Hitting limit on max_prepared_stmt_count

Open mason-buettner opened this issue 6 months ago • 3 comments

Fleet version: 4.53.2

Web browser and operating system: TODO


💥  Actual behavior

customer-blanco is observing the following server errors after upgrading Orbit from 1.19 to 1.37.

err="internal error: fetch base config: load team agent options for host: select team: Error 1461 (42000): Can't create more than max_prepared_stmt_count statements (current value: 16382)"

customer-blanco is seeing up to 500k of these errors per minute.

🧑‍💻  Steps to reproduce

  1. TODO
  2. TODO

🕯️ More info (optional)

N/A

mason-buettner avatar Jun 20 '25 17:06 mason-buettner

More context (redacted log):

component=http method=POST uri=/api/v1/osquery/config took=993.336µs err="internal error: fetch base config: load team agent options for host: select team: Error 1461 (42000): Can't create more than max_prepared…

iansltx avatar Jun 20 '25 17:06 iansltx

Per further investigation, this appears to be an infrastructure configuration issue. The query that's failing is a simple "select this value from the team where ID = ?" that's just being accessed highly concurrently.

Normally team agent options are cached so Redis would serve the traffic, but this appears to be a thundering herd issue where the data isn't getting retrieved from the database so the cache can't be written.

There may be a workaround for this, and this workaround could potentially be cleanly applied to the version the customer is running, but infrastructure config seems to be the foremost issue here.

iansltx avatar Jun 20 '25 18:06 iansltx

Based on additional info gathered in troubleshooting, while all signs point to improperly tuned infrastructure, there are a few places where we're currently using prepared statements where we could get by without them. The commits above clean up two of them, and I'll be investigating a third shortly in its own PR. We'll need to load-test the fix here to confirm that Go isn't creating a prepared statement anyway, but this fix should clean up prepared statement usage by a fair amount and probably decrease database load a little at scale.

iansltx avatar Jun 20 '25 20:06 iansltx

@jmwatts @PezHub QA for this, once both PRs are merged, is covered in normal smoke tests, plus ensuring Windows and macOS MDM migrations work, and macOS Setup Experience works.

For load testing, I'd expect measurably lower database load (probably slight) compared to versions of Fleet without the fixes applied, and measurably lower results for SHOW GLOBAL STATUS LIKE '%prepared_stmt_count%'; and potentially SHOW GLOBAL STATUS LIKE '%Threads_connected%';. I would expect the differences to be more significant with MDM turned on than with it turned off.

iansltx avatar Jun 24 '25 02:06 iansltx

@iansltx in the load test environment I'm getting:

Writer instance: 143 for SHOW GLOBAL STATUS LIKE '%Threads_connected%'; and 5 for SHOW GLOBAL STATUS LIKE '%prepared_stmt_count%';

Reader instance: 183 for SHOW GLOBAL STATUS LIKE '%Threads_connected%'; and 535 for SHOW GLOBAL STATUS LIKE '%prepared_stmt_count%';

This is with 10k hosts, a bunch of which are currently processing software installs and refetches.

I'm not sure what the baseline I'm looking for is. Thoughts?

jmwatts avatar Jul 11 '25 20:07 jmwatts

@jmwatts Checking internally if we have an available <= 4.70.1 env to compare with, so we don't have to set up a load test env on an older version to validate this.

iansltx avatar Jul 11 '25 20:07 iansltx

@iansltx On new non-MDM fleet instance with 10k hosts, mostly idle:

Writer instance: 139 for SHOW GLOBAL STATUS LIKE '%Threads_connected%'; 6 for SHOW GLOBAL STATUS LIKE '%prepared_stmt_count%';

Reader instance: 138 for SHOW GLOBAL STATUS LIKE '%Threads_connected%'; 386 for SHOW GLOBAL STATUS LIKE '%prepared_stmt_count%';

jmwatts avatar Jul 14 '25 20:07 jmwatts

On a new non-MDM 4.70.1 instance with 10k hosts mostly idle:

Writer instance: 121 for SHOW GLOBAL STATUS LIKE '%Threads_connected%'; 5 for SHOW GLOBAL STATUS LIKE '%prepared_stmt_count%';

Reader instance: 138 for SHOW GLOBAL STATUS LIKE '%Threads_connected%'; 373 for SHOW GLOBAL STATUS LIKE '%prepared_stmt_count%';

jmwatts avatar Jul 14 '25 23:07 jmwatts

@jmwatts @PezHub QA for this, once both PRs are merged, is covered in normal smoke tests, plus ensuring Windows and macOS MDM migrations work, and macOS Setup Experience works.

MDM test results -

  • Confirmed macOS 3rd Party MDM migrations work
  • Confirmed Windows 3rd party MDM migrations work
  • Confirmed Setup Experience for macOS is working

PezHub avatar Jul 17 '25 18:07 PezHub

Upgrading Orbit brings, Errors swarm, a sea of bugs. Fleet's fix brings calm winds.

fleet-release avatar Jul 23 '25 22:07 fleet-release