Performance issues after upgrading fleetd
Fleet version: Fleet v4.30.0
💥 Actual behavior
After upgrading fleetd to v1.19.0, the customer observed that the online/offline host count was fluctuating.
The issue resolved itself briefly after restarting the MySQL database and Fleet instances, but hosts began to show as offline again over time.
When reviewing customer logs, I was able to identify that the customer has duplicate hosts that are continuously re-enrolling in Fleet. It is believed that this, coupled with added functionality in fleetd, overloaded the database.
I have requested a full breakdown of the customer environment so that we can work on relieving the immediate pressure and confirm this root-cause analysis.
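One way the duplicate-host hypothesis might be checked against an exported host list is sketched below. This is a minimal illustration, not Fleet's own logic: the record fields (`id`, `uuid`, `hostname`) and the sample data are assumptions made for the example, and the real export format may differ.

```python
from collections import defaultdict


def find_duplicate_hosts(hosts, key="uuid"):
    """Group host records by an identifier and return identifiers seen more than once.

    `hosts` is a list of dicts; the field names are hypothetical.
    """
    groups = defaultdict(list)
    for host in hosts:
        identifier = host.get(key)
        if identifier:  # skip records with a missing or empty identifier
            groups[identifier].append(host["id"])
    # Keep only identifiers shared by more than one host record.
    return {ident: ids for ident, ids in groups.items() if len(ids) > 1}


# Hypothetical sample data, not taken from the customer environment.
hosts = [
    {"id": 1, "uuid": "AAAA-1111", "hostname": "web-01"},
    {"id": 2, "uuid": "BBBB-2222", "hostname": "web-02"},
    {"id": 3, "uuid": "AAAA-1111", "hostname": "web-01-clone"},
]

duplicates = find_duplicate_hosts(hosts)
print(duplicates)  # {'AAAA-1111': [1, 3]}
```

Any identifier that maps to more than one host record is a candidate for the re-enrollment loop described above and is worth cross-checking against the enrollment logs.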
To ultimately resolve these duplicate hosts, we will also need to get the customer upgraded to a version of Fleet that supports setting the host identifier used by fleetd and osquery. However, the customer is concerned that they will run into issues with some of the older versions of fleetd present in their environment.
While getting all hosts up-to-date is the ultimate goal, there are some blockers to that process, and there are currently versions of fleetd prior to 1.0 checking in to Fleet.
I've checked to see if there are any anticipated issues with communication between these hosts and the current release, but do not have a solid answer on that.
While we continue working to gather additional information and resolve the immediate performance issues, I'm proposing that we run a test to confirm compatibility between all released versions of fleetd and the current release of Fleet.
🧑💻 Steps to reproduce
🕯️ More info (optional)
- Run Fleet v4.45.0 locally.
- Enroll hosts using every fleetd minor release from 0.0.9 to the current version.
- Confirm the host responds as expected to scheduled queries, live queries, and policy checks.
- If there are any anomalies, document what version of fleetd and what happened.
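
The steps above could be tracked with a simple results matrix. The sketch below is only an illustration of how that might be recorded: the version list is a placeholder subset (the full list of fleetd minor releases would need to be filled in), and the helper names `build_matrix`, `record`, and `anomalies` are hypothetical.

```python
from itertools import product

# Placeholder subset; replace with the real list of fleetd minor releases.
FLEETD_VERSIONS = ["0.0.9", "1.0.0", "1.19.0"]
CHECKS = ["scheduled queries", "live queries", "policy checks"]


def build_matrix(versions, checks):
    """Create one untested row per (fleetd version, check) pair."""
    return [
        {"fleetd": version, "check": check, "result": "untested", "notes": ""}
        for version, check in product(versions, checks)
    ]


def record(matrix, version, check, passed, notes=""):
    """Store the outcome of a manual test run in the matching row."""
    for row in matrix:
        if row["fleetd"] == version and row["check"] == check:
            row["result"] = "pass" if passed else "fail"
            row["notes"] = notes
            return row
    raise KeyError(f"no row for fleetd {version!r} and check {check!r}")


def anomalies(matrix):
    """Return the rows that failed, i.e. what needs to be documented."""
    return [row for row in matrix if row["result"] == "fail"]


matrix = build_matrix(FLEETD_VERSIONS, CHECKS)
# Hypothetical outcome, for illustration only.
record(matrix, "0.0.9", "live queries", passed=False, notes="example note")
for row in anomalies(matrix):
    print(row["fleetd"], row["check"], row["notes"])
```

Rows left as "untested" make it easy to see which version and check combinations still need coverage.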
I have an archive available for review.
@ksatter, our environment for testing large numbers of hosts is based on a simulated agent, so it is not possible to create thousands of real fleetd hosts running version 1.19.0. However, we could spin up an environment with half a dozen or so real fleetd 1.19.0 hosts and see if they work with the latest Fleet server. This would check for any incompatibility (but not any mass/load issues).
TMWYT
cc: @lucasmrod
@sharon-fdm The main concern is whether v4.45.0 is backwards compatible with the very early versions of fleetd (still waiting for the exact version number, but somewhere in the 0.9 range)
@ksatter Thanks for adding the context! I added a bulleted list of steps required for this issue under "More info" in the issue description. Does that look right?
@sharon-fdm It sounds like this is essentially a QA issue where we go through and check the last ~10 minor versions of Orbit/Desktop against Fleet v4.45.0 to make sure it works as expected, and if not, document what doesn't work.
@noahtalerman From a product perspective, what is the expected interoperability between old fleetd agents and the latest version of Fleet?
Hi folks.
A while ago we agreed to follow this strategy: https://github.com/fleetdm/fleet/blob/main/docs/Contributing/fleetd-development-and-release-strategy.md
@kswagler-rh Alternatively, can the customer stay on older versions of fleetd? I would expect there to be issues running versions that far apart. The most straight-forward approach would be for the customer to stay on v4.30.0 and the corresponding fleetd version (the version that was released at the same time). Then, when they're ready to upgrade the fleetd version, they can upgrade Fleet at the same time.
Are there blockers to that approach?
> A while ago we agreed to follow this strategy: https://github.com/fleetdm/fleet/blob/main/docs/Contributing/fleetd-development-and-release-strategy.md
Thanks, Lucas!
@kswagler-rh As defined in the document Lucas shared, our commitment is for new versions of fleetd to work with older versions of Fleet. As for old versions of fleetd working with new versions of Fleet, there would be some issues because features that are available in Fleet would not yet be available on hosts running old versions of fleetd if the feature required changes to fleetd.
> what is the expected interoperability between old fleetd agents and the latest version of Fleet?
The below is the expected approach. We don't do a good job documenting this in the user facing docs. Improving this is covered by this issue: #16349
> The most straight-forward approach would be for the customer to stay on v4.30.0 and the corresponding fleetd version (the version that was released at the same time). Then, when they're ready to upgrade the fleetd version, they can upgrade Fleet at the same time.
On Monday (2024-02-09), @lucasmrod, @sharon-fdm, @xpkoala, @ksatter and I are meeting to align this approach and document it.
@noahtalerman Apologies if I wasn't super clear. In this case, the customer would benefit greatly from getting Fleet and fleetd upgraded to a point where they can use the new host identifier functionality. Their concern is the "legacy" installs of fleetd that they may not be able to get up-to-date.
There are two issues in the discussion above:
- Rules for compatibility. I am in favour of revisiting them next Monday (Mar 4th) with @noahtalerman
- Specifically for customer_erda, I agree with @lukeheath we could run some QA tests with old agents (~0.9) per our QA priorities (Have a few things in the pipe)
I am reclassifying this as a P2 per our new product group priority system. I am assigning @sharon-fdm and @noahtalerman to decide how to move forward on this.
> reclassifying this as a P2 per our new product group priority system.
Got it.
> I'm proposing that we run a test to confirm compatibility between all released versions of fleetd and the current release of Fleet.
@ksatter I don't understand why we need to test all released versions of fleetd.
Let's plan to chat about this during our "Discuss supported fleetd version" call on Monday.
We can also get input from Sharon and Lucas on the level of effort for this.
Please let me know if this needs to happen sooner.
@noahtalerman, I'm following up on this. Would you like to set up a discussion for it?
We met w/ the customer and decided to help them upgrade to a stable version (latest - 1) of the Fleet server.
Action items from the call:
- [ ] Ask the customer what features/workflows they're using. This will help inform the test server Fleet uses
- [x] Ask the customer for a list of all fleetd (agent) versions
After we have this info, the plan is to get the engineering team's help to run the tests.
Moving this to the customer-success board. @ksatter, we assigned this to you. Please let us know if you have any questions!
cc @Patagonia121 @lukeheath @pacamaster
Any information we can get about their environment is helpful. It will help us better replicate their environment and make our migration tests more effective.
@lukeheath @noahtalerman The customer replied with a list of all osquery agent versions in use across their Fleet:
- Osquery 2.11.2
- Osquery 3.3.2
- Osquery 4.5.1
- Osquery 4.6.0
- Osquery 4.8.0
- Osquery 5.2.2
- Osquery 5.9.1
- Osquery 5.11.0
- Osquery 5.12.1
Their comment: "Obviously supporting all of these with the new version would be unreasonable, but if you guys can provide us a cut-off as to where these would be likely be unsupported that would greatly help"
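Once a cut-off is decided, splitting the reported versions around it is mechanical. The sketch below shows one way to do that. The `5.0.0` cut-off is an arbitrary placeholder used only to demonstrate the split, not a supported-version recommendation, and the helper names are hypothetical. Versions are compared as integer tuples because a plain string sort would place 5.11.0 before 5.2.2.

```python
def parse_version(text):
    """Turn '5.11.0' into (5, 11, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def split_by_cutoff(versions, cutoff):
    """Return (below_cutoff, at_or_above_cutoff), each sorted in ascending order."""
    floor = parse_version(cutoff)
    ordered = sorted(versions, key=parse_version)
    below = [v for v in ordered if parse_version(v) < floor]
    at_or_above = [v for v in ordered if parse_version(v) >= floor]
    return below, at_or_above


# Versions as reported by the customer in this thread.
REPORTED = [
    "2.11.2", "3.3.2", "4.5.1", "4.6.0", "4.8.0",
    "5.2.2", "5.9.1", "5.11.0", "5.12.1",
]

# "5.0.0" is a placeholder cut-off chosen only for illustration.
below, at_or_above = split_by_cutoff(REPORTED, "5.0.0")
print("below cut-off:", below)
print("at or above cut-off:", at_or_above)
```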
@Patagonia121 Thank you!
@noahtalerman I think the next step is to run load tests on each version to determine which are performant. That's going to take some resources, so we should bring this through estimation so we can plan it into sprint capacity accordingly.
Sweet!
I added a user story to the issue description and started on the testing requirements.
@lukeheath I think this is less about performance and more about the features the customer uses continuing to work.
@ksatter does that sound right?
In order to bring this one through estimation, I think we want a list of the features that the customer uses (ex. scheduled queries / live queries).
This way, we know what feature we want to test.
@ksatter can you please help us get this list of features? Once we have that list, please pass this issue to me (add :product) so we can get the testing rolling.
I moved the original issue description below for safekeeping.
Fleet version: Fleet v4.30.0
💥 Actual behavior
After upgrading fleetd to v1.19.0, the customer observed that the online/offline host count was fluctuating.
The issue resolved itself briefly after restarting the MySQL database and Fleet instances, but hosts began to show as offline again over time.
When reviewing customer logs, I was able to identify that the customer has duplicate hosts that are continuously re-enrolling in Fleet. It is believed that this, coupled with added functionality in fleetd, overloaded the database.
I have requested a full breakdown of the customer environment so that we can work on relieving the immediate pressure and confirm this root-cause analysis.
To ultimately resolve these duplicate hosts, we will also need to get the customer upgraded to a version of Fleet that supports setting the host identifier used by fleetd and osquery. However, the customer is concerned that they will run into issues with some of the older versions of fleetd present in their environment.
While getting all hosts up-to-date is the ultimate goal, there are some blockers to that process, and there are currently versions of fleetd prior to 1.0 checking in to Fleet.
I've checked to see if there are any anticipated issues with communication between these hosts and the current release, but do not have a solid answer on that.
While we continue working to gather additional information and resolve the immediate performance issues, I'm proposing that we run a test to confirm compatibility between all released versions of fleetd and the current release of Fleet.
🧑💻 Steps to reproduce
🕯️ More info (optional)
- Run Fleet v4.45.0 locally.
- Enroll hosts using every fleetd minor release from 0.0.9 to the current version.
- Confirm the host responds as expected to scheduled queries, live queries, and policy checks.
- If there are any anomalies, document what version of fleetd and what happened.
I have an archive available for review.
cc @Patagonia121 ^^
@noahtalerman
> In order to bring this one through estimation, I think we want a list of the features that the customer uses (ex. scheduled queries / live queries).
Agent communication, live queries and query packs would be the key pieces
Thanks for adding the features to test @ksatter!
Heads up that I updated the plan to test w/ Fleet 4.49.4. I think this means we'll recommend that the customer upgrades to 4.49.4 after testing.
Note that the customer wanted to upgrade to latest - 1 (N - 1).
Let me know if you have any concerns with that.
cc @zayhanlon
Hey @zayhanlon FYI the testing for customer-erda is scheduled to start next sprint (2024-06-03 kickoff).
@sharon-fdm heads up, I just moved this story into the specified column. Can you please work w/ the team to get it estimated today?
@ksatter FYI on timelines (eta for 4.52.0 is 6/21)
moved to Waiting pending clarity on testing requirements
@ksatter @Patagonia121 when ready please approve the doc so we can close this.
Sounds good @sharon-fdm, I'm going to sync with @ksatter today and we should be able to close this out after we discuss. Thanks!
@sharon-fdm this looks good, we can hand off this doc tomorrow to the customer during our regular call. Go ahead and close it out. Thanks!