Performance issues after upgrading fleetd

Open ksatter opened this issue 1 year ago • 13 comments

Fleet version: Fleet v4.30.0


💥  Actual behavior

After upgrading fleetd to v1.19.0, the customer observed that the online/offline host count was fluctuating.

The issue resolved itself briefly after restarting the MySQL database and Fleet instances, but hosts began to show as offline again over time.

When reviewing customer logs, I was able to identify that the customer has duplicate hosts that are continuously re-enrolling in Fleet. It is believed that this, coupled with added functionality in fleetd, overloaded the database.
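
As an illustration of the kind of check involved, here is a minimal sketch for spotting duplicate enrollments in an exported host list. It assumes each record carries an `id` and a `hardware_serial` field; the field names, the helper `find_duplicate_hosts`, and the sample data are assumptions for illustration and are not taken from the customer's logs.

```python
from collections import defaultdict


def find_duplicate_hosts(hosts, key="hardware_serial"):
    """Group host records by an identifier and return groups with more than one record."""
    groups = defaultdict(list)
    for host in hosts:
        identifier = host.get(key)
        if identifier:  # skip records with a missing or empty identifier
            groups[identifier].append(host["id"])
    # Keep only identifiers that map to several host records.
    return {ident: ids for ident, ids in groups.items() if len(ids) > 1}


# Hypothetical sample data, not taken from the customer's environment.
sample_hosts = [
    {"id": 1, "hostname": "web-01", "hardware_serial": "ABC123"},
    {"id": 2, "hostname": "web-01", "hardware_serial": "ABC123"},
    {"id": 3, "hostname": "db-01", "hardware_serial": "XYZ789"},
]

duplicates = find_duplicate_hosts(sample_hosts)
print(duplicates)
```

Grouping on a hardware-level identifier rather than the hostname is a design choice here: hostnames can legitimately repeat, while a repeated serial is a stronger hint that one machine enrolled more than once.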

I have requested a full breakdown of the customer environment so that we can work on relieving the immediate pressure and confirm this root-cause analysis.

To ultimately resolve these duplicate hosts, we will also need to get the customer upgraded to a version of Fleet that supports setting the host identifier used by fleetd and osquery, but the customer is concerned that they will run into issues with some of the older versions of fleetd that are present in their environment.

While getting all hosts up-to-date is the ultimate goal, there are some blockers to that process, and there are currently versions of fleetd prior to 1.0 checking in to Fleet.

I've checked to see if there are any anticipated issues with communication between these hosts and the current release, but do not have a solid answer on that.

While we continue working to gather additional information and resolve the immediate performance issues, I'm proposing that we run a test to confirm compatibility between all released versions of fleetd and the current release of Fleet.

🧑‍💻  Steps to reproduce

🕯️ More info (optional)

  1. Run Fleet v4.45.0 locally.
  2. Enroll hosts using every fleetd minor release from 0.0.9 to the current version.
  3. Confirm the host responds as expected to scheduled queries, live queries, and policy checks.
  4. If there are any anomalies, document what version of fleetd and what happened.
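
The steps above can be tracked as a simple test matrix. The following is a rough sketch only: the helper names, the placeholder version list, and the recorded outcome are hypothetical and exist purely to show the bookkeeping, not actual test results.

```python
from itertools import product

# Features named in step 3.
FEATURES = ["scheduled queries", "live queries", "policy checks"]


def build_test_matrix(fleetd_versions, features=FEATURES):
    """Return one pending test case per (fleetd version, feature) pair."""
    return [
        {"fleetd": version, "feature": feature, "result": "pending", "notes": ""}
        for version, feature in product(fleetd_versions, features)
    ]


def anomalies(matrix):
    """Return the cases marked as failed, i.e. what step 4 asks to document."""
    return [case for case in matrix if case["result"] == "fail"]


# Placeholder version list; the real list would cover every minor release.
matrix = build_test_matrix(["0.0.9", "1.0.0", "1.19.0"])

# Hypothetical outcome, recorded only to show how an anomaly would be logged.
matrix[0].update(result="fail", notes="hypothetical example note")

print(len(matrix))
for case in anomalies(matrix):
    print(case["fleetd"], case["feature"], case["notes"])
```

Keeping one row per version and feature makes it easy to report exactly which fleetd version misbehaved and how.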

I have an archive available for review.

ksatter avatar Feb 24 '24 00:02 ksatter

@ksatter, our environment for testing large numbers of hosts is based on a simulated agent, so it's not possible to create thousands of real fleetd hosts running version 1.19.0. However, we could spin up an environment with half a dozen or so real fleetd 1.19.0 hosts and see if they work with the latest Fleet server. This would check for any incompatibility (but not for any mass/load issues).

TMWYT

sharon-fdm avatar Feb 26 '24 15:02 sharon-fdm

cc: @lucasmrod

sharon-fdm avatar Feb 26 '24 15:02 sharon-fdm

@sharon-fdm The main concern is whether v4.45.0 is backwards compatible with the very early versions of fleetd (still waiting for the exact version number, but somewhere in the 0.9 range)

ksatter avatar Feb 26 '24 15:02 ksatter

@ksatter Thanks for adding the context! I added a bulleted list of steps required for this issue under "More info" in the issue description. Does that look right?

@sharon-fdm It sounds like this is essentially a QA issue where we go through and check the last ~10 minor versions of Orbit/Desktop against Fleet v4.45.0 to make sure it works as expected, and if not, document what doesn't work.

lukeheath avatar Feb 29 '24 14:02 lukeheath

@noahtalerman From a product perspective, what is the expected interoperability between old fleetd agents and the latest version of Fleet?

lukeheath avatar Feb 29 '24 14:02 lukeheath

Hi folks.

A while ago we agreed to follow this strategy: https://github.com/fleetdm/fleet/blob/main/docs/Contributing/fleetd-development-and-release-strategy.md

lucasmrod avatar Feb 29 '24 14:02 lucasmrod

@kswagler-rh Alternatively, can the customer stay on older versions of fleetd? I would expect there to be issues running versions that far apart. The most straight-forward approach would be for the customer to stay on v4.30.0 and the corresponding fleetd version (the version that was released at the same time). Then, when they're ready to upgrade the fleetd version, they can upgrade Fleet at the same time.

Are there blockers to that approach?

lukeheath avatar Feb 29 '24 14:02 lukeheath

A while ago we agreed to follow this strategy: https://github.com/fleetdm/fleet/blob/main/docs/Contributing/fleetd-development-and-release-strategy.md

Thanks, Lucas!

@kswagler-rh As defined in the document Lucas shared, our commitment is for new versions of fleetd to work with older versions of Fleet. As for old versions of fleetd working with new versions of Fleet, there would be some issues: if a feature required changes to fleetd, it would be available in Fleet but not yet on hosts running old versions of fleetd.

lukeheath avatar Feb 29 '24 14:02 lukeheath

what is the expected interoperability between old fleetd agents and the latest version of Fleet?

The below is the expected approach. We don't do a good job documenting this in the user facing docs. Improving this is covered by this issue: #16349

The most straight-forward approach would be for the customer to stay on v4.30.0 and the corresponding fleetd version (the version that was released at the same time). Then, when they're ready to upgrade the fleetd version, they can upgrade Fleet at the same time.

On Monday (2024-03-04), @lucasmrod, @sharon-fdm, @xpkoala, @ksatter and I are meeting to align on this approach and document it.

noahtalerman avatar Feb 29 '24 14:02 noahtalerman

@noahtalerman Apologies if I wasn't super clear. In this case, the customer would benefit greatly from getting Fleet and fleetd upgraded to a point where they can use the new host identifier functionality. Their concern is the "legacy" installs of fleetd that they may not be able to get up-to-date.

ksatter avatar Feb 29 '24 15:02 ksatter

There are two issues in the discussion above:

  • Rules for compatibility. I am in favour of revisiting them next Monday (Mar 4th) with @noahtalerman
  • Specifically for customer_erda, I agree with @lukeheath we could run some QA tests with old agents (~0.9) per our QA priorities (Have a few things in the pipe)

sharon-fdm avatar Feb 29 '24 17:02 sharon-fdm

I am reclassifying this as a P2 per our new product group priority system. I am assigning @sharon-fdm and @noahtalerman to decide how to move forward on this.

lukeheath avatar Feb 29 '24 17:02 lukeheath

reclassifying this as a P2 per our new product group priority system.

Got it.

I'm proposing that we run a test to confirm compatibility between all released versions of fleetd and the current release of Fleet.

@ksatter I don't understand why we need to test all released versions of fleetd.

Let's plan to chat about this during our "Discuss supported fleetd version" call on Monday.

We can also get input from Sharon and Lucas on the level of effort for this.

Please let me know if this needs to happen sooner.

noahtalerman avatar Mar 01 '24 00:03 noahtalerman

@noahtalerman, I'm following up on this. Would you like to set up a discussion for it?

sharon-fdm avatar Mar 26 '24 15:03 sharon-fdm

Would you like to set a discussion for this?

Followed up in Slack here (internal).

noahtalerman avatar Mar 26 '24 22:03 noahtalerman

We met w/ the customer and decided to help them upgrade to a stable version (latest - 1) of the Fleet server.

Action items from the call:

  • [ ] Ask the customer what features/workflows they're using. This will help inform the test server Fleet uses
  • [x] Ask the customer for a list of all fleetd (agent) versions

After we have this info, the plan is to get the engineering team's help to run the tests.

Moving this to the customer-success board. @ksatter, we assigned this to you. Please let us know if you have any questions!

cc @Patagonia121 @lukeheath @pacamaster

noahtalerman avatar Apr 23 '24 17:04 noahtalerman

As much information as we can get about their environment is helpful. It will help us better replicate their environment and make our migration tests more effective.

lukeheath avatar Apr 24 '24 22:04 lukeheath

@lukeheath @noahtalerman The customer replied with a list of all osquery agent versions in use across their Fleet:

  • Osquery 2.11.2
  • Osquery 3.3.2
  • Osquery 4.5.1
  • Osquery 4.6.0
  • Osquery 4.8.0
  • Osquery 5.2.2
  • Osquery 5.9.1
  • Osquery 5.11.0
  • Osquery 5.12.1

Their comment: "Obviously supporting all of these with the new version would be unreasonable, but if you guys can provide us a cut-off as to where these would be likely be unsupported that would greatly help"
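
Once a cut-off is agreed, applying it to the reported versions is mechanical. The sketch below only shows that step; the `5.0.0` cut-off is a placeholder chosen to exercise the function and is not a support statement, and the helper names are hypothetical.

```python
def parse_version(text):
    """Turn '5.11.0' into (5, 11, 0) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.split("."))


# osquery versions reported by the customer.
REPORTED = ["2.11.2", "3.3.2", "4.5.1", "4.6.0", "4.8.0",
            "5.2.2", "5.9.1", "5.11.0", "5.12.1"]


def split_by_cutoff(versions, cutoff):
    """Return (versions below the cut-off, versions at or above it), both sorted."""
    minimum = parse_version(cutoff)
    ordered = sorted(versions, key=parse_version)
    below = [v for v in ordered if parse_version(v) < minimum]
    at_or_above = [v for v in ordered if parse_version(v) >= minimum]
    return below, at_or_above


# Placeholder cut-off for illustration only; the real value is still undecided.
below, at_or_above = split_by_cutoff(REPORTED, "5.0.0")
print("below cut-off:", below)
print("at or above cut-off:", at_or_above)
```

Parsing into integer tuples matters here: a plain string sort would place 5.11.0 before 5.9.1.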

Patagonia121 avatar May 16 '24 19:05 Patagonia121

@Patagonia121 Thank you!

@noahtalerman I think the next step is to run load tests on each version to determine which are performant. That's going to take some resources, so we should bring this through estimation so we can plan it into sprint capacity accordingly.

lukeheath avatar May 16 '24 19:05 lukeheath

Sweet!

I added a user story to the issue description and started on the testing requirements.

@lukeheath I think this is less about performance and more about the features the customer uses continuing to work.

@ksatter does that sound right?

In order to bring this one through estimation, I think we want a list of the features that the customer uses (ex. scheduled queries / live queries).

This way, we know what feature we want to test.

@ksatter can you please help us get this list of features? Once we have that list, please pass this issue to me (add :product) so we can get the testing rolling.

I moved the original issue description below for safekeeping.


Fleet version: Fleet v4.30.0

💥  Actual behavior

After upgrading fleetd to v1.19.0, the customer observed that the online/offline host count was fluctuating.

The issue resolved itself briefly after restarting the MySQL database and Fleet instances, but hosts began to show as offline again over time.

When reviewing customer logs, I was able to identify that the customer has duplicate hosts that are continuously re-enrolling in Fleet. It is believed that this, coupled with added functionality in fleetd, overloaded the database.

I have requested a full breakdown of the customer environment so that we can work on relieving the immediate pressure and confirm this root-cause analysis.

To ultimately resolve these duplicate hosts, we will also need to get the customer upgraded to a version of Fleet that supports setting the host identifier used by fleetd and osquery, but the customer is concerned that they will run into issues with some of the older versions of fleetd that are present in their environment.

While getting all hosts up-to-date is the ultimate goal, there are some blockers to that process, and there are currently versions of fleetd prior to 1.0 checking in to Fleet.

I've checked to see if there are any anticipated issues with communication between these hosts and the current release, but do not have a solid answer on that.

While we continue working to gather additional information and resolve the immediate performance issues, I'm proposing that we run a test to confirm compatibility between all released versions of fleetd and the current release of Fleet.

🧑‍💻  Steps to reproduce

🕯️ More info (optional)

  1. Run Fleet v4.45.0 locally.
  2. Enroll hosts using every fleetd minor release from 0.0.9 to the current version.
  3. Confirm the host responds as expected to scheduled queries, live queries, and policy checks.
  4. If there are any anomalies, document what version of fleetd and what happened.

I have an archive available for review.

noahtalerman avatar May 17 '24 15:05 noahtalerman

cc @Patagonia121 ^^

noahtalerman avatar May 17 '24 17:05 noahtalerman

@noahtalerman

In order to bring this one through estimation, I think we want a list of the features that the customer uses (ex. scheduled queries / live queries).

Agent communication, live queries and query packs would be the key pieces

ksatter avatar May 28 '24 18:05 ksatter

Thanks for adding the features to test @ksatter!

Heads up that I updated the plan to test w/ Fleet 4.49.4. I think this means we'll recommend that the customer upgrade to 4.49.4 after testing.

Note that the customer wanted to upgrade to latest - 1 (N - 1).

Let me know if you have any concerns with that.

cc @zayhanlon

noahtalerman avatar May 29 '24 14:05 noahtalerman

Hey @zayhanlon FYI the testing for customer-erda is scheduled to start next sprint (2024-06-03 kickoff).

@sharon-fdm heads up, I just moved this story into the specified column. Can you please work w/ the team to get it estimated today?

noahtalerman avatar May 29 '24 14:05 noahtalerman

@ksatter FYI on timelines (eta for 4.52.0 is 6/21)

zayhanlon avatar May 29 '24 14:05 zayhanlon

Moved to Waiting, pending clarity on testing requirements.

jacobshandling avatar Jun 07 '24 20:06 jacobshandling

@ksatter @Patagonia121 when ready please approve the doc so we can close this.

sharon-fdm avatar Jun 17 '24 17:06 sharon-fdm

Sounds good @sharon-fdm, I'm going to sync with @ksatter today and we should be able to close this out after we discuss. Thanks!

Patagonia121 avatar Jun 17 '24 19:06 Patagonia121

@sharon-fdm this looks good, we can hand off this doc tomorrow to the customer during our regular call. Go ahead and close it out. Thanks!

Patagonia121 avatar Jun 17 '24 19:06 Patagonia121