Adding/removing Apple (macOS, iOS, iPadOS) profiles in the UI takes 15+ seconds

Open roperzh opened this issue 1 year ago • 17 comments

Fleet version: 4.55.0

Web browser and operating system:


💥  Actual behavior

@PezHub: I was able to test this in my current load test env built off 4.64 with 20K hosts:

  • adding profiles took 15-25secs (added 15 total)
  • deleting a profile took 15-35sec
  • I never saw a timeout.
  • CPU and memory utilization did spike during the tests but eventually waned

🧑‍💻  Steps to reproduce

  1. Add 20k hosts with Apple MDM turned on in a load test environment
  2. Add at least 10 profiles
  3. Try to add/remove a profile, observe how the request takes longer than expected or times out
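
To put a number on step 3, one option is to time the request from a script rather than by eye. The Go sketch below is only an illustration: `timeRequest`, `stubTransport`, and `demo` are hypothetical helpers, the stub transport stands in for a real Fleet server, and the request URL is a placeholder, not a documented Fleet endpoint.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// timeRequest sends req with client and returns how long the round trip
// took (including draining the response body) plus the HTTP status code.
func timeRequest(client *http.Client, req *http.Request) (time.Duration, int, error) {
	start := time.Now()
	resp, err := client.Do(req)
	if err != nil {
		return time.Since(start), 0, err
	}
	defer resp.Body.Close()
	_, _ = io.Copy(io.Discard, resp.Body)
	return time.Since(start), resp.StatusCode, nil
}

// stubTransport is a hypothetical stand-in for a real server: it waits
// for delay and then answers 200 OK without touching the network.
type stubTransport struct{ delay time.Duration }

func (s stubTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	time.Sleep(s.delay)
	return &http.Response{
		StatusCode: http.StatusOK,
		Header:     make(http.Header),
		Body:       http.NoBody,
		Request:    req,
	}, nil
}

// demo times one POST against the stub. The URL is a placeholder only.
func demo(delay time.Duration) (time.Duration, int, error) {
	client := &http.Client{Transport: stubTransport{delay: delay}}
	req, err := http.NewRequest(http.MethodPost, "http://fleet.invalid/placeholder/profiles", nil)
	if err != nil {
		return 0, 0, err
	}
	return timeRequest(client, req)
}

func main() {
	elapsed, status, err := demo(50 * time.Millisecond)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Printf("status %d in %v\n", status, elapsed.Round(time.Millisecond))
}
```

Against a real environment, the stub client would be swapped for a normal `http.Client` and an authenticated request to the instance under test.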

🕯️ More info

From 2024-08-14: principal culprit seems to be:

https://github.com/fleetdm/fleet/blob/16d6757681a1e41f228eec798c4c0f0293b7cf0c/server/datastore/mysql/apple_mdm.go#L1777-L1782

🛠️ To fix

@marko-lisica: Should be tested against 30k hosts (our largest deployment). Test by uploading 15 profiles.

Aug 15 '24 15:08 roperzh

Heads up, this will likely be bigger than 2

Aug 29 '24 19:08 gillespi314

Related: https://github.com/fleetdm/fleet/issues/23816

Nov 14 '24 20:11 nonpunctual

QA Notes:

Completed load testing with slightly improved results but will revisit after the holidays when additional engineers are available to look.

Slack convo here

Nov 23 '24 01:11 PezHub

@dantecatalfamo We may be stepping on each other since I'm trying to fix some issues with my current load test.

I'll let you fix the issue with batch delete. Each batch should run in its own transaction -- running 60k hosts in 1 transaction is a no-go. image
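
As a rough illustration of the "one transaction per batch" idea, here is a minimal Go sketch using the standard `database/sql` package. It is not Fleet's implementation: `chunk` and `deleteForHosts` are hypothetical helpers, and the table and column names are invented for the example.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
	"strings"
)

// chunk splits ids into consecutive slices of at most size elements.
func chunk(ids []uint, size int) [][]uint {
	if size <= 0 {
		return nil
	}
	var out [][]uint
	for start := 0; start < len(ids); start += size {
		end := start + size
		if end > len(ids) {
			end = len(ids)
		}
		out = append(out, ids[start:end])
	}
	return out
}

// deleteForHosts removes rows for hostIDs, committing each batch in its
// own transaction so that no single transaction covers every host.
// The table and column names are hypothetical, not Fleet's schema.
func deleteForHosts(ctx context.Context, db *sql.DB, hostIDs []uint, batchSize int) error {
	if batchSize <= 0 {
		return errors.New("batchSize must be positive")
	}
	for _, batch := range chunk(hostIDs, batchSize) {
		// Build "?,?,?" with one placeholder per host ID in this batch.
		placeholders := strings.TrimSuffix(strings.Repeat("?,", len(batch)), ",")
		args := make([]any, len(batch))
		for i, id := range batch {
			args[i] = id
		}
		tx, err := db.BeginTx(ctx, nil)
		if err != nil {
			return fmt.Errorf("begin batch: %w", err)
		}
		query := "DELETE FROM example_host_profiles WHERE host_id IN (" + placeholders + ")"
		if _, err := tx.ExecContext(ctx, query, args...); err != nil {
			_ = tx.Rollback()
			return fmt.Errorf("delete batch: %w", err)
		}
		if err := tx.Commit(); err != nil {
			return fmt.Errorf("commit batch: %w", err)
		}
	}
	return nil
}

func main() {
	// Only the batching is exercised here; deleteForHosts needs a real *sql.DB.
	ids := make([]uint, 60000)
	for i := range ids {
		ids[i] = uint(i + 1)
	}
	batches := chunk(ids, 1000)
	fmt.Printf("%d hosts -> %d batches, each in its own transaction\n", len(ids), len(batches))
}
```

The batch size of 1000 is arbitrary. Smaller batches hold locks for less time, but a failure partway through leaves earlier batches committed, so the caller has to tolerate partial progress.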

My loadtest branch: https://github.com/fleetdm/fleet/pull/24338

Dec 04 '24 00:12 getvictor

As discussed in standup, we're going to hold off on changes related to Victor's comment for the time being and will address it in an upcoming sprint, possibly in conjunction with the unified queue work. See related comment.

Note that Dante's PR, which improves performance when adding/removing profiles in the UI, should be included in 4.61.0.

Dec 10 '24 18:12 gillespi314

Waiting for new activity queue to be in before revisiting this issue.

Jan 14 '25 15:01 getvictor

  1. Add a significant number of hosts with MDM turned on (doesn't need to be 30k to see the impact)
  2. Add at least 10 profiles
  3. Try to add/remove a profile, observe how the request takes longer than expected or times out

Note that Dante's https://github.com/fleetdm/fleet/pull/23772, which improves performance when adding/removing profiles in the UI, should be included in 4.61.0.

@georgekarrv did we still see long load times or timeouts after @dantecatalfamo's improvement? If no, I think we can close this bug.

If we did, please let us know how long the load times were.

Thanks!

Feb 12 '25 15:02 noahtalerman

Hi @noahtalerman, I was able to test this in my current load test env built off 4.64 with 20K hosts and I can confirm that I'm seeing significant improvements:

  • adding profiles took 15-25secs (added 15 total)
  • deleting a profile took 15-35sec
  • I never saw a timeout.
  • CPU and memory utilization did spike during the tests but eventually waned

I think it's ok to close this ticket.

Update: I'm going to create a new ticket to continue making improvements in this area when additional hosts (more than 20K) and profiles (more than 10) are added, and also when moving large numbers of hosts from one team to another, with each team having at least 10 unique profiles.

Update 2: will actually just keep this ticket open and include additional metrics in the comments below

Feb 12 '25 18:02 PezHub

@PezHub thanks! 15-25 seconds is too long. We want all actions in Fleet to take less than 5 seconds.

I updated the issue description with your findings and moved this one to "Ready to estimate"

cc @marko-lisica

Feb 12 '25 20:02 noahtalerman

understood and good to know that ~5secs is our target! I'll keep that in mind for all future load tests. thanks!

Feb 12 '25 21:02 PezHub

Additional Metrics when Profiles are added/deleted on a team with 20K hosts

Fleet Service Image

DB Writer Image

DB Reader Image

Feb 13 '25 18:02 PezHub

Hey team! Please add your planning poker estimate with Zenhub @getvictor @ghernandez345 @mna

Mar 12 '25 16:03 georgekarrv

This should sit behind adding mdm commands to the unified queue so we don't duplicate work

Mar 24 '25 17:03 georgekarrv

Hey @PezHub when you get the chance, can you please run the same tests with 6.1k hosts? I assigned the bug to you.

Is it over 5 seconds?

Mar 25 '25 19:03 noahtalerman

Hey @noahtalerman, I was able to rerun the tests in the following env:

  • Fleet v4.66.0
  • Host count = 6.5K total (5,201 mac, 650 Win, 650 Linux)
  • Profile count = MDM 15, DDM 3, Windows 3

*Note: DDM and Windows profiles have always uploaded/deleted quickly; the main objective was to test MDM config profiles.

Results are def better than with 20K hosts. The average time to upload a .mobileconfig file is ~5sec and a little bit less to delete them.

Here's a short video showing the UI workflow

Apr 02 '25 00:04 PezHub

I updated the count to 6K macOS hosts just to make sure it matched your original request and I'm still seeing the same results @ ~5-6secs

Image

I kept increasing the host count by 1.5K and it appears that every time we jump by ~1K hosts it adds a second to the load time:

  • 7-8K hosts = ~7-8sec
  • 9-10K hosts = ~10sec
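
Taken at face value, those numbers suggest a rough linear rule of thumb of about one second per 1K hosts. The Go sketch below just spells out that arithmetic. The constant is an assumption read off the measurements in this thread, not a measured model, and `estimateSeconds` and `maxHostsWithin` are hypothetical helper names.

```go
package main

import "fmt"

// secondsPerThousandHosts is an assumption taken from the rough
// observation in this thread (about one extra second per ~1K hosts).
const secondsPerThousandHosts = 1.0

// estimateSeconds returns the estimated add/remove time for a host count.
func estimateSeconds(hosts int) float64 {
	return float64(hosts) / 1000 * secondsPerThousandHosts
}

// maxHostsWithin returns roughly how many hosts fit in a time budget.
func maxHostsWithin(budgetSeconds float64) int {
	return int(budgetSeconds / secondsPerThousandHosts * 1000)
}

func main() {
	for _, hosts := range []int{6000, 8000, 10000, 20000} {
		fmt.Printf("%5d hosts -> ~%.0fs\n", hosts, estimateSeconds(hosts))
	}
	fmt.Printf("hosts within a 5s budget: ~%d\n", maxHostsWithin(5))
}
```

Under this assumption, the 5-second target mentioned earlier in the thread corresponds to roughly 5K hosts in this environment.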

Apr 02 '25 00:04 PezHub

Thanks @PezHub! FYI @pintomi1989 bringing this to product office hours to discuss.

Apr 02 '25 19:04 noahtalerman

@noahtalerman approved by customer-preston to close this out

Apr 17 '25 14:04 zayhanlon

Apple profiles load slow, Fleet finds path through data flow, Swift as river's current go.

Apr 17 '25 14:04 fleet-release