See individual profile status & resend profile to all failed hosts
Goal
| User story |
|---|
| As a Client Platform Engineer (CPE), |
| I want to see which hosts a specific configuration profile failed on and resend this profile to all failed hosts |
| so that I don't have to visit 1k hosts' Host details pages to check and select Resend. |
Key result
Customer request
Original requests
- #25494
Changes
Product
- [ ] When a configuration profile is deleted, that profile is canceled for all pending hosts. ~~When a profile is edited, via GitOps, the old version of the profile is canceled. When a host is transferred to a new team, all pending configuration profiles are canceled.~~
- UPDATE: @noahtalerman: Later we can follow up when we get to editing profiles via UI/API: #23402
- [ ] UI changes: Figma here
- [ ] CLI (fleetctl) usage changes: No changes
- [ ] YAML changes: No changes
- [ ] REST API changes: PR here
- [ ] Fleet's agent (fleetd) changes: No changes
- [ ] GitOps mode changes: No changes
- [ ] Activity changes: PR here
- [ ] Permissions changes: Only maintainers and admins (global and team-level) can resend configuration profiles.
- [ ] Changes to paid features or tiers: Fleet Free and Fleet Premium
- [ ] Transparency changes: No changes
- [ ] First draft of test plan added
- [ ] Other reference documentation changes: No changes
- [ ] Once shipped, requester has been notified
- [ ] Once shipped, dogfooding issue has been filed
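The permissions rule in the checklist above (only maintainers and admins, global or team-level, can resend) can be sketched as a small check. This is a hypothetical illustration, not Fleet's authorization code: the function name, the role strings, and the rule that "No team" profiles require a global role are all assumptions.

```python
from typing import Dict, Optional

# Roles allowed to resend, per the "Permissions changes" item above.
ALLOWED_ROLES = {"admin", "maintainer"}


def can_resend_profile(
    global_role: Optional[str],
    team_roles: Dict[int, str],
    profile_team_id: Optional[int],
) -> bool:
    """Return True if the user may resend a profile owned by profile_team_id.

    Hypothetical helper: names and data shapes are illustrative only.
    """
    # A global admin or maintainer can resend any profile.
    if global_role in ALLOWED_ROLES:
        return True
    # Assumption: a "No team" profile (team id None) needs a global role.
    if profile_team_id is None:
        return False
    # Otherwise the user needs an allowed role on the profile's team.
    return team_roles.get(profile_team_id) in ALLOWED_ROLES


if __name__ == "__main__":
    print(can_resend_profile("observer", {}, 1))       # False
    print(can_resend_profile(None, {1: "maintainer"}, 1))  # True
```

The observer case corresponds to the test plan item that expects a permissions error from the batch resend endpoint.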
Engineering
- [x] Test plan is finalized
- [ ] Feature guide changes: https://github.com/fleetdm/fleet/issues/28764
- [ ] Database schema migrations: N/A
- [ ] Load testing: Will be done in QA
ℹ️ Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".
QA
Risk assessment
- Requires load testing: Yes, test the new "List hosts" filter with many hosts, and test resend to a big batch of hosts (should be similar in delay to adding a new custom profile to a team with many hosts)
- Risk level: High
- Risk description: This could accidentally add/remove profiles which could potentially place hosts in an undesirable state.
Test plan
Make sure to go through the list and consider all events that might be related to this story, so we catch edge cases earlier.
- [x] Ensure there's a checkmark icon on hover over configuration profile rows. Selecting that icon opens the status modal. ✔
- [x] The status modal shows "Resend" in the "Failed" row if at least one host failed. "Resend" is hidden if no hosts failed the profile. ✔
- [x] Select host counts in the status modal and verify that you're navigated to the Hosts page with a configuration profile status filter. ✔
- [x] Select the dropdown next to the configuration profile status filter. When you select each option (verified, verifying, pending, and failed), check to make sure that the list of hosts and host count is correct. ✔
- [x] In the status modal, select "Resend" and then select "Resend" in the resend modal. Verify that the "Pending" hosts count increases and "Failed" is set to 0 ("---"). Verify that the profile is actually re-sent to the hosts on the Host details page. Then, check the actual host (System Settings > Device Management) ✔
- [x] When resending a profile, verify that you get an error toast message if the resend request fails. ✔
- [x] After resending a profile, check the global activity feed. ✔
- [x] Add a configuration profile to Fleet with 2 hosts offline so that the profile goes to "Pending". Select the delete icon for that profile. Verify that the copy in the delete modal is updated. ✔
- [x] Select "Delete" and verify that the rolled up "Pending" count decreases by 2. Check these hosts' Host details > OS settings modals. Verify that the profile doesn't show up in that modal. ✔
- [x] Open the status modal for a declaration (DDM) profile. Make a DDM profile fail on at least 1 host. Verify that the Resend button doesn't appear in the "Failed" row. ✔
- [x] Test the resend flow for both macOS and Windows configuration profiles. ✔
- [x] Hit the `POST /configuration_profiles/resend/batch` endpoint as a user with the observer role and verify that you get a permissions error. ✔
- [x] Confirm you see an error when trying to resend a DDM profile (not supported) via API
- [ ] ~~Add a profile to Team A and then transfer a host from Team A to Team B. Verify that the profile that was "Pending" when the host was on Team A no longer appears on that host's Host details page. Also verify that the profile isn't delivered to the host. Check the actual host (System Settings > Device Management).~~ "Immediate" cancellation on team transfer is not implemented yet, profiles will still eventually be consistent after the reconcile profiles cron job runs.
- [ ] ~~Turn off MDM for one host. Ensure that "Pending" profiles are canceled. "Pending" counts decrease. Profiles don't appear on Host details page. Check the actual host (System Settings > Device Management).~~ Not sure what this test was for, but nothing has changed in this story regarding what happens when you turn off MDM, "Immediate" cancellation for that scenario is not implemented yet, profiles will still eventually be consistent after the reconcile profiles cron job runs.
- [ ] ~~Test adding and deleting labels that are used for targeting configuration profiles.~~ "Immediate" cancellation on editing profiles is not implemented yet, profiles will still eventually be consistent after the reconcile profiles cron job runs.
- [x] Test Disk Encryption profile behavior (disk encryption is not a custom profile, so its behavior is unchanged - cannot bulk-resend, cannot delete to cancel) ✔
- [x] When a configuration profile is deleted, the profile is canceled (InstallProfile MDM command is disabled at nano level) on all hosts where the profile is "pending" (we haven’t gotten an “ack”) ✔
- [x] When a host has a profile canceled, the profile doesn’t show up on the Host details > OS settings table, host vitals API, and the fleetctl get mdm-commands results. The “pending” status count on the Controls > OS settings page also decreases ✔
- [x] The profile is removed (send the “RemoveProfile” command) on all hosts where the profile is "verifying" or "verified." For these hosts, confirm that the profile has a "Removing enforcement (pending)" status. This update will happen immediately (before the profile reconciliation job runs). ✔ (it also changes to "Removing (pending)" status if the install was failed- since we cannot guarantee it wasn't installed previously - or if the status was Pending-Not-Null - something we can see only in the DB - as for that case it's unknown if the install command was already sent to the host or not)
- [ ] Edge case: InstallProfile command has been received but we didn’t get an “ack”. What happens?
  - [ ] Fleet still cancels the profile (disables the InstallProfile command) ✔
  - [ ] Later, nano records the “ack” and Fleet sends a RemoveProfile command. The host’s Host details page shows “Removing enforcement (pending)” ✔ (Fleet immediately sends the "RemoveProfile" on profile deletion if there's a possibility that the "Install" was sent, but the "remove failure" - if any - will be ignored, so that it doesn't stay in the list of host profiles with a "remove failed" error)
- [x] When managing configuration profiles via GitOps, if a profile is deleted (aka removed), the profile is canceled. Follow the same steps as above. ✔
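The deletion behavior described in the items above can be summarized as a per-host decision. The sketch below is one reading of those test plan notes, not Fleet's implementation: the type and function names are hypothetical, and `None` stands in for a profile whose status is still NULL in the database (pending, install not yet sent).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class InstallStatus(Enum):
    """Stored install status of a profile on one host (hypothetical names)."""
    PENDING = "pending"      # "Pending-Not-Null": install may already have been sent
    VERIFYING = "verifying"
    VERIFIED = "verified"
    FAILED = "failed"


@dataclass(frozen=True)
class DeletionAction:
    cancel_install: bool         # disable the queued InstallProfile command
    send_remove: bool            # enqueue a RemoveProfile command
    ignore_remove_failure: bool  # do not surface a failed removal on the host


def on_profile_deleted(status: Optional[InstallStatus]) -> DeletionAction:
    """Decide what to do for one host when its profile is deleted."""
    if status is None:
        # Nothing has reached the host yet: cancel, and the profile
        # simply disappears from the host's OS settings.
        return DeletionAction(cancel_install=True, send_remove=False,
                              ignore_remove_failure=False)
    if status is InstallStatus.PENDING:
        # The install may or may not have been delivered: cancel it and
        # also send a removal, ignoring a removal failure.
        return DeletionAction(cancel_install=True, send_remove=True,
                              ignore_remove_failure=True)
    # Verifying, verified, or failed: the profile may be on the host,
    # so it is removed ("Removing enforcement (pending)").
    return DeletionAction(cancel_install=False, send_remove=True,
                          ignore_remove_failure=False)
```

Under this reading, only the NULL-status case avoids a RemoveProfile command, which matches the note that a failed install is also removed because a previous install cannot be ruled out.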
Testing notes
Confirmation
- [ ] Engineer: Added comment to user story confirming successful completion of test plan.
- [ ] QA: Added comment to user story confirming successful completion of test plan.
For QA (@PezHub ): I did run most of the test plan with the branch of my last PR, I have yet to QA the hosts filter (GabeH is working on it and will merge his PR shortly, I'll finish this up tomorrow probably).
Just a few heads-up: I recorded a bunch of (confusing) videos in https://drive.google.com/drive/folders/1GiEkuGUKSXi1cAVqKN3hmA0fg6BZgUue (look for the "brp-" prefix).
- Using `osquery-perf` to generate more hosts than the real physical one(s) can cause some issues due to its sometimes unexpected behaviour (e.g. for Windows its MDM enrollment is done later, after the first ping to Fleet, so they all need to be manually refetched so that MDM information is up-to-date, which fixes the counts in the profiles' status window).
- I had manually changed the reconcile profiles cron jobs to run every 24h, and instead used `fleetctl trigger`, but it seemed to not always run the reconciliation, which caused yet other issues that looked like bugs (e.g. the status counts of profiles after a `fleetctl gitops` that added new profiles). With the standard 30s interval for the cron, it updated the counts as expected after the cron job ran.
If you see things that seem confusing (I certainly did...), don't hesitate to ping me. It may very well be bugs, but it may also be working as expected - the confusing thing is that we added "immediate status update" when deleting a profile, but any other change (e.g. adding a profile) requires the reconciliation to happen for the counts/statuses to reflect reality.
With those warnings out of the way, it did seem to behave correctly in my tests, although some things were surprising and took me a while to grasp/recognize as correct, even knowing the internal logic.
@PezHub One more thing to keep in mind is that fleetctl gitops does not immediately set the hosts to pending on new/updated profiles, it only updates statuses after the cron profile reconciliation job runs (it's unrelated to the BRP changes, but a bug that we've had for 2 years, for context: https://fleetdm.slack.com/archives/C03C41L5YEL/p1747836411007729)
@mna verified the two test plan items for host filtering. moving to QA now
QA Update:
Test plan complete but still need to load test
Load Test QA results - Round 1
Setup -
- Tested with 5k hosts on a team with ~20 config profiles for Windows and macOS.
- Simulated failed profile deployments in order to test the batch resend.
Results for Windows -
Resending a profile to ~300 hosts:
- The modal updated as expected with the failed count moving to pending.
- The profiles were redeployed and out of pending rather quickly, in under 20 secs.
Metrics in AWS @ 17:30 did not show any significant spikes in CPU or Memory usage for the fleet server or the host containers
(Note: the spike prior to 17:30 is expected and due to the hourly cron jobs running)
Fleet Server
Hosts Containers
Results for macOS -
Resending a profile to ~1000 hosts:
- The modal updated as expected with the failed count moving to pending.
- The profiles were redeployed and out of pending in ~1 min.
Metrics in AWS @ 17:50 did not show any significant spikes in CPU or Memory usage for the fleet server or the host containers
(Note: the spike prior to 17:30 is expected and due to the hourly cron jobs running)
Fleet Server
Hosts Containers
*I also spot checked the mysql reader/writer metrics for both platform tests and all looked good.
Load Test QA results - Round 2 (twice as many hosts and failed profiles)
Setup -
- Tested with 10k hosts on a team with ~20 config profiles for Windows and macOS.
- Simulated failed profile deployments in order to test the batch resend.
Results for Windows -
Resending a profile to ~500 hosts:
- The modal updated as expected with the failed count moving to pending.
- The profiles were redeployed and out of pending in under 1 min.
- Metrics in AWS @ 01:45 did not show any significant spikes in CPU or Memory usage for the fleet server or the host containers
Fleet Server
Host Containers
Results for macOS -
Resending a profile to ~2000 hosts:
- The modal updated as expected with the failed count moving to pending.
- The profiles were redeployed and out of pending in ~1 min, 15 sec.
- Metrics in AWS @ 01:55-59 did not show any significant spikes in CPU or Memory usage for the fleet server or the host containers
Fleet Server
Host Containers
Summary of load tests:
- 5k hosts:
  - Windows: 300 profiles took about 20 secs to resend
  - macOS: 1000 profiles took about 1 min to resend
- 10k hosts:
  - Windows: 500 profiles took about 1 min to resend
  - macOS: 2000 profiles took about 1:15 to resend
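For a rough comparison across rounds, the summary figures can be turned into approximate resend rates. This is only a back-of-the-envelope sketch: the durations were reported as "about" or "under" a value, so the rates are coarse, and the helper names are made up for illustration.

```python
# Approximate figures from the load test summary:
# (hosts on team, platform, hosts resent, seconds until out of pending)
RUNS = [
    ("5k", "Windows", 300, 20),
    ("5k", "macOS", 1000, 60),
    ("10k", "Windows", 500, 60),
    ("10k", "macOS", 2000, 75),
]


def hosts_per_second(hosts: int, seconds: float) -> float:
    """Average number of hosts leaving "Pending" per second."""
    return hosts / seconds


def throughput_table(runs):
    """Return (team size, platform, rounded hosts/sec) for each run."""
    return [
        (size, platform, round(hosts_per_second(hosts, seconds), 1))
        for size, platform, hosts, seconds in runs
    ]


if __name__ == "__main__":
    for size, platform, rate in throughput_table(RUNS):
        print(f"{size} hosts, {platform}: ~{rate} hosts/sec")
```

Because the reported times are rounded and partly upper bounds, differences between these rates should not be read as precise performance measurements.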
Additional Metrics for Round 2 of load testing
Timespan to watch is between 01:40 - 2:00
Container Insights
mySQL Reader
mySQL Writer
Resend profiles swift, No need to check each host, Efficiency, a gift.