
Upcoming activities stuck in queue

Open rachaelshaw opened this issue 8 months ago • 4 comments

Fleet version: 4.65.0


💥  Actual behavior

I tried to install self-service software about a month ago, and it never installed. After that, any software installs or script runs added to the queue haven't moved.


🧑‍💻  Steps to reproduce

Not entirely sure how to reproduce this, but here's a host it's happening on: https://dogfood.fleetdm.com/hosts/519

🕯️ More info (optional)

N/A

🛠️ To fix

Product designer: @marko-lisica Understand why activities are stuck and resolve it so it doesn't happen anymore.

rachaelshaw avatar Mar 26 '25 22:03 rachaelshaw

Hey team! Please add your planning poker estimate with Zenhub @getvictor @ghernandez345 @gillespi314 @mna

georgekarrv avatar Apr 02 '25 16:04 georgekarrv

I'm escalating to a P1 because from the user's perspective there are items stuck in pending and the workflow is broken.

@marko-lisica @georgekarrv @mna Would y'all please work together to see if we can get a fix in for this next week and included in 4.67.0? Thanks!

lukeheath avatar Apr 04 '25 18:04 lukeheath

I'll see what we can bump out for 3sp but I think we can. Also Cancel is being developed atm so this should at least be less impactful w/ cancel released.

georgekarrv avatar Apr 04 '25 18:04 georgekarrv

Starting investigation on this, some notes:

  • The stuck activity is a VPP app install
  • It was enqueued quite some time before the upcoming activities queue was implemented, at 2024-11-12 22:35:08
  • It shows up as "a month ago" because that's when it got migrated to the new unified queue (on 2025-02-26 21:20:35.545269), but it had been stuck since November
  • In nano_cert_auth_associations, there is a cert renewal shortly after the VPP app was enqueued, created on 2024-11-19 22:36:29, and the previous cert entry had already expired well before that (cert not valid after: 2024-04-02)
  • The MDM command is now inactive, and the timestamp of its last update matches exactly the timestamp the cert was renewed:
mysql> select * from nano_enrollment_queue where command_uuid = '84e25621-faa6-4af4-9a2a-67455dcbf448';
+--------------------------------------+--------------------------------------+--------+----------+----------------------------+----------------------------+
| id                                   | command_uuid                         | active | priority | created_at                 | updated_at                 |
+--------------------------------------+--------------------------------------+--------+----------+----------------------------+----------------------------+
| 9BDD6D41-07FA-5E69-823F-1ABA5BFC5174 | 84e25621-faa6-4af4-9a2a-67455dcbf448 |      0 |        0 | 2024-11-12 22:35:08.263017 | 2024-11-19 22:36:29.110874 |
+--------------------------------------+--------------------------------------+--------+----------+----------------------------+----------------------------+

All this to say, it looks very much like this VPP app command (MDM) is stuck due to having been created when the cert was expired, and on cert renewal those old commands got deactivated.
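
A minimal sketch of the timeline reasoning above, using only the timestamps quoted in these notes. The function names and the one-second tolerance are assumptions made for illustration; this is not Fleet code and it does not query the database.

```python
from datetime import datetime

# Timestamps copied from the investigation notes above.
CERT_NOT_VALID_AFTER = datetime(2024, 4, 2)  # previous cert expiry
COMMAND_CREATED_AT = datetime.fromisoformat("2024-11-12 22:35:08.263017")
COMMAND_UPDATED_AT = datetime.fromisoformat("2024-11-19 22:36:29.110874")
CERT_RENEWED_AT = datetime.fromisoformat("2024-11-19 22:36:29")


def enqueued_while_cert_expired(created_at, not_valid_after, renewed_at):
    """True if the command was created after the old cert expired
    and before the renewed cert entry was created."""
    return not_valid_after < created_at < renewed_at


def deactivated_at_renewal(updated_at, renewed_at, tolerance_seconds=1.0):
    """True if the command's last update falls within the (assumed)
    tolerance of the cert renewal timestamp."""
    return abs((updated_at - renewed_at).total_seconds()) <= tolerance_seconds


if __name__ == "__main__":
    print(enqueued_while_cert_expired(
        COMMAND_CREATED_AT, CERT_NOT_VALID_AFTER, CERT_RENEWED_AT))
    print(deactivated_at_renewal(COMMAND_UPDATED_AT, CERT_RENEWED_AT))
```

Both checks hold for the values quoted here, which is consistent with the command having been enqueued during the expired-cert window and deactivated at renewal.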

cc @georgekarrv

mna avatar Apr 09 '25 20:04 mna

A bunch of Mac hosts got unenrolled from MDM in November 2024 (probably because the SCEP cert was renewed after it had already expired):

mysql> select created_at, activity_type, details->'$.host_display_name' from activities where activity_type = 'mdm_unenrolled' and created_at between '2024-11-01' and '2024-12-01' order by created_at desc limit 5;
+----------------------------+----------------+---------------------------------+
| created_at                 | activity_type  | details->'$.host_display_name'  |
+----------------------------+----------------+---------------------------------+
| 2024-11-29 00:25:20.923310 | mdm_unenrolled | "Lucas’s MacBook Pro"           |
| 2024-11-28 22:55:34.049316 | mdm_unenrolled | "MacBookPro16,2 (C02G90U2ML85)" |
| 2024-11-27 22:05:16.314309 | mdm_unenrolled | "Dale’s MacBook Pro"            |
| 2024-11-20 18:46:22.885064 | mdm_unenrolled | "Harrison’s iPhone"             |
| 2024-11-20 18:44:09.641478 | mdm_unenrolled | "Rachael’s MacBook Pro"         |
+----------------------------+----------------+---------------------------------+
5 rows in set (0.05 sec)

This was some months before the unified queue was implemented, but if that happened now, the VPP app install command would be automatically removed as part of unenrolling from MDM: https://github.com/fleetdm/fleet/pull/26816

And of course, the cert renewal should normally happen before expiration.

So there is nothing to do to fix this issue - this is something that should not happen anymore. And to unblock this specific case, the Cancel Upcoming Activity story https://github.com/fleetdm/fleet/issues/25540 will allow cancelling this stuck install and unblocking the rest of the queue (once it gets deployed to dogfood). /cc @rachaelshaw

I will close this issue.

mna avatar Apr 14 '25 14:04 mna

Queues unjam, tasks flow, In cloud city, systems grow, Fleet's progress in tow.

fleet-release avatar Apr 14 '25 14:04 fleet-release

(Setting milestone back to 4.67.0 as it was closed during this sprint)

mna avatar Apr 14 '25 16:04 mna

Now that we can cancel upcoming activities, I cancelled the install that was stuck, and the rest of the pending actions went through after that 👍

rachaelshaw avatar Apr 25 '25 22:04 rachaelshaw