Unable to reinstall Fleet agent on Windows after unenrolling and removing the agent via API

Open pintomi1989 opened this issue 1 year ago • 4 comments

Fleet version: v4.49.3

Operating system: Windows


💥  Actual behavior

Unable to reinstall the Fleet agent on Windows devices after unenrolling the device via the Fleet API (using https://github.com/fleetdm/fleet/blob/main/scripts/mdm/windows/windows-unenroll-mdm.ps1) and then removing the Fleet agent via the API (using https://github.com/fleetdm/fleet/blob/main/scripts/mdm/windows/windows-remove-fleetd.ps1).

🧑‍💻  Steps to reproduce

Install the Windows Fleet agent from a .msi built with fleetctl: fleetctl package --type="msi" --enable-scripts --fleet-desktop --fleet-url="xxx" --end-user-email="[email protected]" --enroll-secret="zzz"

Unenroll the device with fleet API by running the following script : https://github.com/fleetdm/fleet/blob/main/scripts/mdm/windows/windows-unenroll-mdm.ps1

Remove the fleet agent with fleet API by running the following script : https://github.com/fleetdm/fleet/blob/main/scripts/mdm/windows/windows-remove-fleetd.ps1

Try to reinstall the Fleet agent with the .msi. No error is reported during the installation.

After the installation, none of C:\Program Files\Orbit, C:\Program Files\Fleet, or C:\Program Files\FleetDM exists, and the Fleet agent doesn't appear in the touch bar (the msi was built with the Fleet Desktop option).

pintomi1989 avatar May 22 '24 11:05 pintomi1989

I can confirm the reproduction steps work for me. I followed them pretty much exactly, providing my own fleet url and enroll secret (using the user's email as in the repro steps). I used the UI to run the unenroll and remove scripts. Reinstalling the msi afterwards seems to succeed, but does not start orbit nor fleet desktop (verified via task manager). Orbit logs show something like this on reinstall attempts (re-running the msi):

2024-05-22T09:49:00-07:00 INF deleting enroll secret file: C:\Program Files\Orbit\secret.txt
2024-05-22T09:49:01-07:00 INF hash(orbit)=<hash>
2024-05-22T09:49:02-07:00 INF hash(osqueryd)=<hash>
2024-05-22T09:49:02-07:00 INF hash(desktop)=<hash>
2024-05-22T09:49:02-07:00 INF orbit version: 1.24.0
2024-05-22T09:49:03-07:00 INF Found osquery version: 5.12.1
2024-05-22T09:49:12-07:00 INF Service Stop Requested

Tested from Fleet v4.49.3.

Note that running the removal script keeps the script execution activity in the "upcoming" list even after it has run (might be unavoidable since the goal of the script is to remove the agent, so it cannot report its results).

mna avatar May 22 '24 14:05 mna

@mna The customer also reported this behavior without running the removal script through Fleet. I'm looking for some more information about that scenario now.

ksatter avatar May 22 '24 15:05 ksatter

@mna Can you take a look at the logs here? It looks to me like on this host, prior to uninstalling, there may have been multiple instances of fleetd running, one auto-deployed from MDM and one locally installed.

2024-05-13T15:40:20+02:00 WRN failed to get C:\Program Files\Orbit\bin\osqueryd\windows\stable\osqueryd.exe version: : exec: "C:\\Program Files\\Orbit\\bin\\osqueryd\\windows\\stable\\osqueryd.exe": file does not exist
2024-05-13T15:40:20+02:00 INF update detected target=osqueryd
2024-05-13T15:40:23+02:00 INF early update check failed error="update osqueryd: get binary: getting target: download \"osqueryd/windows/stable/osqueryd.exe\": move download: rename C:\\Program Files\\Orbit\\staging\\osqueryd.exe C:\\Program Files\\Orbit\\bin\\osqueryd\\windows\\stable\\osqueryd.exe: The process cannot access the file because it is being used by another process."
2024-05-13T15:40:27+02:00 INF get osqueryd target failed error="getting target: download \"osqueryd/windows/stable/osqueryd.exe\": move download: rename C:\\Program Files\\Orbit\\staging\\osqueryd.exe C:\\Program Files\\Orbit\\bin\\osqueryd\\windows\\stable\\osqueryd.exe: The process cannot access the file because it is being used by another process."
2024-05-13T15:41:01+02:00 INF Service Interrogate Requested

ksatter avatar May 22 '24 15:05 ksatter

@ksatter thanks! Heads-up though that I just validated the reproduction steps but am not currently working on this. I know that @jahzielv also managed to reproduce it and can possibly investigate further if needed (though I believe he is also busy with a P2 at the moment).

mna avatar May 22 '24 18:05 mna

I was able to reproduce with just the windows-remove-fleetd.ps1 script (i.e. without enrolling the host in MDM and running the unenroll MDM script). Even after a reboot of the Windows machine, installing Fleetd briefly starts the orbit process (and also briefly shows Fleet Osquery in the installed Apps) before removing it and stopping the process.

The logs always look something like this:

2024-06-05T09:16:40-07:00 INF killing any pre-existing fleet-desktop instances
2024-06-05T09:16:40-07:00 INF opening path="C:\\Program Files\\Orbit\\bin\\desktop\\windows\\stable\\fleet-desktop.exe"
2024-06-05T09:16:42-07:00 INF Service Stop Requested
2024-06-05T09:16:42-07:00 ERR interrupt serviceChecker error="os service stop request"
2024-06-05T09:16:42-07:00 ERR interrupt updater error="os service stop request"
2024-06-05T09:16:42-07:00 ERR interrupt config receivers error="os service stop request"
2024-06-05T09:16:42-07:00 ERR interrupt osquery error="os service stop request"
2024-06-05T09:16:42-07:00 ERR interrupt osquery extension error="os service stop request"
2024-06-05T09:16:42-07:00 ERR SignalProcessBeforeTerminate error="get process: process not found"
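
To spot this pattern quickly across hosts, here is a rough sketch that measures how long orbit ran before a "Service Stop Requested" line shows up. It assumes the log layout seen in the excerpts in this thread (an RFC 3339 timestamp, a level, then the message); the function name `seconds_until_stop` is made up for illustration and is not part of fleetd.

```python
from datetime import datetime


def seconds_until_stop(log_text):
    """Return seconds between the first log line and the first
    'Service Stop Requested' line, or None if no stop was logged.

    Assumes each line looks like: <timestamp> <LEVEL> <message>.
    """
    first_ts = None
    for line in log_text.splitlines():
        parts = line.strip().split(" ", 2)
        if len(parts) < 3:
            continue  # not a log line in the assumed layout
        try:
            ts = datetime.fromisoformat(parts[0])
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if first_ts is None:
            first_ts = ts
        if parts[2].startswith("Service Stop Requested"):
            return (ts - first_ts).total_seconds()
    return None


sample = """\
2024-06-05T09:16:40-07:00 INF killing any pre-existing fleet-desktop instances
2024-06-05T09:16:42-07:00 INF Service Stop Requested
"""
print(seconds_until_stop(sample))
```

A stop request within a few seconds of startup, right after a fresh install, matches the behavior described here.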

mna avatar Jun 05 '24 13:06 mna

Ok I think I understand what's happening - it's due to how Fleet never gets the result of the windows-remove-fleetd.ps1 script (since the script removes fleetd from the host, so it cannot notify Fleet of the result of its execution). This causes the script execution request to stay in "pending/upcoming" state and when fleetd is reinstalled, it receives the pending script to execute and executes it which... removes fleetd again.

We need a way to mark that script as executed even though fleetd gets killed.
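
To make the loop concrete, here is a toy model of it in Python. This is only a sketch of the behavior described above, not Fleet's actual code: the function name, the script name string, and the `reports_before_removal` flag are all hypothetical.

```python
def agent_check_in(pending, reports_before_removal):
    """Simulate a freshly installed agent picking up its pending scripts.

    Returns (agent_still_installed, scripts_still_pending).
    """
    still_pending = list(pending)
    for script in pending:
        if script == "remove-fleetd":
            # The removal script uninstalls the agent. Unless the result is
            # reported first, the request stays pending on the server.
            if reports_before_removal:
                still_pending.remove(script)
            return False, still_pending
        # Any other script runs and its result is reported normally.
        still_pending.remove(script)
    return True, still_pending


# Current behavior: every reinstall picks up the same pending request.
queue = ["remove-fleetd"]
for attempt in range(3):
    installed, queue = agent_check_in(queue, reports_before_removal=False)
    print(attempt, installed, queue)
```

With `reports_before_removal=True` the queue is empty after the first run, so the next install would survive its first check-in.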

mna avatar Jun 05 '24 13:06 mna

Hi @mna!

Thanks for the investigation! Have you tried removing the device from Fleet before re-enrolling it? If the device re-enrolls properly, it would make your idea even more valid. If not, it might be another issue 🤔

valentinpezon-primo avatar Jun 05 '24 13:06 valentinpezon-primo

Have you tried removing the device from Fleet before re-enrolling it? If the device re-enrolls properly, it would make your idea even more valid. If not, it might be another issue 🤔

I've validated it in a different way, I forced the script execution request to "done" in the database and reinstalled, and it reinstalled properly. I'll do it the way you mention too, just to double-check, but I'm pretty sure that's the issue.

mna avatar Jun 05 '24 13:06 mna

@valentinpezon-primo yeah, I confirmed that removing the device from Fleet before reinstalling also works.

mna avatar Jun 05 '24 14:06 mna

I took a look at the shutdown process of fleetd, but there's no way that I can think of that would guarantee that a "service stop" was caused by a running script, so I don't think that hooking into the shutdown to send a script result to Fleet is a viable approach (it could just as well mark an unrelated script as "done" when fleetd was shut down for another reason).

I think it will have to be something like @dantecatalfamo implemented in the Linux Wipe script where the script itself launches a (completely detached) sub-process that executes the actual job, while returning control to the fleetd agent so that it can send the result to Fleet.

mna avatar Jun 05 '24 18:06 mna

@valentinpezon-primo Heads-up that there's a PR with a fix to the script, but if you want to give it a try ahead of the merge or release, since it's just a script file and not part of the build, you can get it directly from the PR (you can see the diff here: https://github.com/fleetdm/fleet/pull/19643/files#diff-45b7a70ea669452e3b6727ea5c6b5bcc5a689f3222634b1cc2920fd94384f37d).

As mentioned in the previous comment, the basic idea is that the script starts a sub-process that will do the actual removal and returns without waiting for it to finish, so that fleetd can send the script execution results to Fleet and mark the script request as "done" (so that it doesn't try to run again if the agent is reinstalled). Let me know if you run into any issues!
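
For anyone curious about the general pattern, here is a minimal Python sketch of "start a detached worker and return right away". It is not the PowerShell script from the PR. The helper names `start_detached` and `run_removal_script` are hypothetical, and the "removal" is simulated by writing a marker file after a short delay.

```python
import os
import subprocess
import sys


def start_detached(args):
    """Start args as a process that outlives the caller; return its PID
    without waiting for it to finish."""
    kwargs = dict(
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if os.name == "nt":
        # Windows: no inherited console, separate process group.
        kwargs["creationflags"] = (
            subprocess.DETACHED_PROCESS | subprocess.CREATE_NEW_PROCESS_GROUP
        )
    else:
        # POSIX: run the child in its own session.
        kwargs["start_new_session"] = True
    return subprocess.Popen(args, **kwargs).pid


def run_removal_script(marker_path):
    """Stand-in for the removal script: hand the slow work to a detached
    child and return immediately so the caller can report the result."""
    print("About to uninstall fleetd...")
    # Simulated removal: wait a moment, then write a marker file.
    child_code = (
        "import pathlib, sys, time; time.sleep(1); "
        "pathlib.Path(sys.argv[1]).write_text('removed')"
    )
    pid = start_detached([sys.executable, "-c", child_code, marker_path])
    print(f"Removal process started: {pid}.")
    return pid
```

The caller gets a PID back while the worker is still sleeping, which is the window the agent needs to send the script result before it is removed.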

mna avatar Jun 11 '24 14:06 mna

Ran through a couple scenarios and can confirm the new script is working as expected. Re-enrolling my windows host succeeds after successful runs. QA Approved!

Scenario A: Uninstall from CLI

Steps:

  1. Run removal script
fleetctl run-script --script-path=scripts/mdm/windows/windows-remove-fleetd.ps1 --host=B953CB00-EBD2-11EE-A95C-F83E9F475400

Script is running. Please wait for it to finish...

Exit code: 0 (Script ran successfully.)

Output:

-------------------------------------------------------------------------------------

About to uninstall fleetd...
Removal process started: 9140.

  2. Successfully re-enrolled with the following .msi installer

fleetctl package --type=msi --enable-scripts --fleet-desktop --fleet-url=https://pezhub.ngrok.app [email protected] --enroll-secret=xyz

Scenario B: Uninstall from UI

Steps:

  1. Ran the unenroll script
Exit code: 0 (Script ran successfully.)
The output recorded when WIN11-PC ran the script above:
Device unregistration called successfully.
  2. Ran the remove fleetd script
Exit code: 0 (Script ran successfully.)
The output recorded when WIN11-PC ran the script above:
About to uninstall fleetd...
Removal process started: 4132.

Both scripts show as completed and don't remain in a pending state, which is what was causing the undesired removal loop.

Note: I made sure to test this with and without deleting the device from fleet after the scripts ran, before re-enrolling.

I also ran thru the same scenario in reverse i.e. remove fleetd first, then unenroll. This left the unenroll script in a pending state. Re-enrolling (and leaving the device in Fleet) with the .msi succeeded and the original unenroll script changed from pending to success without trying to unenroll the device.

Scenario C: I ran thru the same steps with Postman API successfully

@valentinpezon-primo Hopefully that covers most scenarios but curious to hear about your results!

PezHub avatar Jun 14 '24 01:06 PezHub

Thanks for the detailed answer @PezHub !

It should cover our usecases yeah 👌

valentinpezon-primo avatar Jun 14 '24 07:06 valentinpezon-primo

Reinstall, like rain, Windows agent blooms again, Fleet's cycle, unchained.

fleet-release avatar Jun 21 '24 00:06 fleet-release