Wiping Windows hosts puts some laptops in a non-bootable state

Open ddribeiro opened this issue 1 year ago • 17 comments

Fleet version: Observed in customer-preston's environment

Web browser and operating system: N/A


💥  Actual behavior

customer-preston is reporting that after issuing a wipe command through Fleet, some of their Windows hosts end up in a non-bootable state and Windows needs to be re-installed.

@allenhouchins and @nonpunctual: we've seen this ourselves when wiping Windows workstations. Success rate is 50%.

🧑‍💻  Steps to reproduce

  1. Enroll a Windows host to Fleet
  2. Wipe a Windows host

🕯️ More info (optional)

See comment section for more info.

🛠️ To fix

  1. Investigate: if Fleet were to use a different MDM command (doWipe or doWipeProtected) would that resolve the issue while maintaining the expected behavior?
  2. If yes, make the change.
  3. If there's a change in the behavior based on using a different MDM command let Product Design know.

ddribeiro avatar Sep 11 '24 15:09 ddribeiro

Additional notes: Microsoft's RemoteWipe CSP Documentation

It is unclear from Microsoft's documentation whether a regular doWipe command will actually prevent the device from being non-bootable. From the doWipe CSP description:

If a doWipe reset is started and then interrupted, the PC will attempt to roll-back to the pre-reset state. If the PC can't be rolled-back, the recovery environment will take no additional actions and the PC could be in an unusable state and Windows will have to be reinstalled.

From the doWipeProtected description:

The doWipeProtected is functionally similar to doWipe. But unlike doWipe, which can be easily circumvented by simply power cycling the device, doWipeProtected will keep trying to reset the device until it's done.

I think "functionally similar" is an interesting choice of words. My interpretation is that both commands perform the same action under the hood, but doWipeProtected will continuously attempt to reset the device if it is interrupted. This is probably something we need to dive into a little deeper before we decide to pick this up.

cc: @nonpunctual

ddribeiro avatar Sep 11 '24 15:09 ddribeiro

My interpretation of the doWipeProtected command in the Microsoft docs is that it is intended to be used only in the case that a device is irretrievably lost (e.g., in a river, smashed in traffic, stolen).

Fleet should be using doWipeProtected for this. The intention of the Fleet wipe feature is to give admins a way to protect an asset belonging to an organization that is irretrievably lost.

customer-preston is not using the feature this way. They are using the Fleet wipe feature to repurpose devices for MSP customers. This is a valid use case, but it is not aligned with the feature as is.

My opinion here is that if we do anything, we should either add a feature for device repurposing or instruct the customer that the feature we've deployed is not intended for device repurposing & they could create their own solution for this.

nonpunctual avatar Sep 11 '24 17:09 nonpunctual

Before we dedicate any design/eng resources let's understand if the DoWipe CSP will solve the customer's need: reset the device w/o having to re-install Windows.

How?

  • Send DoWipe 10 times. How many times was the device in a non-bootable state?
    • What is the end user experience? Can the end user cancel it? Can they cancel it w/ Reboot?
  • Send DoWipeProtected 10 times. How many times was the device in a non-bootable state?

If DoWipe performs better then consider building a "Reset" option.

noahtalerman avatar Sep 12 '24 19:09 noahtalerman

Dave: https://learn.microsoft.com/en-us/answers/questions/247954/wipe-action-resulting-in-recovery-failure-on-windo

noahtalerman avatar Sep 12 '24 20:09 noahtalerman

@noahtalerman @ddribeiro @dherder What I actually think is more troubling is that doWipeProtected seems like it should completely erase the computer every time, no matter what. The fact that it's being reported that it only deletes the computer sometimes (what's actually being reported is that sometimes it DOES NOT wipe the computer, which is actually what I think the customer wants, i.e., they do not want to have to reinstall Windows...) is an issue for Microsoft imo.

nonpunctual avatar Sep 12 '24 21:09 nonpunctual

@nonpunctual We should clarify: when we say "fail" (in Noah's comment) we mean "the device wipes but results in a non-bootable state." From what I understand, the device always wipes successfully when a wipe command is issued from Fleet.

ddribeiro avatar Sep 12 '24 21:09 ddribeiro

@ddribeiro and @nonpunctual thank you both!

I updated my comment here to clarify "failed":

How many times was the device in a non-bootable state?

noahtalerman avatar Sep 12 '24 22:09 noahtalerman

Hey @georgekarrv and @lukeheath do you think we could get some QA/engineering help to do the testing outlined in this comment?

It will help us understand the problem so we can come up with the best solution.

I think @ddribeiro can help guide whoever ends up doing the testing.

noahtalerman avatar Sep 13 '24 18:09 noahtalerman

@noahtalerman I'm not sure if there is any immediate capacity, but this could be estimated and brought into the sprint as a timebox item.

lukeheath avatar Sep 13 '24 18:09 lukeheath

@zayhanlon and I decided to pull this one out of the current design sprint and prioritize the following request instead:

  • #22028

noahtalerman avatar Sep 16 '24 20:09 noahtalerman

Found this here: image

It seems that using doWipeProtected on an encrypted device leaves the device unbootable (which is the exact problem we have). It's easy to test if you have some Windows test devices.

Here are the different test scenarios I see:

Use doWipeProtected on encrypted device:

  • encrypt device
  • use doWipeProtected
  • confirm the behavior

Use doWipe on encrypted device:

  • encrypt device
  • use doWipe
  • confirm the behavior

Use doWipeProtected on un-encrypted device:

  • remove device encryption
  • use doWipeProtected
  • confirm the behavior

Use doWipe on un-encrypted device:

  • remove device encryption
  • use doWipe
  • confirm the behavior
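
The four scenarios above form a 2x2 matrix (encryption state by wipe command). A minimal sketch of generating that checklist, assuming nothing beyond the steps listed; `build_scenarios` and the dictionary keys are hypothetical names used only for illustration:

```python
from itertools import product

# The two variables being crossed, in the order used in the list above.
ENCRYPTION_STATES = ("encrypted", "un-encrypted")
WIPE_COMMANDS = ("doWipeProtected", "doWipe")


def build_scenarios():
    """Return one checklist entry per (encryption state, wipe command) pair."""
    scenarios = []
    for encryption, command in product(ENCRYPTION_STATES, WIPE_COMMANDS):
        # First step depends on whether the device should be encrypted.
        prep = "encrypt device" if encryption == "encrypted" else "remove device encryption"
        scenarios.append({
            "name": f"Use {command} on {encryption} device",
            "steps": [prep, f"use {command}", "confirm the behavior"],
        })
    return scenarios


for scenario in build_scenarios():
    print(scenario["name"], "->", "; ".join(scenario["steps"]))
```

This only enumerates the manual steps; the wipe itself and the check of the resulting boot state still have to be done on real hardware.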

valentinpezon-primo avatar Sep 19 '24 11:09 valentinpezon-primo

Linked to Unthread ticket:

Feature Requests and Issues Recap #2838

JoStableford avatar Sep 19 '24 20:09 JoStableford

Problem

customer-preston is reporting that after issuing a wipe command through Fleet, some of their Windows hosts end up in a non-bootable state and Windows needs to be re-installed.

They believe this is because Fleet is sending the doWipeProtected Windows CSP and that sending using doWipe instead would prevent this behavior.

What have you tried?

The customer would be able to build their own Windows CSP that uses doWipe instead of doWipeProtected and send it to the Fleet API.

However, there are benefits they get by using the native Fleet behavior, including:

  • Calling a single API endpoint that works for all platforms.
  • Device lock state reporting in Fleet

Potential solutions

When issuing a wipe command using Fleet's native functionality, Fleet could support the ability for an admin to specify what kind of wipe command is issued. To solve this problem, Fleet could offer the option to send a doWipe or doWipeProtected command.

What is the expected workflow as a result of your proposal?

When customer-preston refreshes a Windows device to issue to a new user, they use Fleet's wipe command to erase the previous user's data. In this workflow, they would:

  • Call the Fleet API to wipe the device
  • In the API call, they would specify they want to use a doWipe instead of the default doWipeProtected behavior.
  • The device would receive the command and wipe successfully. If the process is interrupted, the computer would not end up in a non-bootable state like what appears to happen with a doWipeProtected command.

Because this workflow is not a stolen/lost device situation, doWipeProtected is not required.

noahtalerman avatar May 15 '25 18:05 noahtalerman

Brock: Response to CrowdStrike outage: https://patchmypc.com/quick-machine-recovery-cloud-based-remediation

noahtalerman avatar May 15 '25 18:05 noahtalerman

I was able to replicate this in dogfood: https://dogfood.fleetdm.com/hosts/1230

allenhouchins avatar May 15 '25 18:05 allenhouchins

@juan-fdz-hawa does this seem like an appropriate test plan? Are there any UI changes? (I didn't see any) Is there anything else that would be good to take a look at?

doWipeProtected

  1. Enroll a Windows host with MDM enabled
  2. Using the /api/v1/fleet/hosts/:id/wipe endpoint send a wipe command with the body:
{
    "wipe_type": "doWipeProtected"
}
  • [ ] Ensure Windows host is wiped but not unbootable

doWipe

  1. Enroll a Windows host with MDM enabled
  2. Using the /api/v1/fleet/hosts/:id/wipe endpoint send a wipe command with the body:
{
    "wipe_type": "doWipe"
}
  • [ ] Ensure Windows host is wiped but not unbootable

  • [ ] Verify API docs are updated and make sense as written

jmwatts avatar Jun 20 '25 21:06 jmwatts

I tried:

doWipe

Enroll a Windows host with MDM enabled. Encrypt the host. Using the /api/v1/fleet/hosts/:id/wipe endpoint send a wipe command with the body: { "wipe_type": "doWipe" }

  • [🔴] Ensure Windows host is wiped but not unbootable - unfortunately it seems to have left me in a non-operable state

level=debug ts=2025-06-20T21:32:16.773175Z component=http [email protected] method=POST uri=/api/v1/fleet/hosts/45/wipe took=26.289ms

Image

I've tried all of the options and I'm not able to get a functioning install of windows going.

NOTE: There is no way to tell from the UI or the Fleet log which command was sent. There is also no way to distinguish which command is sent via the UI (guessing it will continue to be the doWipeUnprotected one?). Maybe we should make it clear in the API docs which one is sent if neither is specified?

I'll have to get my Windows box functioning before I can continue testing.

jmwatts avatar Jun 20 '25 21:06 jmwatts

There are no UI changes. When issuing a wipe command from the UI, it should work as before (and do a doWipeProtected). The only way to run the new wipe command is via the API endpoint. That said, the posted payload is wrong (see here); it should be something like:

{   "windows": { "wipe_type": "doWipe" | "doWipeProtected" } }

The payload is also optional, so if none is provided the wipe operation should default to doWipeProtected.
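
To make the nesting explicit, here is a minimal sketch of a client-side helper that builds the request body in the shape described above. The helper name `build_wipe_body` and the choice to reject unknown values on the client are assumptions made for illustration; they are not part of Fleet.

```python
import json
from typing import Optional

# The two values named in this thread.
VALID_WIPE_TYPES = ("doWipe", "doWipeProtected")


def build_wipe_body(wipe_type: Optional[str] = None) -> Optional[str]:
    """Return the JSON body for the wipe request.

    Returns None when no wipe type is given, so the request is sent without
    a body and the server-side default (doWipeProtected, per the comment
    above) applies.
    """
    if wipe_type is None:
        return None
    if wipe_type not in VALID_WIPE_TYPES:
        raise ValueError(f"wipe_type must be one of {VALID_WIPE_TYPES}, got {wipe_type!r}")
    # wipe_type is nested under "windows", not placed at the top level.
    return json.dumps({"windows": {"wipe_type": wipe_type}})


print(build_wipe_body("doWipe"))
```

Validating on the client is a defensive choice: as seen earlier in this thread, a body with `wipe_type` at the top level appears to have fallen through to the default.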

juan-fdz-hawa avatar Jun 23 '25 17:06 juan-fdz-hawa

Thanks @juan-fdz-hawa This makes sense, I'm guessing that since I had the payload wrong, it defaulted to doWipeProtected which bricked my test machine.

I'd like to suggest we provide a very explicit example of the correct payload otherwise they may end up in the same unbootable state.

If I can figure out how to get my test machine bootable again, I can try with the correct payload.

jmwatts avatar Jun 23 '25 17:06 jmwatts

Yup, and yup.

I'll update the docs and add some debug log lines to make testing/debugging easier.

juan-fdz-hawa avatar Jun 23 '25 17:06 juan-fdz-hawa

QA Notes

I was able to test wiping a windows host using the following API command and payload:

POST /api/v1/fleet/hosts/:id/wipe

{   "windows": 
    { "wipe_type": "doWipe" } 
}

This wiped the Windows host and the host eventually recovered (after recovery and updates were automatically completed).
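
For reference, a sketch of how the doWipe request above could be assembled with Python's standard library. The base URL, host ID, and token are placeholders, the `Authorization: Bearer` header is an assumption about how the API token is passed, and `build_wipe_request` is a hypothetical helper. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_wipe_request(base_url: str, host_id: int, api_token: str, wipe_type: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the host wipe endpoint."""
    body = json.dumps({"windows": {"wipe_type": wipe_type}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/fleet/hosts/{host_id}/wipe",
        data=body,
        method="POST",
        headers={
            # Assumption: the API token is passed as a bearer token.
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values; replace with a real server, host ID, and token.
req = build_wipe_request("https://fleet.example.com", 11, "EXAMPLE_TOKEN", "doWipe")
print(req.get_method(), req.full_url)
# Sending is deliberately left out: urllib.request.urlopen(req) would wipe the host.
```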

The following API command/payload wipes the host and results in an unbootable state:

POST /api/v1/fleet/hosts/:id/wipe

{   "windows": 
    { "wipe_type": "doWipeProtected" } 
}

I did see the "Optional request body" addition to the docs, thank you! I've not tested the additional logging in the open PR. I can test that once it's been cherry-picked to the RC.

jmwatts avatar Jun 23 '25 22:06 jmwatts

Additional QA notes

Tested logging:

level=debug ts=2025-06-26T16:36:17.849003Z msg="Windows host wipe request" wipe_type=doWipe
level=debug ts=2025-06-26T16:36:17.895251Z component=http [email protected] method=POST uri=/api/v1/fleet/hosts/11/wipe took=61.584791ms
level=debug ts=2025-06-26T16:40:57.430057Z msg="Windows host wipe request" wipe_type=doWipeProtected
level=debug ts=2025-06-26T16:40:57.488668Z component=http [email protected] method=POST uri=/api/v1/fleet/hosts/48/wipe took=80.156667ms
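
When checking which command was actually sent, the `wipe_type` field can be pulled out of these debug lines. A minimal sketch, assuming the key=value layout shown above with double-quoted values; `parse_logfmt` and `wipe_types` are hypothetical helpers, and the sample lines are copied from the log excerpt:

```python
import shlex


def parse_logfmt(line: str) -> dict:
    """Split a key=value log line into a dict, honoring double-quoted values."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value pairs
            fields[key] = value
    return fields


def wipe_types(lines):
    """Return the wipe_type of every "Windows host wipe request" line, in order."""
    found = []
    for line in lines:
        fields = parse_logfmt(line)
        if fields.get("msg") == "Windows host wipe request" and "wipe_type" in fields:
            found.append(fields["wipe_type"])
    return found


SAMPLE = [
    'level=debug ts=2025-06-26T16:36:17.849003Z msg="Windows host wipe request" wipe_type=doWipe',
    "level=debug ts=2025-06-26T16:36:17.895251Z component=http method=POST uri=/api/v1/fleet/hosts/11/wipe took=61.584791ms",
    'level=debug ts=2025-06-26T16:40:57.430057Z msg="Windows host wipe request" wipe_type=doWipeProtected',
]

print(wipe_types(SAMPLE))
```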

This one is good to go!

jmwatts avatar Jun 26 '25 16:06 jmwatts

Windows wipe, a risk,
Fleet's gentle fix restores,
Safe in cloud's embrace.

fleet-release avatar Jun 30 '25 23:06 fleet-release