fleet Update process fails to complete on Windows machines with slow internet connection causing the Fleet service to get killed

Orbit version: 0.0.13

Operating system: Windows 10 (any patchlevel and version)

🧑‍💻 Expected behavior

The update process should complete during initialization

💥 Actual behavior

The Update process does not complete during initialization

More info

I recently encountered a field issue where the orbit service failed to start on more than 500 windows machines. The orbit log shows the following recurrent lines:

2022-09-20T13:03:27Z INF hash(orbit)=dd0fb6620146740b22047a921cfc9464e508d231b5f3487c6164148bedfa4a2b208c05513c079909912c2b407937363cb09011e331230af1cb37e47090e510a8
2022-09-20T13:03:27Z INF hash(osqueryd)=89528edc02f8d3c10f86e0900e2a06aa791c36cf9c5f92dbf3d2daf9d284bbef39e6dee3ee6ef37a7a62f2368432690a4272d8f693c2855f7e61d1197dac4898
2022-09-20T13:03:28Z INF update detected target=orbit

2022-09-20T13:04:02Z INF hash(orbit)=dd0fb6620146740b22047a921cfc9464e508d231b5f3487c6164148bedfa4a2b208c05513c079909912c2b407937363cb09011e331230af1cb37e47090e510a8
2022-09-20T13:04:02Z INF hash(osqueryd)=89528edc02f8d3c10f86e0900e2a06aa791c36cf9c5f92dbf3d2daf9d284bbef39e6dee3ee6ef37a7a62f2368432690a4272d8f693c2855f7e61d1197dac4898
2022-09-20T13:04:04Z INF update detected target=orbit

2022-09-20T13:04:37Z INF hash(orbit)=dd0fb6620146740b22047a921cfc9464e508d231b5f3487c6164148bedfa4a2b208c05513c079909912c2b407937363cb09011e331230af1cb37e47090e510a8
2022-09-20T13:04:37Z INF hash(osqueryd)=89528edc02f8d3c10f86e0900e2a06aa791c36cf9c5f92dbf3d2daf9d284bbef39e6dee3ee6ef37a7a62f2368432690a4272d8f693c2855f7e61d1197dac4898
2022-09-20T13:04:42Z INF update detected target=orbit

I found that the SHA512 orbit hash corresponds to orbit.exe version 0.0.13 (I crafted this quick-and-dirty powershell to find this out). From the logs, it seems that the update action flow triggered during orbit service initialization found that an update was required, so it started to download the update file but failed to complete. The update process failure can be verified by checking that the update start message here is logged, but that the update complete message here is not logged. Which effectively shows that the the actions within UpdateTarget never completed.

The failure is attributed to the windows Service Control Manager (SCM), which kills the orbit.exe process after taking more than 30secs to initialize (see that log shows that start attempts are attempted every ~35 secs). The 30secs is the default SCM service start timeout value. So if for some reason, the download process on the update runner here fails to complete within this timeframe, the orbit process gets killed.

This failure can be reproduced by installing an old orbit MSI package on a VM (as a way to force the update), and then throttling down the network speed of the VM internet connection (I forced the internet connection to be 1.5Mbps on my tests).

A potential fix for this will be to perform the service update action during service initialization in an asynchronous fashion (removing the blocking update action from the critical service initialization path), or by interacting with SCM APIs to inform that more time is needed during initialization (orbit should inform SERVICE_START_PENDING while the download is happening - see #7874 for more info on these required changes.

Sep 21 '22 14:09 marcosd4h

Hi @marcosd4h, thanks for the great analysis.

The failure is attributed to the windows Service Control Manager (SCM), which kills the orbit.exe process after taking more than 30secs to initialize

Now am wondering how does SCM know when Orbit is "initialized"? I cannot find any code in Orbit related to SCM.

Sep 22 '22 14:09 lucasmrod

I asked the same question and Marcos guessed that SCM considers it initialized when it execs the osquery process. This seems plausible, but we did not confirm.

Sep 22 '22 15:09 zwass

@zwass @lucasmrod I further investigated this and found that it was indeed because of the osqueryd.exe that SCM considers Fleet service to be initialized.

So to determine this, I first use ETW to trace the behavior of the SCM. On this ETW trace here, it can be seen that the SCM process (PID 644 - services.exe) is being asked to update the status of the Fleet Service. This means that someone should have called SetServiceStatus() on behalf of Fleet service. So next step here was to determine who was calling this out. For this, I inspected the RPC calls made by orbit.exe and osqueryd.exe. RPC (aka MSRPC) is the internal protocol many subsystems use on windows, including the SCM API. So to find who is calling SCM through RPC I used an RPC Monitor tool.

On the trace below, you can see the RPC calls made by orbit.exe, and the RPC calls made by osqueryd.exe. One of the first RPC calls made by osqueryd.exe to services.exe (SCM) is the undocumented RPC RSetServiceStatus() call to update the status of its service.

Orbit.exe RPC calls

Osqueryd.exe RPC calls

So it is osqueryd.exe that ends up talking with the SCM and setting the Fleet Service status (SCM threats both orbit.exe and osqueryd.exe as being a single service due to parent-child relationship). The calls to the SCM on osqueryd codebase can be observed here and here. I did a quick test commenting out this block and could confirm that the service status was not set anymore on the windows SCM once osqueryd.exe does not perform these calls anymore.

Knowing that osqueryd.exe is calling the SCM to update the status of the service and that this does not fully reflect the initialization status of orbit.exe, I think we need to do two things here:

Implement the SCM calls on orbit.exe as suggested on #7874 .
As the osqueryd.exe lifecycle is managed by orbit.exe, there should be no calls to the SCM from osqueryd.exe. To fix this, we need to push a PR on osquery codebase to restrict osqueryd.exe from talking with the SCM when a specific command line argument is passed to osqueryd.exe (Something like --no-scm-calls should work)

Sep 22 '22 22:09 marcosd4h

Great research! I think we can probably tackle #7874 first and then work on getting the patch into osquery? We will still need to account for older osquery versions that don't yet have the ability to disable the service controls.

Sep 22 '22 22:09 zwass

Thanks for the analysis (bis), am learning a lot!

there should be no calls to the SCM from osqueryd.exe.

What would happen if we implement (1) (https://github.com/fleetdm/fleet/issues/7874), but not (2)?

Sep 26 '22 10:09 lucasmrod

Hey guys, I've just pushed a PR to address this issue. See #7934.

@zwass I agree that it is better to handle this change on Orbit first and then tackle osqueryd. @lucasmrod If we don't implement (2) we will have duplicated SetServiceStatus calls to signal the service initialization status (one coming from Orbit and one coming from osqueryd). As the one from Orbit came first, the second call made by osqueryd one has no side effects as the service is marked as started already. So required code change is more around avoiding unnecessary calls when needed. I also tested what happened during the service stop request, and didn't find any issues there either. I added a few end-to-end tests to Github CI to ensure that the service behaves as expected and that the service startup time is within the expected time window.

Sep 26 '22 13:09 marcosd4h

fleet fleet copied to clipboard

Update process fails to complete on Windows machines with slow internet connection causing the Fleet service to get killed

🧑‍💻 Expected behavior

💥 Actual behavior

More info

fleet
fleet copied to clipboard