
Bug: Script timeout is currently not respected by Orbit which in turn does not respect the modified timeout.

Open sharon-fdm opened this issue 1 year ago • 8 comments

Fleet version:

Web browser and operating system:


💥  Actual behavior

On macOS, a script timeout doesn't take effect until the command currently running in the script completes. For example, a script containing sleep 10 with a timeout of 5s will time out after 10s, not after 5s. This may be desired behavior, as killing a command before completion could leave the device in an unstable state. However, this does not occur on Windows devices, as the implementation there is slightly different.

🧑‍💻  Steps to reproduce

On macOS:

  • Run a script containing sleep 10 with a timeout of 5 seconds.
  • The script run will not send a response until the 10-second mark, and the response states that it timed out after 5 seconds.

🕯️ More info (optional)

Part of this may be desired behavior, as killing some commands (like an OS update) may leave the computer in an unknown state. If that's the case, we should only document this and adjust the returned error message to reflect the actual runtime interval.

Another proposed solution is to introduce a "force timeout" option when running a script.

sharon-fdm avatar Jul 08 '24 14:07 sharon-fdm

@sharon-fdm Reminder to use bug template and populate as much information as possible in the body section.

lukeheath avatar Jul 12 '24 22:07 lukeheath

Hey @zayhanlon heads up that a fix for this bug is not targeted to ship in the next Fleet release (4.46)

After sprint planning today, we decided to prioritize the "Support Zero Trust workflow w/ live queries: 6 queries on 13k hosts" story (#17379) instead.

cc @sharon-fdm

noahtalerman avatar Aug 05 '24 21:08 noahtalerman

Waiting for @mostlikelee to provide details

getvictor avatar Aug 12 '24 12:08 getvictor

@mostlikelee this is something we identified when working with the scripts. Could you please fill in the details on this? (I can't recall which OSs have this issue and under what conditions.)

sharon-fdm avatar Aug 29 '24 15:08 sharon-fdm

We may need some product input here, but the issue is that the script timeout doesn't take effect until the command currently running in the script completes. For example, a script containing sleep 10 with a timeout of 5s will time out after 10s, not after 5s. This may be desired behavior, as killing a command before completion could leave the device in an unstable state. However, this does not occur on Windows devices, as the implementation there is slightly different. My findings were on macOS, and I suspect this also occurs on Linux.

@noahtalerman curious about your thoughts here on when a script timeout should take effect.

mostlikelee avatar Aug 29 '24 16:08 mostlikelee

@nonpunctual @spokanemac curious about your thoughts on this.

mostlikelee avatar Aug 29 '24 18:08 mostlikelee

@mostlikelee, I have no idea how this actually works. How I expect it to work: capture the PID of the process, log a timestamp, and fork a background process that sleeps for the timeout period. At the end of the timeout, check whether the PID still exists and, if so, kill it.

spokanemac avatar Aug 29 '24 23:08 spokanemac

@mostlikelee

As an admin, I should be able to:

  • run a script locally to test it
  • have Fleet run the same script with the same behavior & results on hosts as my local test

That's all. I know that's not a direct answer but I think this is what Fleet admins who upload scripts expect. Thanks.

nonpunctual avatar Aug 29 '24 23:08 nonpunctual

@spokanemac @nonpunctual thanks for the feedback. I believe the script timeout config exists primarily to protect admins in case they accidentally write something in their scripts that takes too long or never exits, like an infinite loop or a command that never ends. The current implementation should take care of the infinite loop, but will not protect against a command that never ends. Maybe that's a corner case we shouldn't be worrying about at the moment.

If we change the behavior to kill in-progress commands, it could become a footgun if the timeout is set to a low value and kills something important, like an OS update in progress. I'm guessing that killing an OS update at the wrong time may leave the device in a bad state.

mostlikelee avatar Aug 30 '24 14:08 mostlikelee

After speaking with @nonpunctual we believe the current behavior is indeed a bug. The bug details have been updated.

mostlikelee avatar Aug 30 '24 15:08 mostlikelee

Script timeout lapse, Like a falling leaf delayed, Now finds timely path.

fleet-release avatar Nov 12 '24 18:11 fleet-release