fleet icon indicating copy to clipboard operation
fleet copied to clipboard

Increase timeout limit for scripts

Open nonpunctual opened this issue 1 year ago • 16 comments

Admins do not wait for scripts to run. There should be no conflation around the idea of Fleet doing something "bad" when or if there is not an instantaneous script result from a host.

See: https://github.com/fleetdm/fleet/issues/9583#issuecomment-1924196025

I think we should look at how the script features are implemented, but, until that work can be done the proposal is to increase limits:

  • script character limit = not less than 1MB of data
  • script timeout = not less than 30m

image

https://docs.google.com/document/d/1Znyp2a9qcM9JdYHrzLudvcPwEdhnCg7RiKi22s8yGWw/edit

nonpunctual avatar Feb 07 '24 02:02 nonpunctual

Crowdstrikes powershell install script is 520 loc.

harrisonravazzolo avatar Feb 07 '24 18:02 harrisonravazzolo

@nonpunctual, @dherder, @harrisonravazzolo, and @mikermcneil I filed a separate "Increase character limit for saved scripts" story and pulled it into the current design sprint here: #16668

This way, we can move quickly on the script character limit which is blocking customers workflows.

I updated this story to only cover increasing the timeout limit for scripts. Do we know of any customers that are blocked by the 5 minute timeout? What scripts are they trying to run?

noahtalerman avatar Feb 08 '24 14:02 noahtalerman

@noahtalerman Thanks. I feel like we are going around in circles a bit on this.

I think we should assume that 5m isn't enough the same way we are going to assume that 10000 characters isn't enough. In my opinion, there should not be arbitrary limits on the scripts we allow customers to run.

Other products don't seem to have these limitations. Ideally, I think we should address the design choices that led us to them. but, for now, adjusting these limits is good for customer needs.

nonpunctual avatar Feb 08 '24 15:02 nonpunctual

@nonpunctual it makes sense to bump the timeout. Bringing this to feature fest.

design choices that led us to them

For timeouts, we want to save the users from themselves. If a script never ends then this will prevent all other scripts from running.

noahtalerman avatar Feb 09 '24 15:02 noahtalerman

Then we perhaps need to reconsider the queuing feature.

It's good to be prescriptive & opinionated & drive the user towards certain behaviors.

But on the admin / user side I don't know if the script feature is the place for it. There is a sense in which we have to assume the admin knows what they are doing or what they intend to do & we should let them do it.

I like the idea of the queue but I believe there is a fundamental misunderstanding of how scripts are used in this design. If we make every script contingent on every other script & in my queue are scripts that have nothing to do with each other, we are creating contingency for no reason.

nonpunctual avatar Feb 09 '24 15:02 nonpunctual

Heads up @nonpunctual , this feature request was brought to feature fest on 2024-02-15 and wasn't prioritized for the current design sprint.

noahtalerman avatar Feb 19 '24 14:02 noahtalerman

Hey @harrisonravazzolo, curious to get your feedback on this one.

Have y'all run into any scenarios in which a script you were trying to run was cut off after 5 minutes? If yes, what was the script?

noahtalerman avatar Feb 20 '24 22:02 noahtalerman

Hey @noahtalerman - haven't run into this one yet as the script we want to run is over the 10k char limit. This will be the first time we use the script feature in Fleet

harrisonravazzolo avatar Feb 20 '24 23:02 harrisonravazzolo

UPDATE: We chose to push this out of the design sprint. Why? We haven't heard of a customer running into issues w/ the 5 min limit yet. Once we get this feedback we can adjust. We want evidence that 5 minutes is too short before we make changes.

Hey @nonpunctual heads up, this story was prioritized during feature fest.

Aiming to ship an improvement in the next 6 weeks.

noahtalerman avatar Mar 08 '24 15:03 noahtalerman

UPDATE: We chose to push this out of the design sprint. Why? We haven't heard of a customer running into issues w/ the 5 min limit yet. Once we get this feedback we can adjust. We want evidence that 5 minutes is too short before we make changes.

FYI @nonpunctual

noahtalerman avatar Mar 08 '24 16:03 noahtalerman

Ok. I feel like we are going to be forcing Support to handle this problem. I don't think we have a representative sample of customers using the script feature to make this decision based on customer feedback.

I feel pretty strongly that this design for scripts should be revisited. Thanks.

nonpunctual avatar Mar 08 '24 16:03 nonpunctual

@nonpunctual, please let me know right away when a customer runs into the 5 minute timeout so we can prioritize changing the product.

We decided to prioritize other feature requests over this one because no customers are feeling the pain (yet).

Once they do, we'll follow up quickly with an improvement.

noahtalerman avatar Mar 12 '24 14:03 noahtalerman

@noahtalerman comment from customer-flacourtia: Screenshot 2024-05-15 at 2 34 16 PM

Customer-preston has also reported running into the 5m limit.

nonpunctual avatar May 15 '24 18:05 nonpunctual

If we enable longer script execution times as a synchronous scripting option, we'll also have to make sure to extend load-balancer timeouts to slightly exceed this timeout. For example 305 seconds on existing 5-minute timeouts is what we use in cloud.

This would not be needed if an asynchronous/callback method were leveraged for longer-running scripts.

rfairburn avatar May 16 '24 07:05 rfairburn

related: Timeout script remains in upcoming activities without displaying and BLOCKS other scripts to be executed https://github.com/fleetdm/fleet/issues/19059

customer-preston has again reported issues with 5m timeout on WIndows.

nonpunctual avatar May 20 '24 14:05 nonpunctual

Want to reiterate emphasis on this issue per meeting with customer-preston 20240522.

nonpunctual avatar May 22 '24 15:05 nonpunctual

customer-preston:

Screenshot 2024-05-28 at 10 55 03 AM

nonpunctual avatar May 28 '24 14:05 nonpunctual

So, @noahtalerman based on your comment from Mar 8, do the issues raised by customers regarding this feature since then clarify this issue? In my opinion, based on the feedback, this could probably be converted to a bug. Thanks.

nonpunctual avatar May 28 '24 15:05 nonpunctual

Screenshot 2024-05-30 at 1 56 28 PM

@noahtalerman @lukeheath added dogfood label to this issue per @spokanemac comments. Thanks.

nonpunctual avatar May 30 '24 17:05 nonpunctual

We could capture the PID. pid=$! So we could issue a kill command if needed.

spokanemac avatar May 30 '24 20:05 spokanemac

@nonpunctual this is still not classified as a bug even if users 'have issues with the 5m timeout'.

We understand that this is a frustrating experience when you are doing scripts that are longer than 5m but this is still working as intended and is not a bug. Bugs are a failure to execute a specific workflow that is supported. As of today scripts longer than 5m are not supported so this is still a story requesting the timeout be updated.

georgekarrv avatar May 30 '24 22:05 georgekarrv

From customer-preston:

doing script-based app management sometimes download multi-GB apps we use one script to install all customer apps Our app-install scripts timeout We consider it still running, since it's in the queue, so we do not attempt to run it again Any other script (for recovery key, MB agent, ...) is queued and NOT RUN since the queue can't be purged This is obviously a HUGE problem, since App Management is a key part of any MDM value prop, but especially on SMBs + this blocks any other form of script exec We really need solutions from you on this, starting with but not limited to: More permissive rules around script run time

nonpunctual avatar May 31 '24 02:05 nonpunctual

https://jvns.ca/blog/2015/11/27/why-rubys-timeout-is-dangerous-and-thread-dot-raise-is-terrifying/

nonpunctual avatar Jun 03 '24 14:06 nonpunctual

@marko-lisica do you think this will make it into design review to be ready for estimation and get into 4.53? thanks!

zayhanlon avatar Jun 17 '24 14:06 zayhanlon

Hey @zayhanlon, I'm going to bring this story to design review tomorrow. I believe we should be able to get it through so we can estimate it on Wednesday.

marko-lisica avatar Jun 17 '24 15:06 marko-lisica

Hey @jacobshandling, re: our conversation yesterday, we decided to use seconds instead of minutes in error messages in the UI and CLI. Later we can improve this if necessary.

marko-lisica avatar Jul 05 '24 16:07 marko-lisica

@lukeheath @noahtalerman , I approved this story going into the RC since we have customers waiting for it. It is ETAed to end of today. We should probably add a P2 label so not to break the process (will have no real effect since we will wrap it up anyway) TMWYT

sharon-fdm avatar Jul 11 '24 17:07 sharon-fdm

  • [ ] Outdated documentation changes:
    • Remove the "Run live script" endpoint from the API reference docs. Document the new agent option in the agent configuration reference docs.

Hey @marko-lisica can you please take on the reference doc changes for this user story?

I assigned you to this issue.

noahtalerman avatar Jul 19 '24 19:07 noahtalerman

@zayhanlon, @pintomi1989, and @Patagonia121 heads up that this customer request was shipped in Fleet 4.54 🚀

Reference doc updates are still TODO (see comment above) however you can find out how to use this feature by checking out the Figma link in the issue description.

noahtalerman avatar Jul 19 '24 19:07 noahtalerman

Update PATCH /api/v1/fleet/config, POST /api/v1/fleet/spec/teams and POST /api/v1/fleet/teams/:id/agent_options to support script_execution_timeout under agent_options

Hey @marko-lisica, I think we're missing these updates to the API reference docs.

If that's the case, can you please open a PR to the reference docs when you get the chance?

noahtalerman avatar Jul 25 '24 22:07 noahtalerman