bosh-agent

Lifecycle hooks can make the agent unresponsive

Open ionphractal opened this issue 1 year ago • 5 comments

Bosh-agent itself is already running with higher priority than BOSH/monit jobs to mitigate CPU-intensive workloads blocking the agent <-> director communication, see https://github.com/cloudfoundry/bosh-linux-stemcell-builder/commit/00054bd98693465dd75eda1f12a7326fc5191804 .

However, lifecycle hooks such as pre-start scripts seem to have the same negative effect on communication with the director, because they are started by the bosh-agent itself and hence run with the same priority. At least this is my assumption: I wasn't able to find a line of code that lowers that priority, and looking at a VM while it is running a pre-start reveals that the pre-start script and all of its sub-processes run with the same priority as the agent.
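The inheritance described above can be reproduced outside the agent. The following is a minimal Go sketch, assuming Linux and a `sleep` binary on the PATH; `niceOf` and `childNice` are hypothetical helper names, not bosh-agent code. It only shows that a child started via `os/exec` has the same nice value as its parent unless something changes it.

```go
//go:build linux

package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

// niceOf returns the nice value (-20..19) of the given pid; pid 0 means the
// caller. On Linux, Go's syscall.Getpriority returns the raw kernel result,
// which is 20-nice, so it is converted back here.
func niceOf(pid int) (int, error) {
	raw, err := syscall.Getpriority(syscall.PRIO_PROCESS, pid)
	if err != nil {
		return 0, err
	}
	return 20 - raw, nil
}

// childNice starts a short-lived child process and reports its nice value.
func childNice() (int, error) {
	cmd := exec.Command("sleep", "1")
	if err := cmd.Start(); err != nil {
		return 0, err
	}
	// Clean up the child once its priority has been read.
	defer func() {
		_ = cmd.Process.Kill()
		_ = cmd.Wait()
	}()
	return niceOf(cmd.Process.Pid)
}

func main() {
	parent, err := niceOf(0)
	if err != nil {
		panic(err)
	}
	child, err := childNice()
	if err != nil {
		panic(err)
	}
	fmt.Printf("parent nice=%d, child nice=%d\n", parent, child)
}
```

Both numbers should match, which is consistent with the observation that a pre-start script competes with the agent at equal priority.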

In our case, cloning a lot of data from the remaining part of a BOSH-managed PostgreSQL cluster triggers this issue intermittently. In extreme situations this extends downtime unnecessarily, because the bosh task itself errors with an agent timeout and the pre-start has to run from scratch again.

Of course, as a quick mitigation we could, for example, renice our pre-start script. Yet I would see a benefit, in consistency and hence predictability, if the bosh-agent started external scripts/binaries with a lower priority than itself.

ionphractal avatar Nov 04 '24 16:11 ionphractal

@ionphractal this seems like a good idea! Happy to review a PR.

rkoster avatar Nov 07 '24 16:11 rkoster

@rkoster I found the things to adjust and tested it for linux stemcells and it worked. But I have questions regarding the best implementation...

Correct me if I'm wrong, but imho everything that is started by bosh-agent should run with a lower priority. So I assume the GenericScript's Run function executes all commands? https://github.com/cloudfoundry/bosh-agent/blob/main/agent/script/generic_script.go#L91

I tracked this function down to bosh-utils https://github.com/cloudfoundry/bosh-utils/blob/master/system/exec_cmd_runner.go#L27.

I'm not sure where else bosh-utils is used, but I would say there is a reason why it has been pulled out. If I changed the function directly, it would be a breaking change in a sense, especially if I changed the function interface to make it configurable. Alternatively, I could automatically derive the value from the bosh-agent's priority (with an out-of-bounds check), but that would still change the expected outcome of the function. Which way do you think I should take?

  • A) Adjust the RunComplexCommand function and either
    • automatically derive executed command priority from the bosh-agent's (parent) process priority
    • OR adjust the function interface to make process priority configurable
  • B) Add a new function, e.g. RunComplexCommandNiced with either
    • configurable priority
    • OR automatic priority detection

But bosh-agent also supports Windows, which we don't use (tbh I don't know a way to test it) and which apparently would require extra libraries to work. Do you think we have to support this on Windows as well?

ionphractal avatar Mar 04 '25 12:03 ionphractal

I would go with option B with a configurable priority. This way there are no breaking changes, and bosh-utils consumers can explicitly opt in to this behavior by using the new function.

rkoster avatar Mar 31 '25 07:03 rkoster

After giving it some more thought: you could also add a field to https://github.com/cloudfoundry/bosh-utils/blob/master/system/cmd_runner_interface.go#L8 to optionally reduce the process priority. Then, when this field is set, get the priority of the agent itself and use it to set the command's priority between execution and the wait.

Cross-platform process priority manipulation can be done using: https://pkg.go.dev/github.com/hekmon/processpriority
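A rough sketch of the field-based idea, for illustration only. It assumes Linux and uses the standard library's `syscall` package rather than the library linked above; the `Command` struct, its `NiceOffset` field, and `runNiced` are hypothetical stand-ins and not the actual bosh-utils types.

```go
//go:build linux

package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

const maxNice = 19 // lowest scheduling priority on Linux

// Command is a hypothetical stand-in for a command description with an
// opt-in field; a NiceOffset of 0 keeps the current behavior.
type Command struct {
	Name       string
	Args       []string
	NiceOffset int
}

// ownNice returns the caller's nice value. The raw Linux syscall result is
// 20-nice, so it is converted back.
func ownNice() (int, error) {
	raw, err := syscall.Getpriority(syscall.PRIO_PROCESS, 0)
	if err != nil {
		return 0, err
	}
	return 20 - raw, nil
}

// runNiced starts the command, lowers its priority between Start and Wait,
// and returns the nice value observed on the child.
func runNiced(c Command) (int, error) {
	own, err := ownNice()
	if err != nil {
		return 0, err
	}
	target := own + c.NiceOffset
	if target > maxNice {
		target = maxNice
	}

	cmd := exec.Command(c.Name, c.Args...)
	if err := cmd.Start(); err != nil {
		return 0, err
	}
	pid := cmd.Process.Pid

	var prioErr error
	observed := own
	if c.NiceOffset > 0 {
		prioErr = syscall.Setpriority(syscall.PRIO_PROCESS, pid, target)
	}
	if prioErr == nil {
		raw, err := syscall.Getpriority(syscall.PRIO_PROCESS, pid)
		if err != nil {
			prioErr = err
		} else {
			observed = 20 - raw
		}
	}

	// Always reap the child, even if adjusting the priority failed.
	waitErr := cmd.Wait()
	if prioErr != nil {
		return 0, prioErr
	}
	if waitErr != nil {
		return 0, waitErr
	}
	return observed, nil
}

func main() {
	nice, err := runNiced(Command{Name: "sleep", Args: []string{"1"}, NiceOffset: 5})
	if err != nil {
		panic(err)
	}
	fmt.Println("child ran with nice value", nice)
}
```

One caveat of adjusting the priority after the start: the command runs at the agent's priority for a brief moment, and any sub-process it forks in that window keeps the old value.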

rkoster avatar Mar 31 '25 07:03 rkoster

Sounds good, I'll give it a try with the extra field.

ionphractal avatar Apr 08 '25 08:04 ionphractal