elastic-ci-stack-for-aws icon indicating copy to clipboard operation
elastic-ci-stack-for-aws copied to clipboard

Add `CancelGracePeriod` param to aws template

Open chrisdothtml opened this issue 3 years ago • 10 comments

It seems this value isn't modifiable via environment variables, so it would be nice to be able to modify this value in the cloudformation config

chrisdothtml avatar Feb 02 '21 23:02 chrisdothtml

Thanks for the PR @chrisdothtml (and I'm also looking into the broader issues you're having with cancel-grace-period).

I'm a bit nervous about adding a CloudFormation parameter for this (and potentially every other) agent configuration value. I'm wondering if we should bring in some other way to override agent configuration. For example an SSM Parameter Store namespace could be specified, and the values therein could be written to the agent configuration during boot. Do you think that would work for your use-case? (any thoughts @yob @chloeruka @lox?)

If we do stick with this CloudFormation parameter, I believe you'll need to update packer/linux/conf/bin/bk-install-elastic-stack.sh and packer/windows/conf/bin/bk-install-elastic-stack.ps1 to explicitly read it from env and write it to the config.

pda avatar Feb 15 '21 01:02 pda

I share the concern about a parameter count explosion if we expose all the agent flags, but it's also not a great situation at the moment. Changing flags with a bootstrap script is possible, but super awkward.

For example an SSM Parameter Store namespace could be specified, and the values therein could be written to the agent configuration during boot. Do

I'm not familiar enough with SSM Parameter Store to know the tradeoffs involved here, but at face value it sounds like a neat solution with lots of flexibility for the future.

yob avatar Feb 15 '21 01:02 yob

The main trade-off I can think of, and not specific to Parameter Store, is that the parameters/config would be outside the CloudFormation model. Machines would pick up whatever config was set at their individual time of boot, rather than what is defined in the stack itself. So if the parameters are changed over time, a stack could end up with different config on different machines. I don't really see this as a big problem, though, and sometimes it might be advantageous.

pda avatar Feb 15 '21 02:02 pda

Agreed that doing agent configuration in the bootstrap is super awkward and that adding CloudFormation params for every agent config key would be really noisy. How about instead of an SSM param store you could just specify an S3 url for an agent config file, similar to the existing mechanism in BootstrapScriptUrl? Something like AgentConfigUrl

chrisdothtml avatar Feb 15 '21 07:02 chrisdothtml

One complication with both the SSM and S3 approach is that there's a handful of config options that maybe do need their own stack parameter (like agent token and disconnect-after-idle-timeout) or arguably shouldn't be configurable at all (like *-path). I guess we could explicitly exclude/override those values with stack parameters?

yob avatar Feb 15 '21 10:02 yob

The SSM/S3/whatever config (let's call it “user config” for now) would certainly need to be merged somehow with the base config file and the stack-driven config keys. But maybe the user config can just take highest precedence, and it's up to the user to not set values that are meant to be controlled elsewhere?

One benefit of SSM Parameter Store over S3 is that people could more easily use an outer CloudFormation / Terraform stack to manage the user config key/values, if they wanted. But it's true that a file in S3 would be more fitting with the existing elastic-stack tech, and perhaps more approachable to people unfamiliar with SSM.

pda avatar Feb 15 '21 23:02 pda

@pda Would you be amenable to overlooking the possible slippery slope toward exposing all the configuration values (which I think we agree is suboptimal) if I or someone else were to continue this PR? Might we be able to add support for adjusting (only) this agent configuration value (using the existing system), first, before we look at different ways of configuring all the other useful agent configuration values (whatever they are)?

One reason, in particular, that cancel-grace-period should be tunable (before/as opposed to making everything tunable) is that the default is dangerous in a number of scenarios (e.g. in mine, it cancels Terraform runs mid-flight and leaves a lock file behind; trying to get Terraform, even 1.x, to wrap up in ten seconds is a non-starter). You can see this in how the original PRs/issues talk a lot about 30 second or two minutes being a better default than 10 seconds, and how lox mentions it should probably be configurable per-step

I just think worrying about needing to expose all the configuration values is probably premature. YAGNI: I've not needed anything except cancel-grace-period (or things already exposed via CloudFormation parameters - there are 50 of them already!) in all the time we've been using the AWS stack.

I guess not having an issue open is muddying things. The defect should be: "can't configure cancel-grace-period at all when using elastic-ci-stack" - because that's what matters to the users. I have no need to configure anything else, but have been waiting for cancel-grace-period support in this project for years (and I don't want to build my own AMI or use a fork; the joy of a prebuilt stack is not having to maintain that!)

Even if it is warranted to expose more than is being asked for in this PR, it seems like a different PR/issue to open and track separately; not necessarily a blocking concern for adding this parameter. (Again, happy to update packer/linux/conf/bin/bk-install-elastic-stack.sh and packer/windows/conf/bin/bk-install-elastic-stack.ps1 or whatever is necessary, of course)

Sorry to give a lengthy spiel, but I don't want the perfect solution blocking this good enough one for another six months while there's very useful functionality behind this flag.

dominics avatar Jul 21 '21 04:07 dominics

For ease of testing: #881 is this PR with the fixes that were asked for to the bk-install-elastic-stack scripts (to write out the config value). Did this mainly so I can use it myself. Although, I'm curious, @yob said:

Changing flags with a bootstrap script is possible, but super awkward.

Could anyone describe how? Maybe I can use that? Do you have to relaunch the agent process somehow? Is there a signal to get it to re-read its configuration?

AFAICT by testing, by the time the bootstrap script is running, it's too late for most sensible methods to set this flag. So all that's left is custom cloud-init code, which is a whole 'nother level of difficulty/obtrusiveness. Is there an easier way?

dominics avatar Jul 21 '21 05:07 dominics

Could anyone describe how? Maybe I can use that? Do you have to relaunch the agent process somehow? Is there a signal to get it to re-read its configuration?

The BootstrapScriptUrl parameter is executed on every instance boot after the agent config is written but before the agent is started. So you should be able to append settings to the config file using the bootstrap script before the agent process ever starts up.

keithduncan avatar Jul 21 '21 07:07 keithduncan

append settings to the config file using the bootstrap script

Brilliant, the missing link was that we always tried to set environment variables using the bootstrap before. We were already using a bootstrap script, so it was a pretty easy workaround after all! Thanks for the pointer; the issue has a much lower priority for me now.

echo "cancel-grace-period=30" >> /etc/buildkite-agent/buildkite-agent.cfg

dominics avatar Jul 22 '21 01:07 dominics