ec2-macos-init
ec2-macos-init copied to clipboard
Undocumented retries and inability to remove that functionality
By default, the ec2-macos-init application will retry execution of failed sections. This doesn't appear documented anywhere, and can cause unexpected side effects. As an example, I have a script that configures MacOS certificates and registers the instance as a runner. As the line of my shell script failed came after registration, the machine registered itself with gitlab 101 times before finally stopping. This behavior needs to be modifiable and well documented, as it can cause serious side effects if not known.
Example TOML added:
Name = "Execute_Startup"
PriorityGroup = 3 # Fourth group
RunOnce = true # Run once, ever
FatalOnError = false # Stop running Init if there is an error
[Module.Command]
Cmd = ["/usr/local/aws/ec2-macos-init/startup.sh"] # A simple command
RunAsUser = "ec2-user" # Run as ec2-user
And the shell script:
#!/bin/bash
curl -L https://www.apple.com/certificateauthority/AppleWWDRCAG3.cer > g3.cer
sudo security authorizationdb write com.apple.trust-settings.admin allow
sudo security add-trusted-cert -d -r trustAsRoot -k /Library/Keychains/System.keychain g3.cer
security set-key-partition-list -S apple-tool:,apple: -s -k "${password}" login.keychain-db
/usr/local/opt/gitlab-runner/bin/gitlab-runner register --non-interactive --url 'REDACTED'--executor 'shell' --run-untagged='false' --tag-list 'mac' --registration-token REDACTED
git lfs install
Note that git lfs install
failed due to a missing $HOME env var, causing it to import the certificate many times and register itself with gitlab 100 times.
My recommended fix would be to accept a 'RetryCount' option in the TOML to make this configurable, and explicitly set the default within the TOML, removing the const variable.
@mschultz-aofl a question unrelated to the issue here, but related to your script. The ${password}
you use there for the login keychain, would this be the password set by macos-init or where do you get this from? Asking as I have some credentials and certificates that need to be added to the login keychain, but need the password for this to work, so am a bit stuck on that point.
I use packer to generate the shell script and toss it in that path, with the password in there. It's a restricted path, and the password is set per machine so it's not really sensitive. Not that the Packer I use resets the password to a known password - it doesn't use the reset script that's a part of this tool.
So the setup is, within packer:
- Set password
- Reboot
- Create /usr/local/aws/ec2-macos-init/startup.sh with the password hard-coded on it
- Modify /usr/local/aws/ec2-macos-init/init.toml to remove password reset and enable the script above
- Create AMI
- Launch ami via Terraform
Hey @mschultz-aofl, thanks for opening this issue to track this. My first guess at what's happening is the logic behind ec2-macos-init's RetryOnce
option. When this is present in a module, ec2-macos-init will keep attempting to run the module (once per run) until it marks a successful run in the history. See this block in Module.ShouldRun()
: https://github.com/aws/ec2-macos-init/blob/79b17eaba10b88c541ce0f023a086f59a3d1930e/lib/ec2macosinit/module.go#L154-L166
I like your suggestion of adding a new config value to set the maximum number of attempts a module is run unsuccessfully. We'll need to do a bit more investigation into that addition as well as this issue as a whole.
So I don't think it's there. If you look here:
https://github.com/aws/ec2-macos-init/blob/e83f1a6ad84142b9c5d07e8ed5ad79af1e9c5923/lib/ec2macosinit/config.go#L93-L105
You can see it retries 100 times. I think just modifying this would be the easiest thing to do. However, I'm not too familiar with Go - but as to why it does this, I think either (a) history isn't re-read/modified in memory, so the running process doesn't pick up the failure, or (b), the 100 retry logic bypasses ShouldRun.
I don't that's the case for the module that you've provided as it sets FatalOnError = false
. When the module fails, ec2-maocs-init will continue to run and that InitConfig.FatalCount
type will not be incremented.
You can double check this by seeing what's written in the log file /var/log/amazon/ec2/ec2-macos-init.log
. If there's a fatal call due to the Execute_Startup
module you have defined, there should be a log line like the following:
Exiting after <runtime> due to failure in module [Execute_Startup] with FatalOnError set