cloud-init Respect runtime file locations

Proposed Commit Message

fix: Fix runtime file locations for cloud-init

Update various hardcoded filepaths. Also make sure we
bootstrap our Paths() config correctly so that we read
from the configured rundir.

Fixes GH-4766

Additional Context

Fixes https://github.com/canonical/cloud-init/issues/4766

This doesn't fix the /etc file location issue, but it also shouldn't make it any worse than it already is.

That should be fixed in a follow-up PR.

Merge type

[x] Squash merge using "Proposed Commit Message"
[ ] Rebase and merge unique commits. Requires commit messages per-commit each referencing the pull request number (#<PR_NUM>)

Jan 30 '24 00:01 holmanb

Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close.

If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging TheRealFalcon, and he will ensure that someone takes a look soon.

(If the pull request is closed and you would like to continue working on it, please do tag TheRealFalcon to reopen it.)

Feb 29 '24 00:02 github-actions[bot]

I wonder if it's worth instead just checking post ConfigMerger if merged._cfg["system_info"]["paths"]["run_dir"] != self.paths.run_dir and only re-run read_bootstrap_cfg if we've detected a change in run_dir.

This actually causes infinite recursion, since this code path is where self.paths gets defined, however we can check for change in run_dir as the trigger for re-reading config.

Feb 29 '24 01:02 holmanb

@blackboxsw I addressed your comments and fixed the tests, ready for re-review now.

Feb 29 '24 17:02 holmanb

@blackboxsw thanks for the review, I think I've addressed everything

Mar 07 '24 03:03 holmanb

The issue I'm seeing, and neglected to think about before this effort, is that our generator time-frame remains unaware of any system config run_path. This then leaves cloud-init in a split-brain scenario where our systemd generator time-frame or any early boot script invocation of ds-identify (BSD/Alpine) will store ds-identify artifacts in the default /run/cloud-init and any subsequent boot stages will observe system configuration (/etc/cloud/*) overrides and store the remainder of out python-based boot stages under the overridden system_info.paths.run_dir.

Without aligning generator timeframe run_dir artifacts and python-based boot stages with the system config overrides, cloud-init python boot stages will be unable to source the ds-identify output which determines whether cloud-init is enabled or disabled and the filtered datasource_list values.

Some thoughts, not certain how viable, but I think the generator time-frame functionality may need to be sorted in a separate prior to the python-based solution because, with the split-brain scenario, cloud-init boot stages will be looking in the wrong /run/cloud-init directory and not see the /customrundir/cloud-init/cloud.cfg which provides the filtered datasource_list as well as /customrundri/cloud-init/(enabled|disabled) files which prevent later boot stages from even running.

Ideas for fixes needed:

systemd/cloud-init-hotplugd.socket: maybe the biggest blocker as the ListenFIFO path uses a hard-coded /run/clou-init/hook-hotplug-cmd
ds-identify could use the check_config run_dir facility to attempt to use a configured run_dir. Note that check_config doesn't explicitly pay attention to YAML nested structures, so a formerly ignored top-level run_dir: something could override the official system_info:paths:run_dir nested YAML leaf even though the top-level run_dir key is ignored by python runtime as it's not the appropriate nested config value under system_info.paths
systemd/cloud-init-generator.tmpl: needs a function like ds-identify's check_config function (maybe worth separating set_run_path/check_config/read_uname_info into a /usr/lib/cloud-init/common.functions that both ds-id and generator can reuse?)

Here's a running diff that looks almost functional (though it as a direct cut-n-paste of the check_config/read_uname_info functions: generator.patch.txt

Mar 07 '24 04:03 blackboxsw

@blackboxsw great thoughts, thanks for the review.

The issue I'm seeing, and neglected to think about before this effort, is that our generator time-frame remains unaware of any system config run_path. This then leaves cloud-init in a split-brain scenario where our systemd generator time-frame or any early boot script invocation of ds-identify (BSD/Alpine) will store ds-identify artifacts in the default /run/cloud-init and any subsequent boot stages will observe system configuration (/etc/cloud/*) overrides and store the remainder of out python-based boot stages under the overridden system_info.paths.run_dir.

True. Although I don't think the path issue affects Alpine. And on BSD, ds-identify can independently behave as required due to set_run_path.

Without aligning generator timeframe run_dir artifacts and python-based boot stages with the system config overrides, cloud-init python boot stages will be unable to source the ds-identify output which determines whether cloud-init is enabled or disabled and the filtered datasource_list values.

Since ds-identify now uses /var/run for non-Linux kernels, this PR is more about making cloud-init's Python code capable of also respecting this run directory.

Multiple sources of truth is not great, but I think that with the current approach it will now at least be possible to make both use /var/run/.

Some thoughts, not certain how viable, but I think the generator time-frame functionality may need to be sorted in a separate prior to the python-based solution because, with the split-brain scenario, cloud-init boot stages will be looking in the wrong /run/cloud-init directory and not see the /customrundir/cloud-init/cloud.cfg which provides the filtered datasource_list as well as /customrundri/cloud-init/(enabled|disabled) files which prevent later boot stages from even running.

Ideas for fixes needed:

systemd/cloud-init-hotplugd.socket: maybe the biggest blocker as the ListenFIFO path uses a hard-coded /run/clou-init/hook-hotplug-cmd

I hadn't thought about this one. However, I'd be surprised if this one works on non-Linux, because udev is very Linux-specific. Searching about FreeBSD + udev I see that maybe xorg is using udev api with devd as a backend circa 2020, but I haven't found any official details about it.

ds-identify could use the check_config run_dir facility to attempt to use a configured run_dir. Note that check_config doesn't explicitly pay attention to YAML nested structures, so a formerly ignored top-level run_dir: something could override the official system_info:paths:run_dir nested YAML leaf even though the top-level run_dir key is ignored by python runtime as it's not the appropriate nested config value under system_info.paths

Using check_config for this scares me a little, but that could work. I'd be concerned about this because false positives would break all of ds-identify. Currently false positives will (I think) only result in possibly adding unwanted datasources to the list, and cloud-init's Python code has ds_detect() and a failover path which means that when this happens, cloud-init should usually still behave as intended. Maybe we limit the splash zone of this failure by checking for the existence of directories set by run_dir before using this directory? However this would mean that run_dir would also have to be a single-line k/v pair in order to work, so I'm not sure how much more we want to add reliance on our broken yaml parser.

systemd/cloud-init-generator.tmpl: needs a function like ds-identify's check_config function (maybe worth separating set_run_path/check_config/read_uname_info into a /usr/lib/cloud-init/common.functions that both ds-id and generator can reuse?)

Here's a running diff that looks almost functional (though it as a direct cut-n-paste of the check_config/read_uname_info functions: generator.patch.txt

+1 nice, thanks for putting that together

It is up to you whether we go forward with the current approach (given the points raised above), or try for a more complete run_dir-everywhere approach which depends on check_config. I think I'd lean towards moving forward with this and possibly iterating on the generator scripts separate from this PR, but I don't feel strongly about that.

Mar 07 '24 13:03 holmanb

finally got around to testing this, and the main issue I'm seeing is:

the contents of the former /run/cloud-init are now splattered in /var/run: https://gist.github.com/igalic/3c1cf15e203ddfc04d41a777c78ae093
cloud-init clean -s doesn't clean it up, but I guess that's okay, the reboot is supposed to do that.

Which, according to rcorder(8) it does not. We need to amend https://github.com/canonical/cloud-init/blob/f7c1c76104870b629d3b5e25618544431f7d8427/sysvinit/freebsd/dsidentify.tmpl#L11 to say cleanvar instead. (Unlike the comment that claims var_run should suffice)

Mar 14 '24 13:03 igalic

the contents of the former /run/cloud-init are now splattered in /var/run: https://gist.github.com/igalic/3c1cf15e203ddfc04d41a777c78ae093

that seems to be my fault, tho: https://github.com/canonical/cloud-init/blob/f7c1c76104870b629d3b5e25618544431f7d8427/config/cloud.cfg.tmpl#L334

Mar 14 '24 14:03 igalic

@holmanb here's a PR addressing my complaints: https://github.com/holmanb/cloud-init/pull/50

Mar 15 '24 11:03 igalic

@holmanb here's a PR addressing my complaints: https://github.com/holmanb/cloud-init/pull/50

great feedback, and thanks for the PR against this branch - I just merged it into this branch

Mar 18 '24 18:03 holmanb

n.b.: I've tested this on FreeBSD and it works as expected. Is there anything specific I should be looking out for?

Mar 28 '24 21:03 igalic

It is up to you whether we go forward with the current approach (given the points raised above), or try for a more complete run_dir-everywhere approach which depends on check_config. I think I'd lean towards moving forward with this and possibly iterating on the generator scripts separate from this PR, but I don't feel strongly about that.

Yes, I think we can move forward with this PR. I think this PR you have fixes the the highest-impact gaps to align run-time python config based on overrides provided in /etc/cloud/cloud.cfg.*. There are a couple of minor areas that I think we should track in new GH issues as run_dir override is not fully functional everywhere:

Let's file 3 issues for the following related to run_dir config overrides:

- Incomplete cloud-init status --format=yaml boot_status_code: unknowninstead of boot_status_code: enabled-by-generator because generator script in systemd environments hard-codes the marker files it writes out for generator enabled vs disabled flag files. Then our python code observes the configured run_path override and finds neither /myrun1/cloud-init/enabled nor /myrun1/cloud-init/disabled and reports unknown.
systemd/cloud-init-generator-tmpl awareness of cloud.cfg::system_info::paths::run_dir
tools/(hook-hotplug|cloud-init-hotplugd) and systemd/cloud-init-hotplugd.socket awareness of cloud.cfg::system_info::paths::run_dir resulting in both hard-coded FIFO @ default run_dir path /run/cloud-init/hook-hotplug-cmd as well as skipped hotplug events because hook-hotplug never recognized cloud-init as is_finished

We may be able to avoid the incomplete cloud-init status issue in this PR by also checking the for <default_run_dir>/enabled or <default_run_dir>/disabled files. But "the right fix" would be to somehow address this completely in the cloud-init-generator script and not band-aid a solution for cloud-init status to fallback to defaults. Ideally a ds-identify fix though would require smarter YAML processing to avoid false positives, and I don't want to introduce that here this more specific fix.

Given that overriding system_config::run_dir is generally a non-standard approach outside of BSD, and the BSD-cases are aligned with ds-identify, I think we are good to defer working those 3 issues separately for more complete support of run_dir overrides as lower-weight tasks.

Apr 01 '24 19:04 blackboxsw

Thanks for the review @blackboxsw.

Incomplete cloud-init status --format=yaml boot_status_code: unknowninstead of boot_status_code: enabled-by-generator because generator script in systemd environments hard-codes the marker files it writes out for generator enabled vs disabled flag files. Then our python code observes the configured run_path override and finds neither /myrun1/cloud-init/enabled nor /myrun1/cloud-init/disabled and reports unknown.

systemd/cloud-init-generator-tmpl awareness of cloud.cfg::system_info::paths::run_dir

tools/(hook-hotplug|cloud-init-hotplugd) and systemd/cloud-init-hotplugd.socket awareness of cloud.cfg::system_info::paths::run_dir resulting in both hard-coded FIFO @ default run_dir path /run/cloud-init/hook-hotplug-cmd as well as skipped hotplug events because hook-hotplug never recognized cloud-init as is_finished

All of these are systemd-specific code-paths, and systemd explicitly requires an ephemeral /run directory, so I don't think there is anything to fix related to these three points.

Apr 01 '24 19:04 holmanb

All of these are systemd-specific code-paths, and systemd explicitly requires an ephemeral /run directory, so I don't think there is anything to fix related to these three points.

Good point. Though some image could set run_dir: /run/mydir/ which would adhere to systemd's requiements, but still present a specialized run_dir in configuration. In these cases, some artifacts will still be placed in /run/cloud-init/ and others in /run/mydir/cloud-init leading to those errors I mentioned. But, I think me belaboring this point is really a limited return on investment for both of us because I honestly don't think there is a real use-case for changing the run_dir outside of our FreeBSD case at the moment. Given that non-Linux will align w/ /var/run we should be in good shape until a second use-case presents itself where modification of run_dir is necessary for some other reason.

Apr 01 '24 23:04 blackboxsw

Force pushed to resolve merge conflicts.

Apr 02 '24 13:04 holmanb