cloud-init
cloud-init copied to clipboard
Make exception handling explicit
Impetus
Following the transition to the new exit code (2) for easier user introspection of cloud-init, there was a request to audit current callsites for error conditions that are logged at too high or low of a logging level. This PR starts that effort, but only scratches the surface. Much more work needs to be done auditing log levels.
Objective
This PR is the result of an audit of exceptions of type Exception
. This audit had a simple goal: prevent cloud-init from swallowing exceptions. We don't want exceptions such as AttributeError
, IndexError
, KeyError
, TypeError
, ZeroDivisionError
to be ignored when the exception that we expect to handle is a failed filesystem operation - this would mask important errors at runtime.
Rules
To complete the objective, a set of rules was defined and implemented. They are described in one of the commit messages:
Require a minimum of one of the following in each Exception handler:
- Log a warning at level WARN or higher containing the exception type and message
- Re-raise the exception without redefining its type.
By following these rules, we can expect that every Exception
handler will either issue its own warning or raise to another exception which will raise a warning. By doing this, silently ignoring important classes of exceptions will not happen.
Any log of WARN
would suffice for ensuring that these exceptions are logged, however standardizing this message across the codebase simplifies code auditing and makes it clear to developers when they see this message what has happened.
Methodology
Inspect the code for sources of expected exceptions and add them above the Exception
handler, and adding a proper WARN
log to the existing Exception
exception. This allows us to flesh out any unexpected changes easily in integration testing without risking unintentional logic changes.
Reviewer notes
I recommend reading the individual commit messages while reviewing. This PR carries some risk of changing cloud-init logic, but it attempts to mitigate the risk by retaining Exception
(in most cases) while making all Exception handling into a recoverable error. I believe that the risk that this poses will (after thoroughly reviewed) be outweighed by the current risk we have of unintentionally silencing unexpected exceptions.
A few things to keep in mind while reviewing:
-
subp.subp()
throwsOSError
-
OSError
is thrown by many filesystem and operating system operations. It is used liberally in this PR. - I don't like the repetitive
LOG.warning("Unhandled...")
. Ideas are welcome for better messaging and maybe a helper function would make sense - maybe something inlog.py
? - Perhaps the rules above should probably go in a dev doc?
Merge type
- [ ] Squash merge using "Proposed Commit Message"
- [x] Rebase and merge unique commits. Requires commit messages per-commit each referencing the pull request number (#<PR_NUM>)
Thanks for the review @TheRealFalcon! Great feedback all around.
Your LOG.warning("Unhandled exception: %s", e) rule is a bit too rigid to me. There are a bunch of places that had something like:
That's fair. It made auditing the code easier to implement this way, but we don't need to be so strict. I've updated the rules in the description above to say:
Require a minimum of one of the following in each Exception handler:
- Log a warning at level WARN or higher containing the exception type and message
- Re-raise the exception without redefine its type.
I want to keep the rule list short so that it is easy to remember, while being obvious and clear enough that it doesn't require much interpretation when trying to write or review code. Let me know what you think, suggestions are welcome.
My other major concern is logging new warnings that will cause cloud-init to exit 2. There are a number of places (random example) where we'll get a new warning where we didn't before.
I have some concerns about that too. In the past we've allowed some behavior changes that are bug fixes. I would consider properly warning (and exit 2) about a previously unreported error condition to be a bug fix. There is risk of big unknowns that crop up from this change, but I'm happy to be slow and meticulous with this PR if that will help us build confidence about its direction and consequences.
I'm not sure I agree with these unhandled exception rules when it comes to CLI commands. The exception is being printed/handled using the error() call below this one, and does it in a more graceful way. I think a log message on the CLI like the following will just add more noise:
2024-05-01 18:10:39,200 - clean.py[WARNING]: Unhandled exception: ...
Agreed, applying the rule feels clunky here and doesn't improve the end user's experience. Unfortunately, writing to stderr
wouldn't be an appropriate way to log unhandled warnings outside of CLI commands. The short/obvious/clear thing breaks down when trying to draw a box around where cli commands start and end, primarily for the following two reasons:
- cloud-init imports code back from
cloudinit.cmd
into the rest of the codebase (read_cfg_paths
) - a big chunk of cloud-init's core runtime logic is implemented in
cloudinit.cmd.main
These are just symptoms of the codebase's haphazardous structure. Both of these points are undesirable for several reasons, but while 1) would be trivial to fix 2) would probably take some forethought and planning. I have some thoughts on this topic for how we can better structure the core codebase and how this would benefit the project in different ways, but I'll save that discussion for later.
For now maybe we add a rule that says something like: > - for code in cloudinit/cmd/ (excluding main.py and __init__.py): Print a message to stderr containing the exception type and message.
?
Maybe someday if we feel the urge to address 1) and 2) we can change the rule to say > - Code in cloudinit/cmd/: Print a message to stderr containing the exception type and message.
Thoughts?
From a high level, I think everything you said makes sense. Thanks for laying it all out. I just want to make sure I understand. By Log a warning at level WARN or higher containing the exception type and message
, you mean that we don't need a separate handler, correct? Something like this could be ok?
try:
do_something()
except Exception as e:
LOG.warning("oh no! %s", e)
recover()
If so, +1. If not, let me know what I'm missing.
If we want to have hard rules around this behavior, I think:
- Let's document it. Maybe our docs could use a section for such coding guidelines? (doesn't have to be this PR)
- I also want to be flexible where needed. I think as long as we have a comment explaining why such a rule doesn't apply (and a reviewer agrees), we should be ok with occasionally breaking the rules.
From a high level, I think everything you said makes sense. Thanks for laying it all out. I just want to make sure I understand. By
Log a warning at level WARN or higher containing the exception type and message
, you mean that we don't need a separate handler, correct? Something like this could be ok?try: do_something() except Exception as e: LOG.warning("oh no! %s", e) recover()
If so, +1. If not, let me know what I'm missing.
Yes, that would be just fine. I think we're on the same page about this.
If we want to have hard rules around this behavior, I think:
1. Let's document it. Maybe our docs could use a section for such coding guidelines? (doesn't have to be this PR)
+1 Agreed. I can do that in a follow-up PR
2. I also want to be flexible where needed. I think as long as we have a comment explaining why such a rule doesn't apply (and a reviewer agrees), we should be ok with occasionally breaking the rules.
+1 I think the rules are best followed in most cases, but exceptions should be allowable. Even in this PR I break this rule in hotplug_hook.py
. In that case it follows the spirit of the rule by eventually raising something if an unknown exception is caught, and it's probably not safe to reasonably assume that we know all possible exceptions that might be raised from that call site. Even if we got the exceptions right in this PR, future changes to any code called in that site might break things.
I do think that this PR a larger hurdle in risk of behavior changes than future code which follows these rules will have. That is because in future PRs, we won't have to worry so much about existing runtime exceptions which silently behave in a way that is currently expected by the user since any of those that exist in theory should get fleshed out by this PR.
Pre-merge requirements:
- Eliminate unnecessary
LOG.warning("Unhandled exception",)
-> @holmanb - Wait until after the 24.2 release conflicts again -> @holmanb
Follow-up:
- Document the rules
Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close.
If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging TheRealFalcon, and he will ensure that someone takes a look soon.
(If the pull request is closed and you would like to continue working on it, please do tag TheRealFalcon to reopen it.)
@TheRealFalcon I forgot to mention that I'd also be okay with patching this out of older releases to mitigate the risk of this change. If we do see new bugs shake out of cloud-init in Oracular as a result of this, the resulting bug fixes will still benefit old releases.
Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close.
If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging TheRealFalcon, and he will ensure that someone takes a look soon.
(If the pull request is closed and you would like to continue working on it, please do tag TheRealFalcon to reopen it.)
Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close.
If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging TheRealFalcon, and he will ensure that someone takes a look soon.
(If the pull request is closed and you would like to continue working on it, please do tag TheRealFalcon to reopen it.)
This has acquired some drift, as other changes have taken priority.
This is still on my list, however I'd like to add a helper to reduce the risk of new warnings introduced by this change by extending the version boundary concept.
setting to wip
label, blocked on https://github.com/canonical/cloud-init/pull/5573
Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close.
If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging TheRealFalcon, and he will ensure that someone takes a look soon.
(If the pull request is closed and you would like to continue working on it, please do tag TheRealFalcon to reopen it.)
Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close.
If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging TheRealFalcon, and he will ensure that someone takes a look soon.
(If the pull request is closed and you would like to continue working on it, please do tag TheRealFalcon to reopen it.)