Add additional Celery exception handling.
As ...
Austin - Network Automation Engineer
I want ...
We use Sentry for monitoring uncaught exceptions. As such, the currently documented solution of failing jobs by throwing an uncaught exception is missing some detail: it will often create unnecessary Sentry issues for any job that merely needs to be marked failed to alert the user of a failure condition the job encountered. We also make extensive use of Celery's Canvas features, such as sub-task groups and chains, and need to generate new tasks based on retry-able failure modes, so we want to avoid impacting that functionality.
So that ...
Allowing the Celery mechanism to update the Job Result status could provide additional details useful to operators and troubleshooters of failure events (both expected and unexpected). This would allow a job to explicitly mark a failure condition, rather than implicitly triggering a failure based on the log level used, such as `logger.error()` or `logger.exception()`.
I know this is done when...
I am able to write a job that captures these failure events and can customize the alerting based on the different types of failures reported by the Celery worker.
```python
from celery.exceptions import Ignore

def after_return(self, status, retval, task_id, args, kwargs, einfo):
    self.celery_task.update_state(
        state=JobResultStatusChoices.STATUS_FAILURE,
        meta={
            # einfo is Celery's ExceptionInfo (None on success)
            "exc_type": einfo.type.__name__ if einfo else "",
            "exc_message": str(einfo.exception) if einfo else "",
            "custom": "...",
        },
    )
    # Prevent Celery from overwriting the state we just set.
    raise Ignore()
```
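A stand-alone sketch of how such an `after_return` hook could behave, with the Celery and Nautobot pieces (`Ignore`, `JobResultStatusChoices`, the `celery_task` handle) stubbed out so the control flow can be exercised without a broker. This is illustrative only, not the actual Nautobot implementation:

```python
class Ignore(Exception):
    """Stub for celery.exceptions.Ignore: tells the worker not to store a new state."""

class JobResultStatusChoices:
    # Stub for the Nautobot status choices
    STATUS_FAILURE = "failure"

class StubCeleryTask:
    """Records update_state() calls instead of talking to a result backend."""
    def __init__(self):
        self.states = []

    def update_state(self, state, meta):
        self.states.append((state, meta))

class JobWithFailureHook:
    def __init__(self):
        self.celery_task = StubCeleryTask()

    def after_return(self, status, retval, task_id, args, kwargs, einfo):
        # Explicitly mark the JobResult failed with structured details for
        # operators, then raise Ignore so Celery does not overwrite the state.
        self.celery_task.update_state(
            state=JobResultStatusChoices.STATUS_FAILURE,
            meta={
                "exc_type": type(retval).__name__ if retval else "",
                "exc_message": str(retval) if retval else "",
                "custom": "...",
            },
        )
        raise Ignore()
```

With a stub in place, the hook can be exercised directly: calling `after_return` records the failure state and raises `Ignore`, which in a real worker stops Celery from marking the task `SUCCESS`/`FAILURE` on its own.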
I have attached a patch with the suggested modifications. celery-result-handling-patch.txt
Optional - Feature groups this request pertains to.
- [ ] Automation
- [ ] Circuits
- [ ] DCIM
- [ ] IPAM
- [ ] Misc (including Data Sources)
- [ ] Organization
- [X] Apps (and other Extensibility)
- [ ] Security (Secrets, etc)
- [ ] Image Management
- [ ] UI/UX
- [ ] Documentation
- [ ] Other (not directly a platform feature)
Database Changes
I am unsure whether database changes are required.
External Dependencies
To my knowledge, an additional import would be required for this:

```python
from celery.exceptions import Ignore, Reject
```

into `nautobot/extras/jobs.py`.
This was reported originally via NTC-1893
We need to do some testing to ensure we fully understand how this interacts with django-celery-results and our own state management in the Jobs feature.
To distill things, the ask is to allow logging to work as it does today, but with the ability to opt out of the automatic job-result state side effects and to have more explicit control over failure/exception state management.
Let's do a spike to investigate the implications with django-celery-results.
Distinguishing Failure Types: Uncaught Exceptions vs. Expected Job Failures
Consistent Job Failure API:
- Mark as failed (abort): stops execution on a runtime error.
- Mark as failed (continue): catches errors, handles or ignores them, logs the job as failed, and proceeds.
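A minimal sketch of what the two failure modes could look like from a job author's perspective. `fail()`, the `abort=` flag, and `JobAbort` are hypothetical names for illustration, not an existing Nautobot API:

```python
class JobAbort(Exception):
    """Hypothetical: raised to stop execution immediately (mark failed, abort)."""

class SketchJob:
    """Hypothetical job base class with explicit failure marking."""
    def __init__(self):
        self.failed = False
        self.failures = []

    def fail(self, message, abort=False):
        # Mark the JobResult failed explicitly, without relying on
        # logger.error() side effects or an uncaught exception.
        self.failed = True
        self.failures.append(message)
        if abort:
            raise JobAbort(message)

    def run(self):
        self.fail("device unreachable")                 # mark failed (continue)
        # ... execution continues; a later fatal condition aborts:
        self.fail("config push rejected", abort=True)   # mark failed (abort)
```

The "continue" mode lets a job accumulate multiple failure records and still finish its remaining work, while the "abort" mode short-circuits via an exception that the framework (not Sentry) would be expected to catch.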
Logging Enhancements:
- Opt-out option for the logger?
- New custom log levels such as `log_success` (`log_failure`, `log_exception`).
Note that `exception` is a standard Python logger method that logs at the ERROR level, not a distinct level (https://docs.python.org/3/library/logging.html#logging.Logger.exception). But adding a `failure` level between `warning` and `error` would be helpful.
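For reference, the standard library `logging` module already supports registering such an intermediate level; a minimal sketch, where `FAILURE = 35` is an assumed value sitting between WARNING (30) and ERROR (40):

```python
import logging

# Register a FAILURE level between WARNING (30) and ERROR (40).
FAILURE = 35
logging.addLevelName(FAILURE, "FAILURE")

def failure(self, message, *args, **kwargs):
    # Convenience method mirroring Logger.warning()/Logger.error().
    if self.isEnabledFor(FAILURE):
        self._log(FAILURE, message, args, **kwargs)

# Attach it to all loggers (monkey-patching Logger is the common pattern
# for custom levels; a Logger subclass via setLoggerClass also works).
logging.Logger.failure = failure

logger = logging.getLogger("nautobot.jobs.example")
```

Records emitted via `logger.failure(...)` would then carry `levelno == 35` and `levelname == "FAILURE"`, so the job-result machinery could key off the new level without conflating it with `error`.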