
Add job notifier - allow running jobs on failure

Open Copilot opened this issue 6 months ago • 2 comments

This PR lets a job trigger another job when it fails, so custom notifiers and cleanup procedures can be written as ordinary jobs.

What's Changed

New Job Notifier Type

Jobs can now run other jobs when they fail, using a new notifier of type job. A job notifier is declared under notifiers like any other notifier and is referenced from a job's on_error list:

notifiers:
  cleanup-notifier:
    type: job
    job: cleanup-job

jobs:
  backup-job:
    schedule: "0 2 * * *"
    tasks: ["backup-database"]
    on_error: ["email-admin", "cleanup-notifier"]  # Mix regular and job notifiers
  
  cleanup-job:
    schedule: "0 0 1 1 1"  # Placeholder only; this expression can still match
    enabled: false          # Disables scheduled runs; only triggered by failures
    tasks: ["cleanup-temp-files", "send-slack-alert"]

Key Features

  • Powerful Failure Handling: Run complex cleanup tasks, custom notifications, or recovery procedures when jobs fail
  • Infinite Loop Prevention: Job notifiers run target jobs with error notifications disabled to prevent cascading failures
  • Graceful Degradation: Missing target jobs are logged as warnings but don't crash the system
  • Full Backward Compatibility: Existing on_error notifiers continue to work unchanged
  • Clear API: Job notifiers are explicit and consistent with the existing notification architecture
  • Flexible Use Cases:
    • Custom notification systems (Slack, Teams, etc.)
    • Cleanup of temporary files or resources
    • Recovery procedures and rollback operations
    • Alerting systems with rich context
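
The loop-prevention and graceful-degradation behaviour listed above can be illustrated with a small in-memory model. This is a minimal sketch, not ckron's actual code: `makeRunner`, `runJob`, `notify`, and the `log` array are hypothetical names, and only the `notifyError` flag is taken from the description.

```javascript
// Hypothetical in-memory model; none of these names come from ckron's source.
const log = [];

function makeRunner(jobs, notifiers) {
  // Runs a job; on failure, dispatches its on_error notifiers unless suppressed.
  function runJob(name, { notifyError = true } = {}) {
    const job = jobs[name];
    try {
      job.run();
      log.push(`ok:${name}`);
    } catch (err) {
      log.push(`fail:${name}`);
      if (!notifyError) return; // loop prevention: triggered jobs notify nobody
      for (const id of job.on_error ?? []) notify(id, name);
    }
  }

  function notify(id, failedJob) {
    const notifier = notifiers[id];
    if (notifier.type !== "job") {
      log.push(`notify:${id}:${failedJob}`); // stand-in for email, Slack, etc.
      return;
    }
    if (!(notifier.job in jobs)) {
      log.push(`warn:missing:${notifier.job}`); // graceful degradation
      return;
    }
    runJob(notifier.job, { notifyError: false });
  }

  return runJob;
}

// A failing job whose job notifier points back at itself, plus a dangling target.
const runJob = makeRunner(
  { a: { run() { throw new Error("boom"); }, on_error: ["email", "loop", "ghost"] } },
  {
    email: { type: "email" },
    loop: { type: "job", job: "a" },
    ghost: { type: "job", job: "nope" },
  }
);
runJob("a");
console.log(log);
// [ 'fail:a', 'notify:email:a', 'fail:a', 'warn:missing:nope' ]
```

The second `fail:a` entry is the triggered run: it fails again but dispatches nothing, so the self-reference terminates after one extra run.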

Implementation Details

The change to the codebase is small:

  1. New Notifier Type: Added job notifier type that wraps and executes other jobs
  2. Schema Extension: Added job notifier schema with validation
  3. Job Resolution: Job notifiers can reference other jobs through a resolver pattern
  4. Safety Measures: Target jobs run with notifyError=false to prevent infinite loops
  5. Clean API: No confusing dual properties - everything goes through the standard on_error array
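
The resolver pattern and schema check from the list above might be combined roughly as follows. This is a sketch under assumptions: the class name `JobNotifier`, the `validateJobNotifier` helper, and the shape of the `run` method are hypothetical and are not taken from ckron's source. The resolver is passed in as a function so that the notifier can be constructed before the jobs it refers to exist.

```javascript
// Hypothetical names throughout; a sketch of the resolver idea, not ckron's source.
function validateJobNotifier(config) {
  // Minimal stand-in for a schema: { type: "job", job: <non-empty string> }
  if (config.type !== "job") throw new Error('type must be "job"');
  if (typeof config.job !== "string" || config.job === "") {
    throw new Error("job must be a non-empty string");
  }
  return config;
}

class JobNotifier {
  // resolveJob is injected so the target is looked up lazily, at notify time,
  // after every job in the configuration has been constructed.
  constructor(config, resolveJob) {
    this.target = validateJobNotifier(config).job;
    this.resolveJob = resolveJob;
  }

  notify() {
    const job = this.resolveJob(this.target);
    if (!job) {
      console.warn(`job notifier: target job "${this.target}" not found`);
      return false;
    }
    job.run({ notifyError: false }); // suppress the target's own on_error
    return true;
  }
}

// Usage with a hypothetical registry that is filled after the notifier is built.
const calls = [];
const registry = new Map();
const notifier = new JobNotifier({ type: "job", job: "cleanup-job" }, (name) => registry.get(name));
registry.set("cleanup-job", { run: (opts) => calls.push(opts) });
const ran = notifier.notify();
```

Because the lookup happens inside `notify`, the order in which notifiers and jobs are created does not matter in this sketch.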

Example Use Cases

Database Backup with Cleanup:

notifiers:
  backup-cleanup-notifier:
    type: job
    job: backup-cleanup

jobs:
  db-backup:
    schedule: "0 2 * * *"
    tasks: ["create-backup"]
    on_error: ["backup-cleanup-notifier"]
  
  backup-cleanup:
    enabled: false
    tasks: ["remove-partial-backup", "alert-ops-team"]

Application Deployment with Rollback:

notifiers:
  rollback-notifier:
    type: job
    job: rollback-deployment

jobs:
  deploy-app:
    schedule: "0 18 * * 1"  # Monday 6 PM
    tasks: ["deploy-to-prod"]
    on_error: ["email-ops", "rollback-notifier"]
  
  rollback-deployment:
    enabled: false
    tasks: ["revert-deployment", "notify-dev-team"]
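
Configurations like the two above depend on each job notifier naming a job that exists. A pre-flight check over an already-parsed configuration object could catch typos before any job fails. The function below is a hypothetical helper, not part of ckron; it also flags target jobs that define their own on_error, since the description states that those notifications are disabled when the job is triggered by a notifier.

```javascript
// Hypothetical lint over an already-parsed config object (no YAML parsing here).
function checkJobNotifiers(config) {
  const problems = [];
  const jobs = config.jobs ?? {};
  for (const [id, notifier] of Object.entries(config.notifiers ?? {})) {
    if (notifier.type !== "job") continue; // only job notifiers are checked
    const target = jobs[notifier.job];
    if (!target) {
      problems.push(`${id}: target job "${notifier.job}" is not defined`);
    } else if ((target.on_error ?? []).length > 0) {
      // Per the description, the target runs with error notifications disabled.
      problems.push(`${id}: on_error of "${notifier.job}" will not fire when triggered`);
    }
  }
  return problems;
}

// Example input: one misspelled target and one target with its own on_error.
const config = {
  notifiers: {
    "rollback-notifier": { type: "job", job: "rollback-deployment" },
    "typo-notifier": { type: "job", job: "rollback-deploymnet" },
  },
  jobs: {
    "deploy-app": { tasks: ["deploy-to-prod"], on_error: ["rollback-notifier"] },
    "rollback-deployment": {
      enabled: false,
      tasks: ["revert-deployment"],
      on_error: ["email-ops"],
    },
  },
};
const problems = checkJobNotifiers(config);
console.log(problems.join("\n"));
```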

This feature lets ckron express failure-handling patterns such as cleanup and rollback while leaving the existing on_error API unchanged.

Fixes #102.



Copilot avatar Sep 07 '25 17:09 Copilot

@copilot on_error and on_failure are synonymous, which makes the API very confusing. Why don't you implement a notifier class that wraps a job instead? I think disabling sub-notifications for failure jobs is smart, but I wish it were more explicit. I feel like some users will expect the sub-notifications to trigger as well, so maybe we should throw an error for now to prevent confusion.

nicomt avatar Sep 07 '25 19:09 nicomt


You're absolutely right about the API confusion between on_error and on_failure. I've refactored the implementation to use a job notifier approach instead.

Changes made:

  • Removed the on_failure property entirely
  • Created a new job notifier type that wraps and runs other jobs
  • Updated the API so failure jobs are configured as notifiers in the on_error array
  • This makes the behavior explicit and consistent with the existing architecture

New usage:

notifiers:
  cleanup-notifier:
    type: job
    job: cleanup-job

jobs:
  main-job:
    on_error: ["email-admin", "cleanup-notifier"]  # Mix regular and job notifiers

This addresses all your concerns:

  • No more confusing dual properties
  • Job execution is explicit (it's clearly a notifier)
  • Notification behavior is obvious (target jobs run with notifyError=false)
  • More flexible (can mix job notifiers with regular notifiers)

Commit: cac581a

Copilot avatar Sep 07 '25 19:09 Copilot