Implement Abandoned Job Detection and Recovery
Parent Issue
https://github.com/dotCMS/core/issues/29474
Task
We need to enhance our job queue system to handle abandoned jobs: jobs that were interrupted by server crashes, network failures, or other unexpected issues and were left in an inconsistent state.
Objective:
Implement mechanisms to detect abandoned jobs and provide recovery strategies to ensure system reliability and data consistency.
Proposed Strategies:
- Job Heartbeats:
  - Implement periodic heartbeat updates for running jobs
  - Create a background process to identify jobs with stale heartbeats
- Timeout Mechanisms:
  - Add a `max_execution_time` field to job configurations
  - Implement a background process to check for jobs exceeding their maximum execution time
- Recovery Procedures:
  - Develop a recovery process
  - Identify jobs in inconsistent states and apply appropriate recovery actions
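The heartbeat and timeout checks could be combined into a single detection pass. The following is a minimal sketch, not the dotCMS job queue API: `AbandonedJobDetector`, `RunningJob`, `Abandoned`, `Reason`, and the `heartbeatTimeout` setting are hypothetical names, and `maxExecutionTime` stands in for the proposed `max_execution_time` field. It assumes Java 16 or later for records.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: every name here is illustrative, not part of the dotCMS API.
final class AbandonedJobDetector {

    enum Reason { STALE_HEARTBEAT, EXECUTION_TIMEOUT }

    // Minimal view of a running job; maxExecutionTime mirrors the proposed max_execution_time field.
    record RunningJob(String id, Instant startedAt, Instant lastHeartbeat, Duration maxExecutionTime) {}

    record Abandoned(String jobId, Reason reason) {}

    // Assumed setting: how long a job may go without a heartbeat before it counts as stale.
    private final Duration heartbeatTimeout;

    AbandonedJobDetector(Duration heartbeatTimeout) {
        this.heartbeatTimeout = heartbeatTimeout;
    }

    // Returns the jobs that look abandoned at the instant `now`.
    List<Abandoned> detect(List<RunningJob> running, Instant now) {
        List<Abandoned> result = new ArrayList<>();
        for (RunningJob job : running) {
            Duration sinceHeartbeat = Duration.between(job.lastHeartbeat(), now);
            Duration sinceStart = Duration.between(job.startedAt(), now);
            if (sinceHeartbeat.compareTo(heartbeatTimeout) > 0) {
                result.add(new Abandoned(job.id(), Reason.STALE_HEARTBEAT));
            } else if (sinceStart.compareTo(job.maxExecutionTime()) > 0) {
                result.add(new Abandoned(job.id(), Reason.EXECUTION_TIMEOUT));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        AbandonedJobDetector detector = new AbandonedJobDetector(Duration.ofSeconds(60));
        List<RunningJob> running = List.of(
            new RunningJob("healthy", now.minusSeconds(120), now.minusSeconds(5), Duration.ofMinutes(30)),
            new RunningJob("stale", now.minusSeconds(300), now.minusSeconds(200), Duration.ofMinutes(30)),
            new RunningJob("slow", now.minusSeconds(3600), now.minusSeconds(5), Duration.ofMinutes(30)));
        // "stale" has a heartbeat older than 60 s; "slow" has run longer than 30 minutes.
        detector.detect(running, now).forEach(System.out::println);
    }
}
```

Passing `now` as a parameter keeps the check deterministic in tests. The background process itself could call `detect` on a fixed schedule, for example through a `java.util.concurrent.ScheduledExecutorService`; the check interval and heartbeat timeout would need tuning so that the scan does not add noticeable load to the queue.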
Additional Considerations:
- Ensure that abandoned job recovery doesn't conflict with distributed locking mechanisms
- Consider the impact on job queue performance and optimize where necessary
- Evaluate and document any changes to the system's fault tolerance and high availability characteristics
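One way to keep recovery from racing with other nodes is to make the "claim for recovery" step an atomic compare-and-set on the job state, so that only one detector wins. The sketch below uses an in-memory `ConcurrentMap` as a stand-in for the job store; in a clustered deployment the same idea would presumably be a conditional update in the shared database. `RecoveryClaims`, the `State` values, and the retry-or-fail policy are all assumptions made for illustration; the issue does not define the recovery actions.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: an in-memory stand-in for a conditional state update in the job store.
final class RecoveryClaims {

    enum State { RUNNING, RECOVERING, PENDING, FAILED }

    private final ConcurrentMap<String, State> states = new ConcurrentHashMap<>();

    // Records a job as running (illustrative helper for the demo).
    void register(String jobId) {
        states.put(jobId, State.RUNNING);
    }

    // Atomically moves RUNNING -> RECOVERING; returns true for exactly one caller per job.
    boolean tryClaim(String jobId) {
        return states.replace(jobId, State.RUNNING, State.RECOVERING);
    }

    // Assumed policy: re-queue the job while retries remain, otherwise mark it failed.
    // Returns empty when another caller already claimed the job or it is not running.
    Optional<State> recover(String jobId, int attemptsSoFar, int maxRetries) {
        if (!tryClaim(jobId)) {
            return Optional.empty();
        }
        State next = attemptsSoFar < maxRetries ? State.PENDING : State.FAILED;
        states.put(jobId, next);
        return Optional.of(next);
    }

    public static void main(String[] args) {
        RecoveryClaims claims = new RecoveryClaims();
        claims.register("job-1");
        // The first call wins the claim and re-queues the job; the second finds nothing to claim.
        System.out.println(claims.recover("job-1", 0, 3));
        System.out.println(claims.recover("job-1", 0, 3));
    }
}
```

Because the claim succeeds only when the job is still in the `RUNNING` state, two nodes that detect the same abandoned job at the same time cannot both re-queue it. Whether this is sufficient alongside the existing distributed locks would still need to be evaluated against the actual locking implementation.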
Proposed Objective
Core Features
Proposed Priority
Priority 2 - Important
Acceptance Criteria
- System can detect jobs that have been abandoned due to server crashes or other issues
- Abandoned jobs are automatically handled according to configured recovery strategies
- All new functionality is covered by appropriate tests
- System performance is not significantly impacted by new abandoned job handling processes