Autopopulate 2.0
This PR introduces significant changes to the logic of DataJoint's job reservation/orchestration scheme, namely the autopopulate mechanism.
The PR addresses the issue described in #1243, following proposed solution 1.
I have tested this new autopopulate 2.0 mechanism in some production pipeline settings, and it works great!
In short, the new logic is outlined below.
Enhancing the Jobs Table in DataJoint-Python
To address current limitations, we'll enhance the jobs table by introducing new job statuses and modifying the `populate()` logic. This approach aims to improve efficiency and maintain data freshness.
Modifying the Jobs Table
Expand the job statuses within the jobs table to include:
- `scheduled`: for jobs that are identified and queued for execution.
- `success`: to record jobs that have completed without errors.
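For concreteness, here is a sketch of how the jobs table definition might look with the expanded enum. This is an illustration based on the current built-in jobs table (whose status enum is `'reserved'`, `'error'`, `'ignore'`), not the PR's exact definition; several secondary attributes (user, host, pid, connection_id) are omitted for brevity.

```python
# Sketch only: the existing jobs table attributes plus the two proposed statuses.
job_table_definition = """
table_name          : varchar(255)   # class name of the target table
key_hash            : char(32)       # hash of the job's primary key
---
status              : enum('scheduled', 'reserved', 'success', 'error', 'ignore')
key = null          : blob           # the job's primary key
error_message = ""  : varchar(2047)  # error message if the job failed
timestamp = CURRENT_TIMESTAMP : timestamp
"""
```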
Dedicated `schedule_jobs` Step
Introduce a new, dedicated step called `schedule_jobs`. This method will be responsible for populating the jobs table with new entries marked as `scheduled`.
- Identifying New Jobs: This step will execute `(table.key_source - table).fetch("KEY")` to identify new jobs. While this operation can be computationally expensive, it mirrors the current approach for job discovery.
- Rate Limiting: To prevent excessive scheduling and resource consumption, `schedule_jobs` will include configurable rate-limiting logic. For instance, it can skip scheduling if the most recent scheduling event occurred within a defined time period (e.g., 10 seconds). A sketch follows this list.
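A minimal sketch of what this step could look like, assuming a `jobs` handle to the schema's jobs table and the trimmed definition above. The function name `schedule_jobs` comes from this PR, but the signature and the `min_interval` parameter are illustrative, not the PR's API; `datajoint.hash.key_hash` is the hash the built-in jobs table already uses.

```python
from datetime import datetime

from datajoint.hash import key_hash  # same hash the built-in jobs table uses


def schedule_jobs(table, jobs, min_interval=10):
    """Illustrative sketch: queue new work for `table` as 'scheduled' jobs."""
    name = table.table_name  # DataJoint tables expose their stored name
    # Rate limiting: skip the expensive key_source scan if a scheduling
    # pass for this table ran within the last `min_interval` seconds.
    stamps = (jobs & {"table_name": name}).fetch("timestamp")
    if len(stamps) and (datetime.utcnow() - max(stamps)).total_seconds() < min_interval:
        return
    # Job discovery, using the same query as today:
    for key in (table.key_source - table).fetch("KEY"):
        jobs.insert1(
            dict(table_name=name, key_hash=key_hash(key), status="scheduled", key=key),
            skip_duplicates=True,  # the key may already be scheduled or reserved
        )
```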
New `populate()` Logic
The `populate()` function will be updated to:

- Optional Scheduling: Optionally, `schedule_jobs` can be called at the beginning of the `populate()` process to ensure the jobs table is up-to-date before work commences.
- Fetching Scheduled Jobs: Instead of repeatedly hitting `key_source`, `populate()` will fetch keys directly from the jobs table that have a `scheduled` status.
- Execution and Status Update: For each retrieved key, `make()` will be called. Upon completion, the job's status in the jobs table will be updated to either `error` or `success`. A simplified sketch follows this list.
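Under the same assumptions as above (a `jobs` handle, illustrative signatures), the new flow might look like this; job reservation, transactions, and multiprocessing are deliberately omitted. The status update uses DataJoint's `update1`.

```python
def populate(table, jobs, schedule_first=True):
    """Illustrative sketch of the proposed populate() flow."""
    if schedule_first:
        schedule_jobs(table, jobs)  # optional up-front scheduling pass
    name = table.table_name
    # Work off the jobs table instead of re-querying key_source:
    todo = (jobs & {"table_name": name, "status": "scheduled"}).fetch(as_dict=True)
    for job in todo:
        pk = dict(table_name=name, key_hash=job["key_hash"])
        try:
            table.make(job["key"])  # same make() contract as today
        except Exception as err:
            jobs.update1(dict(pk, status="error", error_message=str(err)[:2047]))
        else:
            jobs.update1(dict(pk, status="success"))
```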
Addressing Stale or Out-of-Sync Jobs Data
The jobs table can become stale or out-of-sync if not updated frequently or if upstream data changes.
- Invalid Entries: If entries in upstream tables are deleted, existing entries in the jobs table might become "invalid." Similarly, if entries are deleted from the target table, `success` jobs can also become "invalid."
- `purge_invalid_jobs` Method: To handle this, a new `purge_invalid_jobs` method will be added. This method will identify and remove these invalid entries from the jobs table, ensuring data integrity. A sketch follows this list.
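A sketch of this cleanup under the same assumptions; the method name comes from this PR, everything else is illustrative. Note that the `key_source` scan here is the expensive half of the trade-off discussed below.

```python
def purge_invalid_jobs(table, jobs):
    """Illustrative sketch: remove jobs invalidated by upstream or target deletes."""
    name = table.table_name
    # Expensive: enumerate all currently valid keys from key_source.
    valid_hashes = {key_hash(k) for k in table.key_source.fetch("KEY")}
    for job in (jobs & {"table_name": name}).fetch(as_dict=True):
        upstream_gone = job["key_hash"] not in valid_hashes
        result_gone = job["status"] == "success" and len(table & job["key"]) == 0
        if upstream_gone or result_gone:
            (jobs & dict(table_name=name, key_hash=job["key_hash"])).delete_quick()
```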
Keeping the Jobs Table "Fresh"
Maintaining a "fresh" jobs table is crucial for efficient operations:
- Frequent Scheduling: Regularly running `schedule_jobs` will ensure that new tasks are promptly added to the queue.
- Frequent Purging: Regularly running `purge_invalid_jobs` will keep the table clean and free of irrelevant or invalid entries.
Trade-off: Both `schedule_jobs` and `purge_invalid_jobs` will involve hitting `key_source`, which can be resource-intensive. Users (or system administrators) will need to balance the desired level of "freshness" against the associated resource consumption to optimize performance. A small worker sketch follows.
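For example, a dedicated maintenance worker (hypothetical; not part of this PR) could own both calls on a fixed cadence, reducing the freshness-versus-cost trade-off to a single tunable interval:

```python
import time


def maintenance_worker(table, jobs, every_seconds=600):
    """Hypothetical helper: keep the jobs table fresh on a fixed cadence."""
    while True:
        schedule_jobs(table, jobs)       # both calls hit key_source,
        purge_invalid_jobs(table, jobs)  # so this interval is the cost knob
        time.sleep(every_seconds)
```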
For a more detailed description of the new logic, see here.