neoq
Garbage collecting completed jobs
What is an appropriate strategy for garbage collecting old jobs?
Gue deletes jobs on completion and suggests inserting them into a new table from a hook, but doesn't explain how to handle the edge case of a hook panic (the job will automatically be retried).
I like the concept of moving data into an archive table (as suggested in the Gue README), but I'd prefer that the framework manage it so that error conditions are handled well.
I think you're right that GC is best left to the library. This request has been sitting in my mental queue for a while.
How I'd like to proceed is:
- Be opinionated: make it "the neoq way" that jobs get moved to neoq_jobs_complete by default.
- Provide a configuration option that allows jobs to simply be purged instead of moved to neoq_jobs_complete.
Thoughts?
That sounds perfect to me. Should it be neoq_jobs_completed, past tense?
For actual GC of the completed table, I think it is fine to leave that to the user for now as an out-of-band task, and some users might want to keep it that way. For example, if delete permission were not available to neoq, then actual GC would fail and you would be back to some kind of error handling callback. It is also possible to provide an example neoq periodic job that does the completed GC. If that job were to fail, then job failure monitoring would kick in.
It seems needed for backwards compatibility to have an option to maintain the current behavior. I can see how there might be a use case where the existing behavior is preferred because everything is in just one table. Although that seems limited if the tables can be combined with UNION.
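To illustrate the UNION point, here is a minimal Go sketch that builds a query reading active and archived jobs as one result set. The table name neoq_jobs_completed is the proposal from this thread, not something that exists yet, and the allJobsQuery helper and the column names passed to it are hypothetical. It uses UNION ALL rather than UNION on the assumption that a job lives in exactly one of the two tables, so no de-duplication is needed.

```go
package main

import (
	"fmt"
	"strings"
)

// allJobsQuery is a hypothetical helper: it returns a SQL statement that
// selects the same columns from the active table and from the proposed
// archive table, then concatenates the rows with UNION ALL.
//
// The column names must be trusted identifiers chosen by the developer;
// they are interpolated directly and must never come from user input.
func allJobsQuery(columns []string) string {
	cols := strings.Join(columns, ", ")
	return fmt.Sprintf(
		"SELECT %s FROM neoq_jobs UNION ALL SELECT %s FROM neoq_jobs_completed",
		cols, cols,
	)
}
```

Both tables would need compatible column types for the selected columns, which should hold if the archive table mirrors the layout of the jobs table.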
> That sounds perfect to me. Should it be neoq_jobs_completed, past tense?
Yep!
> For actual GC of the completed table, I think it is fine to leave that to the user for now as an out-of-band task, and some users might want to keep it that way. For example, if delete permission were not available to neoq, then actual GC would fail and you would be back to some kind of error handling callback. It is also possible to provide an example neoq periodic job that does the completed GC. If that job were to fail, then job failure monitoring would kick in.
I'd like to give this more thought, but what I feel most inclined to do here is provide a WithJobArchive config option that would accept options such as MOVE, KEEP, and DESTROY.
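As a rough sketch of what such an option could look like, the Go snippet below uses the functional options pattern. Nothing here is an existing neoq API: WithJobArchive is only the name floated in this comment, and Config, ConfigOption, NewConfig, and the ArchiveMode constants are local stand-ins defined for illustration. Defaulting to the "move" behavior is also an assumption, based on the proposal earlier in the thread.

```go
package main

// ArchiveMode is a hypothetical setting that controls what happens to a
// job once it has completed successfully.
type ArchiveMode int

const (
	ArchiveMove    ArchiveMode = iota // move the job to the archive table
	ArchiveKeep                       // leave the job in the jobs table
	ArchiveDestroy                    // delete the job outright
)

// Config is a stand-in for the library's configuration struct.
type Config struct {
	JobArchive ArchiveMode
}

// ConfigOption mutates a Config; this is the usual functional-option shape.
type ConfigOption func(*Config)

// WithJobArchive selects the archive behavior for completed jobs.
func WithJobArchive(mode ArchiveMode) ConfigOption {
	return func(c *Config) {
		c.JobArchive = mode
	}
}

// NewConfig applies the given options on top of the assumed default,
// which is to move completed jobs into the archive table.
func NewConfig(opts ...ConfigOption) *Config {
	c := &Config{JobArchive: ArchiveMove}
	for _, opt := range opts {
		opt(c)
	}
	return c
}
```

With this shape, a user who wants everything to stay in one table would pass WithJobArchive(ArchiveKeep), and a user who wants jobs purged would pass WithJobArchive(ArchiveDestroy).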
This behavior would likely live outside the job's transaction, but be dependent upon its success. The reason being: if the job succeeds and its status is updated, we wouldn't want archival to affect the job's success. Failure to archive is a neoq problem, not a problem for the application embedding it.
> It seems needed for backwards compatibility to have an option to maintain the current behavior. I can see how there might be a use case where the existing behavior is preferred because everything is in just one table. Although that seems limited if the tables can be combined with UNION.
I don't feel a strong obligation relative to neoq behavior until it hits 1.0. I have some obligation relative to API stability at this point (I guaranteed no changes to the neoq.Neoq API before 1.0). Currently, my thinking about when it'll reach 1.0 is in another issue. That writeup probably needs to live somewhere else; perhaps in the Status section of the README. I'll give this some thought as well.
> This behavior would likely live outside the job's transaction, but be dependent upon its success. The reason being: if the job succeeds and its status is updated, we wouldn't want archival to affect the job's success. Failure to archive is a neoq problem, not a problem for the application embedding it.
I like that. At some point it is a user problem though, so the mechanisms for reporting the problem to the user need to be well defined. Logging at error level is a standard fallback, of course. But I use Sentry for error monitoring, and it would be nice to have a hook to send such errors there, although I could build that myself with a user-defined periodic job that would query for both jobs in the completed state and jobs not archived (at some point these errors fall back to logging if reporting to Sentry doesn't work).
Yeah I think a well-structured, error-level log is the bare minimum here. I'm hesitant to provide a use-case-specific hook for this, but conceptually, I think a hook is the best way to truly handle failure.
That brings us to a much broader topic of generalized hooks, which I'm personally not prepared to tackle without some outside help :)
Pulling this out of "In Progress", as I think better telemetry and/or event "hooks" should come first.