gue icon indicating copy to clipboard operation
gue copied to clipboard

example does not work finished_jobs_log missing

Open bjartek opened this issue 2 years ago • 10 comments

CREATE TABLE IF NOT EXISTS finished_jobs_log
(
    job_id      BIGSERIAL   NOT NULL PRIMARY KEY,
    run_at      TIMESTAMPTZ NOT NULL,
    type        TEXT        NOT NULL,
    queue       TEXT        NOT NULL
);

bjartek avatar Sep 02 '22 11:09 bjartek

Also when I ran the default example the finisher panics but it just makes the job run endlessly, should the hookFunc return an error?

bjartek avatar Sep 02 '22 11:09 bjartek

Thx for reporting, let me try to reproduce this with the test and if I'll be able - I'll fix this issue.

should the hookFunc return an error?

it should not by design, pls check hook type description - https://pkg.go.dev/github.com/vgarvardt/gue/v4#HookFunc

vgarvardt avatar Sep 02 '22 18:09 vgarvardt

Tried to reproduce an issue with a test and I can not - I see a record in the finished jobs log table and panic is not interrupting the job flow.

Pls check https://github.com/vgarvardt/gue/pull/90, maybe you have an idea how to reproduce the problem.

vgarvardt avatar Sep 02 '22 18:09 vgarvardt

One issue is that in the example code or in the ddl there is no sql for the finished logs table.

bjartek avatar Sep 03 '22 06:09 bjartek

Ah, this is just an example of hook usage, not the production-ready code.

vgarvardt avatar Sep 03 '22 09:09 vgarvardt

Ok, well if you add the finished job table to the ddl it will work better imho.

bjartek avatar Sep 03 '22 10:09 bjartek

I just experienced a similar problem. I set up a finished_jobs_log table and tweaked it a bit. However, my INSERT query in finishedJobsLog introduced an SQL error, so the j.tx.Commit() in j.Done() never occurs and the error causes a rollback. Therefore, the job record remains in-tact, thus causing the job to re-run infinitely. As there is no way that I can find to capture this error, I was not aware it was occurring until I stepped through many lines of code using the debugger.

To reproduce:

  1. Create finished_jobs_log table:
CREATE TABLE IF NOT EXISTS gue_finished_jobs
(
	job_id      TEXT        NOT NULL PRIMARY KEY,
	job_type    TEXT        NOT NULL,
	queue       TEXT        NOT NULL,
	run_at      TIMESTAMPTZ NOT NULL,
	finished_at TIMESTAMPTZ NOT NULL
);
  1. Modify the INSERT query in finishedJobsLog like so:

_, err = j.Tx().Exec(
	ctx,
	"INSERT INTO gue_finished_jobs (job_id,job_type,queue,run_at,finished_at) VALUES ($1, $2, $3, $4, now())",
	j.ID,
	j.Type,
	j.Queue,
	j.RunAt,
)
  1. Run the example

Using j.ID for job_id will cause the error - to fix use j.ID.String()

BillBuilt avatar Oct 02 '23 16:10 BillBuilt

@BillBuilt added new hook type that should help spotting an issue like this https://github.com/vgarvardt/gue/pull/224. Although failure already produces an error message, so having alert on error messages in log collector is good practice in general. Another good practice that may help avoid issues like this is tests, in your case covering happy path could help avoiding the issue.

vgarvardt avatar Oct 02 '23 19:10 vgarvardt

@vgarvardt Hi, thank you for this new hook. Could you suggest how to introduce backoff in this situation?

  1. In job execution something goes wrong in SQL query.
  2. Transaction becomes "failed" thus not committed.
  3. Job falls into "failed" state and executed indefinitely, without any backoff.

That's what I want to avoid as it is definitely not good when job keeps executing with errors without any delay.

I see one of the solutions is to use some counter in my local memory to keep track if job "failed" and delay next execution. How would you recommend to do this?

TechGeorgii avatar Oct 20 '23 05:10 TechGeorgii

@TechGeorgii in-memory counter will work for sure, but it means that you expect the logic to fail in a non-recoverable way.

I would go towards more general approach - make an SLI from produced error messages count and include it into Error Budget calculation - it should be drained very fast, given that failed job will get into the infinite loop. If you have smart CD process then error logs count metric should be used as a decision point for automatic rollback. Or sudden spike in the error logs count must trigger the critical alert.

The main idea is not to workaround the issue, but to prevent the issue from affecting the prod, or spot it and mitigate/hotfix as soon as possible.

vgarvardt avatar Oct 24 '23 06:10 vgarvardt