timescaledb
Keep locks after reading job status
When reading the job status table bgw_job_stat and subsequently updating it, locks were released after the read, allowing a competing session to update the job status and trigger a "tuple concurrently updated" error, either in the session doing the update or in the scheduler. Since the scheduler does not recover after aborting with an error, this caused the background worker subsystem to stop and not start new jobs.

This commit fixes the problem by upgrading RowExclusiveLock to ShareRowExclusiveLock, ensuring that no two sessions try to update the row at the same time; by removing an initial speculative lock that was taken when a job status row could be added; and by keeping the lock until the end of the transaction so that other sessions cannot update the row in the meantime. Since these updating transactions are short, they should not block other sessions for long.
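The effect of the fix can be illustrated with a toy Python simulation (an illustrative sketch, not the actual C code in the scheduler): each thread models a session doing a read-modify-write of the job status row while holding a single exclusive lock for the whole operation, the analog of taking ShareRowExclusiveLock and keeping it until the end of the transaction. Releasing the lock between the read and the write would allow the "tuple concurrently updated" failure the commit describes.

```python
import threading

class JobStatTable:
    """Toy stand-in for a bgw_job_stat row (not the real C code)."""
    def __init__(self):
        self.version = 0              # analog of the tuple's version
        self.lock = threading.Lock()  # analog of ShareRowExclusiveLock

    def update_job_stat(self):
        # Hold the lock across both the read and the write, mirroring
        # the fix: the lock is kept until "end of transaction", so a
        # competing session blocks here instead of later failing with
        # a concurrent update error.
        with self.lock:
            seen = self.version       # read the job status
            # ... compute the new status ...
            assert seen == self.version, "tuple concurrently updated"
            self.version += 1         # write it back

table = JobStatTable()
threads = [threading.Thread(target=table.update_job_stat) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(table.version)  # all 8 updates serialize cleanly -> 8
```

If the `with self.lock:` block covered only the read, the assertion could fire under contention, which is the analog of the scheduler aborting before this patch.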
The commit also sets a restart time of 10 seconds for the scheduler, allowing it to restart on error and preventing the background worker subsystem from stopping completely.
Fixes #4293
Codecov Report
Merging #4454 (27340c2) into main (f6dd55a) will decrease coverage by 0.04%. The diff coverage is 90.00%.

:exclamation: Current head 27340c2 differs from pull request most recent head 349991c. Consider uploading reports for the commit 349991c to get more accurate results.
@@ Coverage Diff @@
## main #4454 +/- ##
==========================================
- Coverage 90.89% 90.84% -0.05%
==========================================
Files 224 221 -3
Lines 42427 40751 -1676
==========================================
- Hits 38562 37021 -1541
+ Misses 3865 3730 -135
Impacted Files | Coverage Δ |
---|---|
src/bgw/scheduler.c | 82.64% <0.00%> (+1.06%) :arrow_up: |
src/bgw/job.c | 93.04% <100.00%> (-1.38%) :arrow_down: |
src/bgw/job_stat.c | 89.23% <100.00%> (+2.37%) :arrow_up: |
tsl/src/nodes/compress_dml/compress_dml.c | 90.47% <0.00%> (-9.53%) :arrow_down: |
tsl/src/bgw_policy/retention_api.c | 88.88% <0.00%> (-5.40%) :arrow_down: |
src/hypercube.c | 93.69% <0.00%> (-3.81%) :arrow_down: |
src/loader/bgw_message_queue.c | 85.52% <0.00%> (-2.64%) :arrow_down: |
tsl/src/bgw_policy/compression_api.c | 83.13% <0.00%> (-2.59%) :arrow_down: |
tsl/test/src/test_dist_util.c | 85.71% <0.00%> (-1.79%) :arrow_down: |
tsl/src/fdw/deparse.c | 69.49% <0.00%> (-1.61%) :arrow_down: |
... and 80 more |
Continue to review full report at Codecov.
Legend:
Δ = absolute <relative> (impact)
ø = not affected
? = missing data
Last update f6dd55a...349991c.
This patch solves the "tuple concurrently updated" issue, but there remains a bug where two identical rows with the same job_id are added, which causes the scheduler to exit. Adding a restart time for the scheduler prevents this from stopping the entire scheduling system, but understanding why it happens would be even better.
Thanks for working on this bug and finding a solution! I understand that the current approach does not entirely solve the underlying issue but that it should already provide a remedy to the issue we're facing. Any chances this could go into a release any time soon?
@daniel-k I had to put this on hold temporarily. I'll see if I can pick it up this week. Does that mean you have managed to verify that the patch solves the issue you have?
Unfortunately no. Is there a build of the TSDB extension based on this branch so I could easily install it in our dev/testing system? And would I be able to go back to the old version afterwards? We're running self-hosted PG14 on Ubuntu 20.04 currently with TSDB 2.6.1.
@mkindahl I deployed the new 2.8.1 release a couple of days ago, and our bug seems to be solved. Thanks a bunch!
@daniel-k Glad to hear that it worked. :)