
Keep locks after reading job status

mkindahl opened this pull request 2 years ago • 5 comments

When reading the job status table bgw_job_stat and then updating it, locks were released after the read, allowing a competing session to update the job status and trigger a concurrent-update error either in the session doing the update or in the scheduler. Since the scheduler does not recover after aborting with an error, this caused the background worker subsystem to stop and not start new jobs.

This commit fixes this by upgrading RowExclusiveLock to ShareRowExclusiveLock to ensure that no two sessions try to update the row at the same time, removing the initial speculative lock that was taken when a job status row could be added, and keeping the lock until the end of the transaction so that other sessions cannot update the row in the meantime. Since these updating transactions are short, this should not block other sessions for long.
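For context, here is a minimal sketch of the locking idea using PostgreSQL's generic C API rather than the actual TimescaleDB scan code: ShareRowExclusiveLock conflicts with itself, so competing updaters serialize, and closing the relation with NoLock keeps the lock for the rest of the transaction. The function name and the way the relation OID is obtained are illustrative assumptions, not the code from this patch.

```c
#include "postgres.h"

#include "access/table.h"
#include "storage/lockdefs.h"
#include "utils/rel.h"

/*
 * Illustrative sketch only: open the job-stats catalog with a
 * self-conflicting lock so no two sessions can update the same row
 * concurrently, and keep the lock until the transaction ends.
 */
static void
update_job_stat_locked(Oid job_stat_relid)
{
	/*
	 * ShareRowExclusiveLock conflicts with itself, so a competing session
	 * blocks here instead of racing us to the row and later failing with
	 * "tuple concurrently updated".
	 */
	Relation rel = table_open(job_stat_relid, ShareRowExclusiveLock);

	/* ... scan for the job's row and apply the update here ... */

	/*
	 * Closing with NoLock drops the relation reference but keeps the
	 * heavyweight lock until commit or abort, so no other session can
	 * update the row before this transaction finishes.
	 */
	table_close(rel, NoLock);
}
```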

The commit also sets the scheduler's restart time to 10 seconds, allowing it to restart after an error and preventing the background worker subsystem from stopping completely.
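As a rough illustration of the restart-time change, this is how a restart interval is expressed through PostgreSQL's background-worker API: a worker registered with bgw_restart_time set to a positive number of seconds is relaunched by the postmaster after an error exit, whereas BGW_NEVER_RESTART leaves it stopped. The registration flow, worker name, and entry-point function below are placeholders, not the actual TimescaleDB loader code.

```c
#include "postgres.h"

#include "postmaster/bgworker.h"

/*
 * Sketch: register a scheduler-like worker with a 10-second restart time
 * instead of BGW_NEVER_RESTART, so the postmaster relaunches it after an
 * error exit rather than leaving the job system dead.
 */
static void
register_scheduler_sketch(void)
{
	BackgroundWorker worker;

	memset(&worker, 0, sizeof(worker));
	worker.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
	worker.bgw_restart_time = 10;	/* seconds; previously BGW_NEVER_RESTART */
	snprintf(worker.bgw_name, BGW_MAXLEN, "scheduler (sketch)");
	snprintf(worker.bgw_library_name, BGW_MAXLEN, "timescaledb");
	/* placeholder entry point, not the real scheduler function name */
	snprintf(worker.bgw_function_name, BGW_MAXLEN, "scheduler_main_placeholder");
	RegisterBackgroundWorker(&worker);
}
```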

Fixes #4293

mkindahl (Jun 17 '22 11:06)

Codecov Report

Merging #4454 (27340c2) into main (f6dd55a) will decrease coverage by 0.04%. The diff coverage is 90.00%.

:exclamation: Current head 27340c2 differs from pull request most recent head 349991c. Consider uploading reports for the commit 349991c to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4454      +/-   ##
==========================================
- Coverage   90.89%   90.84%   -0.05%     
==========================================
  Files         224      221       -3     
  Lines       42427    40751    -1676     
==========================================
- Hits        38562    37021    -1541     
+ Misses       3865     3730     -135     
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/bgw/scheduler.c | 82.64% <0.00%> (+1.06%) | :arrow_up: |
| src/bgw/job.c | 93.04% <100.00%> (-1.38%) | :arrow_down: |
| src/bgw/job_stat.c | 89.23% <100.00%> (+2.37%) | :arrow_up: |
| tsl/src/nodes/compress_dml/compress_dml.c | 90.47% <0.00%> (-9.53%) | :arrow_down: |
| tsl/src/bgw_policy/retention_api.c | 88.88% <0.00%> (-5.40%) | :arrow_down: |
| src/hypercube.c | 93.69% <0.00%> (-3.81%) | :arrow_down: |
| src/loader/bgw_message_queue.c | 85.52% <0.00%> (-2.64%) | :arrow_down: |
| tsl/src/bgw_policy/compression_api.c | 83.13% <0.00%> (-2.59%) | :arrow_down: |
| tsl/test/src/test_dist_util.c | 85.71% <0.00%> (-1.79%) | :arrow_down: |
| tsl/src/fdw/deparse.c | 69.49% <0.00%> (-1.61%) | :arrow_down: |

... and 80 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update f6dd55a...349991c.

codecov[bot] (Jun 17 '22 12:06)

This patch solves the "tuple concurrently updated" errors, but there remains a bug where two identical rows with the same job_id are added, which causes the scheduler to exit. Adding a restart time for the scheduler prevents this from stopping the entire scheduling system, but understanding why it happens would be even better.

mkindahl (Jun 28 '22 13:06)

Thanks for working on this bug and finding a solution! I understand that the current approach does not entirely solve the underlying issue but that it should already provide a remedy to the issue we're facing. Any chances this could go into a release any time soon?

daniel-k (Aug 01 '22 12:08)

> Thanks for working on this bug and finding a solution! I understand that the current approach does not entirely solve the underlying issue but that it should already provide a remedy to the issue we're facing. Any chances this could go into a release any time soon?

@daniel-k I had to put this on hold temporarily. I'll see if I can pick it up this week. Does that mean you have managed to verify that the patch solves the issue you have?

mkindahl (Aug 01 '22 12:08)

> Does that mean you have managed to verify that the patch solves the issue you have?

Unfortunately no. Is there a build of the TSDB extension based on this branch so I could easily install it in our dev/testing system? And would I be able to go back to the old version afterwards? We're running self-hosted PG14 on Ubuntu 20.04 currently with TSDB 2.6.1.

daniel-k (Aug 01 '22 13:08)

@mkindahl I deployed the new 2.8.1 release a couple of days ago and our bug seems to be solved. Thanks a bunch!

daniel-k (Oct 13 '22 13:10)

@daniel-k Glad to hear that it worked. :)

mkindahl (Oct 17 '22 06:10)