kala icon indicating copy to clipboard operation
kala copied to clipboard

Prevent deleted job to run

Open n10ty opened this issue 4 years ago • 12 comments

This PR solving #237

As you can see in the original code: first job run, then checked Is It exists in the cache. We don't do anything if the job deleted: neither try to run nor reschedule it.

n10ty avatar Jan 22 '21 20:01 n10ty

Sorry, although I have done some maintenance work on this repo in the past I have not had much time recently and @ajvb has been absent.

Two points:

  1. The lint check is failing. (If this is because of #249 excuse me and I'll take the time to pull that one in.)
  2. I would like to see two commits: one with a failing test that reproduces this scenario, and a commit with your changes that fixes the failing test.

That's maybe a lot to ask but it would make me more comfortable re merging this. Thanks in advance.

tooolbox avatar Mar 11 '21 08:03 tooolbox

This isn't a test or two commits :) only a logs that i stored from yesterday

./kala serve --jobdb=postgres --jobdb-address=localhost:5432/db?sslmode=disable --jobdb-username=postgres --jobdb-password=12345678 --persist-every=360 INFO[0247] Job backend::xxx:e9272252-1221-4e8c-5e65-be9c0420ce8f repeating in 9.587743219s INFO[0257] Job backend::xxx:e9272252-1221-4e8c-5e65-be9c0420ce8f repeating in 2m59.984205812s INFO[0437] Job backend::xxx:e9272252-1221-4e8c-5e65-be9c0420ce8f repeating in 2m59.980811475s INFO[0481] Deleting e9272252-1221-4e8c-5e65-be9c0420ce8f INFO[0481] Job backend::xxx:50368b08-99cb-4eca-4c4a-eb63aa2ae235 repeating in 9.496309062s INFO[0491] Job backend::xxx:50368b08-99cb-4eca-4c4a-eb63aa2ae235 repeating in 2m59.982445547s WARN[0617] Job backend::xxx with id e9272252-1221-4e8c-5e65-be9c0420ce8f ran, but seems to have been deleted from cache; results won't be persisted. INFO[0617] Job backend::xxx:e9272252-1221-4e8c-5e65-be9c0420ce8f repeating in 2m59.979871709s INFO[0671] Job backend::xxx:50368b08-99cb-4eca-4c4a-eb63aa2ae235 repeating in 2m59.971442338s WARN[0797] Job backend::xxx with id e9272252-1221-4e8c-5e65-be9c0420ce8f ran, but seems to have been deleted from cache; results won't be persisted.

git cherry-pick e6c8d30 [develop 573129d] Prevent deleted job to run Author: Andrii Mazurian [email protected] Date: Fri Jan 22 22:41:34 2021 +0200 3 files changed, 33 insertions(+), 10 deletions(-)

./kala serve --jobdb=postgres --jobdb-address=localhost:5432/db?sslmode=disable --jobdb-username=postgres --jobdb-password=12345678 --persist-every=360 INFO[0016] Job backend::xxx:6dcc0d4b-a9b8-4337-63b1-acc0a9e9162a repeating in 9.652610885s INFO[0025] Job backend::xxx:6dcc0d4b-a9b8-4337-63b1-acc0a9e9162a repeating in 2m59.969222396s INFO[0205] Job backend::xxx:6dcc0d4b-a9b8-4337-63b1-acc0a9e9162a repeating in 2m59.956198826s INFO[0253] Deleting 6dcc0d4b-a9b8-4337-63b1-acc0a9e9162a INFO[0253] Job backend::xxx:eaa973b1-121d-4e34-6c96-c99bd031f8e2 repeating in 9.80399487s INFO[0263] Job backend::xxx:eaa973b1-121d-4e34-6c96-c99bd031f8e2 repeating in 2m59.980989207s INFO[0385] Job backend::xxx with id 6dcc0d4b-a9b8-4337-63b1-acc0a9e9162a tried to run, but exited early because its deleted INFO[0443] Job backend::xxx:eaa973b1-121d-4e34-6c96-c99bd031f8e2 repeating in 2m59.982147681s

josejuanmontiel avatar Mar 11 '21 09:03 josejuanmontiel

@tooolbox @n10ty When are we gonna get this change in? I'm currently sort of blocked by this.

OUCHUNYU avatar Nov 18 '21 03:11 OUCHUNYU

Actually I found an issue @n10ty if you have a job running, say once every 5 min, before the next job runs, stop the server, and wait for > 5 min, restart the server, you will see the job runs before the job is added to the cache, therefore you will not have the job run at all. I moved err := c.Set(j) before

if j.ShouldStartWaiting() {
        j.StartWaiting(c, false)
}

Then everything works as expected.

OUCHUNYU avatar Nov 18 '21 23:11 OUCHUNYU

I pulled in the lint-fixing PR from @josejuanmontiel

Since the logic for dealing with jobs has become quite complex, the only sane path forward is to have unit tests for every issue, showing how it's broken--and then fixed by a patch.

tooolbox avatar Nov 19 '21 20:11 tooolbox

Sorry, guys, I don't have enough time to work on it. I'm using forked version in production. Even now I don't really like how it works: after deletion job still remains in cache and just skips during next run. Feel free to make changes and add test to cover all cases. Also, I'm thinking about moving to GCP scheduler in future.

n10ty avatar Nov 22 '21 18:11 n10ty

Hi, i think the test in #255 should do the trick to show the error hope this could be sufficient... @tooolbox

josejuanmontiel avatar Dec 23 '21 00:12 josejuanmontiel

Hi @tooolbox now with lint errors in #255 resolved ... happy new year ;)

josejuanmontiel avatar Jan 09 '22 22:01 josejuanmontiel

Hey @josejuanmontiel I think you fixed the linting in your last PR, if you want to update this one?

tooolbox avatar Feb 26 '22 21:02 tooolbox

Hi @tooolbox this PR (#245 it's owned by @n10ty) the logs to review the link has expired... in #255 there are only the test that now fails ... and later (with #245) works... i had checked in my repository in this branch https://github.com/josejuanmontiel/kala/commits/feature/prevent%2Btest where i cherry-pick the commits of this PR and the test of #255 and lint passed then https://github.com/josejuanmontiel/kala/actions/runs/1613584482 if you want.. and @n10ty don't mind i can increment #255 with #245... but this should pass the lint (sure the error was related with #249 .

josejuanmontiel avatar Mar 01 '22 23:03 josejuanmontiel

@josejuanmontiel

Sorry for the delay.

Sure, sounds good.

tooolbox avatar Jul 24 '22 05:07 tooolbox

@tooolbox take a look if you have time... this new pr as we said

Need extra work to be notice:

  • Upgrade linter (and the new improved suggested by it)
  • Upgrade circleci (some strange problem ... or not... maybe only falling test... but good pass go 1.15 to 1.7)
  • Fixing some race conditions in old test (now its test use more real cache, that run as real)
  • Do and undo other race condition (not needed finally)
  • Undo one lint about header timeout in startup (fail the integration_test)
  • Some TODO about race conditions that could be improved

Hope this finally make this done.

I offer (hope to have some free time for this) to help in the rest of PR

Kala its a good colection of good code, that need some care to be up-today

josejuanmontiel avatar Aug 04 '22 16:08 josejuanmontiel

Handled by #257

tooolbox avatar Sep 22 '22 04:09 tooolbox