kala icon indicating copy to clipboard operation
kala copied to clipboard

Prevent deleted job to run (with test)

Open josejuanmontiel opened this issue 2 years ago • 2 comments

Following with comments in https://github.com/ajvb/kala/pull/245 this PR has cherry-pick the code and adding test (and now with https://github.com/ajvb/kala/pull/249 merged) should pass (the lint)

josejuanmontiel avatar Aug 02 '22 14:08 josejuanmontiel

Need extra work to be notice:

  • Cherry-pick https://github.com/ajvb/kala/pull/245 in this PR
  • Upgrade linter (and the new improved suggested by it)
  • Upgrade circleci (some strange problem ... or not... maybe only falling test... but good pass go 1.15 to 1.7)
  • Fixing some race conditions in old test (now its test use more real cache, that run as real)
  • Do and undo other race condition (not needed finally)
  • Undo one lint about header timeout in startup (fail the integration_test)
  • Some TODO about race conditions that could be improved

josejuanmontiel avatar Aug 04 '22 16:08 josejuanmontiel

Thanks @josejuanmontiel I will take a look.

tooolbox avatar Aug 10 '22 23:08 tooolbox

Hi @tooolbox the test (with race detection) sometimes detect another (TestRemoteJobBadStatus need a lock on final test too) other doesn't... check this https://app.circleci.com/pipelines/github/josejuanmontiel/kala?branch=feature%2Fprevent%2Btest same commit c860372 , first fail... sencond (31) works...

About "delete" i think we can talk about now...

but i'm going to start to play (https://github.com/equilibristofgo/sandbox/tree/feat/mutex_example/05_race_condition) with an aprox that do the things without need so much locks... i'll try for fun... PR coming soon :)

josejuanmontiel avatar Sep 05 '22 22:09 josejuanmontiel

Hi @tooolbox the test (with race detection) sometimes detect another (TestRemoteJobBadStatus need a lock on final test too) other doesn't... check this https://app.circleci.com/pipelines/github/josejuanmontiel/kala?branch=feature%2Fprevent%2Btest same commit https://github.com/ajvb/kala/commit/c8603729e03d14c72b334d2d0756fe3848453c64 , first fail... sencond (31) works...

Okay. What I understand is that TestRemoteJobBadStatus is racy because the same commit failed & succeeded on different runs, with no changes.

Seems like the easy fix is to make https://github.com/josejuanmontiel/kala/blob/c8603729e03d14c72b334d2d0756fe3848453c64/job/job_test.go#L1103 needs look like https://github.com/josejuanmontiel/kala/blob/c8603729e03d14c72b334d2d0756fe3848453c64/job/job_test.go#L1083 (correct me if I'm wrong).

About "delete" i think we can talk about now...

What is needed on "delete"?

but i'm going to start to play (https://github.com/equilibristofgo/sandbox/tree/feat/mutex_example/05_race_condition) with an aprox that do the things without need so much locks... i'll try for fun... PR coming soon :)

I see. In the meantime we should proceed with this PR, correct?

At this point I am going to re-review the code of this PR.

tooolbox avatar Sep 10 '22 23:09 tooolbox

Everything looks good to me, barring the global mutex. I think you were going to change it to a per-job mutex, correct?

I think it's not going to be trivial... if you want to try yourself ... i don't want to commit again until solve the problem

(However, if the problem is because of improper/lacking mutex usage in a test, it may be a better idea to fix the test?)

To solve the random ok in test, we need fix the test with lock arround the link where the test check the value... but it's not related with remove global lock.

Otherwise I think we are very close to merging.

I think i'm going to need more time to change the global for other kind/place of lock.

josejuanmontiel avatar Sep 12 '22 21:09 josejuanmontiel

@tooolbox what do you think... i can't do that this three test TestRemoteJobRunner TestRemoteJobBadStatus TestRemoteJobBadStatusSuccess works with the fix of the delete... and avoid race condition (that could be a false positive if you think... the race span a lot of gorutines with the code of the test that couldn't happen in the reality because the start of cache to solve the real delete)

Or global mutex... or place in another file and skip //go:build !race // +build !race ... and later try to refactor in another way.

Later... this night i upload one commit with this aproach.

josejuanmontiel avatar Sep 16 '22 14:09 josejuanmontiel

done... let's talk :) (meanwhile i'll think a way to remove all locks but... for other PR)

josejuanmontiel avatar Sep 16 '22 23:09 josejuanmontiel

Fascinating. So the Remote Job tests were racy, but not because of something within Kala itself.

I mean, it now looks somewhat obvious, but you handled it cleverly.

tooolbox avatar Sep 22 '22 04:09 tooolbox

Merged. Thanks for the excellent work and all of your valuable time :)

tooolbox avatar Sep 22 '22 04:09 tooolbox