janusgraph icon indicating copy to clipboard operation
janusgraph copied to clipboard

fix: handle BerkeleyJE DB interruption [tp-tests]

Open tien opened this issue 1 year ago • 22 comments

Currently, any interruption to BerkeleyJE DB will cause Janusgraph to stop working entirely & require a complete restart of Janusgraph. This PR will make it so interruption to BerkeleyJE DB will close & reopen the Berkeley's environment instead.

Fixes #2120

tien avatar Apr 29 '24 14:04 tien

CLA Signed

The committers listed above are authorized under a signed CLA.

  • :white_check_mark: login: tien / name: Tiến Nguyễn Khắc (4b86cbdde18ba02f197a1a08149f3272344cd3bf)

@porunov @li-boxuan struggling on how to test this, the test that I copied from #3990 pass but when I check it actually doesn't do anything.

tien avatar Apr 29 '24 16:04 tien

Update: I was able to write a test to consistently trigger the error n prove that my change should work.

tien avatar Apr 30 '24 10:04 tien

Have made a change to only re/initialize on interruption.

tien avatar Apr 30 '24 14:04 tien

Oops, forgot to enable my test since it works now, have done so.

tien avatar Apr 30 '24 23:04 tien

@li-boxuan @porunov all tests passed, can you guys take a look 👀? Keen to get this merged soon 💪.

tien avatar May 01 '24 14:05 tien

I am sorry but I seem to have missed some context. Do you know why @mad 's previous PR didn't pass the test, and what did you do differently to solve the problem?

li-boxuan avatar May 02 '24 03:05 li-boxuan

I am sorry but I seem to have missed some context. Do you know why @mad 's previous PR didn't pass the test, and what did you do differently to solve the problem?

@li-boxuan I'm not sure why that one didn't pass the test. I'm using a completely different approach in this PR that involves a lot less changes.

tien avatar May 02 '24 04:05 tien

Thanks for your contribution!

For proper testing, you need to disable test filtering here, here and here

To be completely correct, you still need to enabld this (supportsInterruption)

mad avatar May 02 '24 09:05 mad

Thank you @mad, I have committed the suggested changes.

tien avatar May 02 '24 14:05 tien

Thanks for your contribution!

For proper testing, you need to disable test filtering here, here and here

To be completely correct, you still need to enabld this (supportsInterruption)

@mad Look like doing this will actually fail the tests.

I think it's because BerkeleyJE DB technically still does not support interruption, as in it will throw an error upon being interrupted. We are only reopening the environment here so that future requests won't all fail.

tien avatar May 02 '24 22:05 tien

@mad @li-boxuan I have implemented the interruption retry logic.

tien avatar May 02 '24 22:05 tien

tp-test are failing, is there anyway I can run n debug these locally?

tien avatar May 03 '24 03:05 tien

is there anyway I can run n debug these locally?

I never figured out a great way to debug tp-tests. Others might be able to help. The way I do is to find the corresponding test suite in tinkerpop repository, and run all the tests from there. An adhoc way is to copy that test suite class (.java) file to janusgraph repo on your local machine, and run tests from there. This way, you can remove other unrelated test classes...

li-boxuan avatar May 03 '24 04:05 li-boxuan

I've integrated some of the other changes from @mad, so the interruption test suite should pass now. cc @mad @li-boxuan

tien avatar May 03 '24 22:05 tien

Hmm, look like there was one last flaky test in the interruption test.

This test parameter in the TraversalInterruptionTest & TraversalInterruptionComputerTest is failing sometime locally for me, which look like it did fail on the CI:

{"g_E_properties", (Function<GraphTraversalSource, GraphTraversal<?,?>>) g -> g.E(), (UnaryOperator<GraphTraversal<?,?>>) t -> t.properties()}

Will investigate, getting close, just 1 case left that need to pass 💪.

tien avatar May 04 '24 06:05 tien

Okay, I've made further changes, after which I haven't been able to reproduce any failure when running the test locally. Hopefully that means the flakiness has gone away 😅.

tien avatar May 04 '24 10:05 tien

Oops, forgot to check other tests beside interruption, made I mistake, have pushed a fix.

tien avatar May 04 '24 11:05 tien

Have made the best effort implementation, so BerkeleyJE DB environment will be reset after interruption & also that race condition around thread are covered so that the interruption test from tp-test will always works.

tien avatar May 05 '24 10:05 tien

@li-boxuan @porunov all tests passed :D

There's the coverage check, but I don't think I can get it any higher though, cuz a lot of the uncovered code are catch clauses. Which are impossible to trigger consistently.

tien avatar May 05 '24 21:05 tien

Codecov sometimes seems buggy and we don't know why. Don't worry about that. Congrats on passing all tests! @mad would you be able to take another look?

li-boxuan avatar May 07 '24 03:05 li-boxuan

Awesome, thanks for all the helps and supports guys 🙏

tien avatar May 07 '24 08:05 tien

I am gonna merge this following lazy merge consensus. Thank you @tien for fixing this!

li-boxuan avatar May 14 '24 07:05 li-boxuan

💚 All backports created successfully

Status Branch Result
v1.0

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation and see the Github Action logs for details

janusgraph-automations avatar May 14 '24 07:05 janusgraph-automations

After merging this PR tp-tests are not failing in master branch for berkleyje.

Error:  Errors: 
Error:  org.apache.tinkerpop.gremlin.process.traversal.TraversalInterruptionTest.shouldRespectThreadInterruptionInVertexStep[expectInterruption(g_V_outE)]
[INFO]   Run 1: PASS
Error:    Run 2: TraversalInterruptionTest>AbstractGremlinTest.tearDown:155 » LockTimeout (JE 18.3.12) Lock expired. Locker 1229014061 2150_main_Txn: waited for lock on database=_jeNameMap LockAddr:617727493 LSN=0x0/0x1167 type=WRITE grant=WAIT_NEW timeoutMillis=500 startTime=1715693374163 endTime=1715693374663

I will try to re-run those tests again. If they don't pass we should revert this commit.

porunov avatar May 14 '24 18:05 porunov

After merging this PR tp-tests are not failing in master branch for berkleyje.

Error:  Errors: 
Error:  org.apache.tinkerpop.gremlin.process.traversal.TraversalInterruptionTest.shouldRespectThreadInterruptionInVertexStep[expectInterruption(g_V_outE)]
[INFO]   Run 1: PASS
Error:    Run 2: TraversalInterruptionTest>AbstractGremlinTest.tearDown:155 » LockTimeout (JE 18.3.12) Lock expired. Locker 1229014061 2150_main_Txn: waited for lock on database=_jeNameMap LockAddr:617727493 LSN=0x0/0x1167 type=WRITE grant=WAIT_NEW timeoutMillis=500 startTime=1715693374163 endTime=1715693374663

I will try to re-run those tests again. If they don't pass we should revert this commit.

I'm pretty sure this is just a flaky error due to Berkeley DB can't handle many concurrent reads/writes, which now happen cuz we no longer exclude it from some of the tests.

Like I can see that this pass https://github.com/JanusGraph/janusgraph/actions/runs/9075171201/job/24965523610 But this run got the time-out error https://github.com/JanusGraph/janusgraph/actions/runs/9075171201/job/24965524860

Look like a whole other issue: #1623

tien avatar May 15 '24 00:05 tien

After merging this PR tp-tests are not failing in master branch for berkleyje.

Error:  Errors: 
Error:  org.apache.tinkerpop.gremlin.process.traversal.TraversalInterruptionTest.shouldRespectThreadInterruptionInVertexStep[expectInterruption(g_V_outE)]
[INFO]   Run 1: PASS
Error:    Run 2: TraversalInterruptionTest>AbstractGremlinTest.tearDown:155 » LockTimeout (JE 18.3.12) Lock expired. Locker 1229014061 2150_main_Txn: waited for lock on database=_jeNameMap LockAddr:617727493 LSN=0x0/0x1167 type=WRITE grant=WAIT_NEW timeoutMillis=500 startTime=1715693374163 endTime=1715693374663

I will try to re-run those tests again. If they don't pass we should revert this commit.

I'm pretty sure this is just a flaky error due to Berkeley DB can't handle many concurrent reads/writes, which now happen cuz we no longer exclude it from some of the tests.

Like I can see that this pass https://github.com/JanusGraph/janusgraph/actions/runs/9075171201/job/24965523610 But this run got the time-out error https://github.com/JanusGraph/janusgraph/actions/runs/9075171201/job/24965524860

Look like a whole other issue: #1623

Yes, tests for berkeleyje with Java 8 passed after restart, but failed again for Java 11. I restarted the job again for Java 11. We should probably increase those timeouts as I see timeoutMillis=500 which is probably isn't enough for berkeleyje in GitHub Actions sometimes.

porunov avatar May 15 '24 00:05 porunov

@porunov I can see that it failed again :(

It's like a flaky race condition where the test try to create an entirely new Berkeley DB graph when the previous one is still invalid/re-intializing (Probably due to the lock issue). This isn't handled by this PR: the BerkeleyDB store doesn't have control over other instances created before or after it.

This won't happen in production because you'll most likely & should have only one Janusgraph with Berkeley DB instance up and running.

Maybe we should just continue what we did before and disable this particular test for BerkeleyDB?

Or push on with the flaky test, It has always pass for me when I run this locally :p

tien avatar May 15 '24 02:05 tien

Though like you said @porunov increasing the lock time-out might just make this problem go away :p

tien avatar May 15 '24 05:05 tien