aquadoggo icon indicating copy to clipboard operation
aquadoggo copied to clipboard

Solutions for race conditions in tests (and possibly outside of them too)

Open sandreae opened this issue 2 years ago • 1 comments

Turns out there were at least two places causing tests to sometimes fail, all quite interesting in the end. Although this PR was mainly about getting the tests working consistently, I actually ended up adding a little feature ;-)

Cross test pollution of mutably shared schema provider

  • static SCHEMA_PROVIDER: Lazy<Mutex<Vec<Schema>>> = Lazy::new(|| Mutex::new(Vec::new()));

    We're using the above to allow the GraphQL schema builder access to the current schemas, in particular in create_type_info. This works, but when running multiple tests concurrently I think SCHEMA_PROVIDER is being over-written to in one process before it is read in another.

    We take a MutexGuard in save_static_schemas() here but it is dropped as it goes out of scope and this leaves an opening where another lock can be taken out before it's read in the same test.

  • What we want to do then is keep the lock until we read again, we can return it, keeping it alive longer, but if we keep it too long we simply block our own read, and we can't pass it on because we don't have access to the async_graphql methods.

  • So then maybe we need a yielding mutex like tokio::sync::Mutex which wakes a future when a lock is released? But we can't cos it's async and create_type_info isn't.

SOLUTION: We can force the problematic tests to run in series: https://docs.rs/serial_test/0.4.0/serial_test/ :star:

Waiting for stuff to happen

  • it takes a little while for documents to be materialised into the db
  • it takes a little while more for new schema to be registered in the GraphQL server
  • in both these cases we were adding a small sleep in order to wait for the stuff to happen, but this isn't an ideal solution as it slows down the tests, and "how long is long enough"

SOLUTION: a new state messaging channel for all services :star:

closes: #247

📋 Checklist

  • [x] Add tests that cover your changes
  • [x] Add this PR to the Unreleased section in CHANGELOG.md
  • [x] Link this PR to any issues it closes
  • [x] New files contain a SPDX license header

sandreae avatar Nov 28 '22 13:11 sandreae

Codecov Report

Base: 93.95% // Head: 94.17% // Increases project coverage by +0.21% :tada:

Coverage data is based on head (6f069f5) compared to base (670281b). Patch coverage: 97.91% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #266      +/-   ##
==========================================
+ Coverage   93.95%   94.17%   +0.21%     
==========================================
  Files          59       59              
  Lines        4928     5027      +99     
==========================================
+ Hits         4630     4734     +104     
+ Misses        298      293       -5     
Impacted Files Coverage Δ
aquadoggo/src/node.rs 85.18% <66.66%> (+0.56%) :arrow_up:
aquadoggo/src/materializer/service.rs 95.41% <96.22%> (+0.52%) :arrow_up:
aquadoggo/src/graphql/client/dynamic_query.rs 91.01% <100.00%> (+0.14%) :arrow_up:
...quadoggo/src/graphql/client/dynamic_types/tests.rs 100.00% <100.00%> (ø)
aquadoggo/src/graphql/client/mutation.rs 100.00% <100.00%> (ø)
aquadoggo/src/graphql/client/static_query.rs 100.00% <100.00%> (ø)
aquadoggo/src/graphql/client/tests.rs 100.00% <100.00%> (ø)
aquadoggo/src/graphql/schema.rs 92.95% <100.00%> (+0.20%) :arrow_up:
aquadoggo/src/http/service.rs 95.34% <100.00%> (+0.47%) :arrow_up:
aquadoggo/src/manager.rs 96.00% <100.00%> (+0.16%) :arrow_up:
... and 7 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

codecov[bot] avatar Nov 30 '22 00:11 codecov[bot]

@sandreae, after some experimentation I thought it might be a good idea to keep this PR open for now and rather only merge the parts which fix the race conditions. You can find it here: https://github.com/p2panda/aquadoggo/pull/269 - What do you think? Is this a good idea?

I think it will make us get to what we want faster: Fix the annoying race conditions. The newly introduced feature very helpful and definitely more elegant, but is maybe not "solved" yet as we're both not happy with its API design. It doesn't give us much functionality outside of tests but already affects the API for other non-tests parts of the stack, making them harder to grasp. I'd rather merge the important bit for now and keep this open until we find something pretty!

Leaving your feature changes unmerged still keeps the second class of race conditions unfixed, but we can at least "control" them with bumping the waiting times. Also not beautiful, but it would come with no big changes for now.

We can keep this here open and make a new ticket which deals with introducing that feature? If you're happy with this I open #269 for review and create the ticket.

In any case, thanks so much for digging into all of that! What a ride!

adzialocha avatar Jan 24 '23 18:01 adzialocha

In this and the new branch I sadly had still the following failing test (as I've mentioned on the chat):

thread 'db::stores::entry::tests::paginated_log_entries::case_6' panicked at 'Send to store: Custom("Entry's claimed seq num of 1 does not match expected seq num of 23 for given public key and log")', /home/adz/.cargo/registry/src/github.com-1ecc6299db9ec823/p2panda-rs-0.6.0/src/test_utils/db/test_db.rs:246:26

adzialocha avatar Jan 24 '23 18:01 adzialocha

@sandreae, after some experimentation I thought it might be a good idea to keep this PR open for now and rather only merge the parts which fix the race conditions. You can find it here: #269 - What do you think? Is this a good idea?

I think it will make us get to what we want faster: Fix the annoying race conditions. The newly introduced feature very helpful and definitely more elegant, but is maybe not "solved" yet as we're both not happy with its API design. It doesn't give us much functionality outside of tests but already affects the API for other non-tests parts of the stack, making them harder to grasp. I'd rather merge the important bit for now and keep this open until we find something pretty!

Leaving your feature changes unmerged still keeps the second class of race conditions unfixed, but we can at least "control" them with bumping the waiting times. Also not beautiful, but it would come with no big changes for now.

We can keep this here open and make a new ticket which deals with introducing that feature? If you're happy with this I open #269 for review and create the ticket.

In any case, thanks so much for digging into all of that! What a ride!

Hey! Thanks for looking into this. I think what you're suggesting is a good move, it's a shame we might not use a lot of the code I did here, but maybe it was unavoidable, in any case it's good research which will feed into the next design phase!

sandreae avatar Jan 25 '23 10:01 sandreae

In this and the new branch I sadly had still the following failing test (as I've mentioned on the chat):

thread 'db::stores::entry::tests::paginated_log_entries::case_6' panicked at 'Send to store: Custom("Entry's claimed seq num of 1 does not match expected seq num of 23 for given public key and log")', /home/adz/.cargo/registry/src/github.com-1ecc6299db9ec823/p2panda-rs-0.6.0/src/test_utils/db/test_db.rs:246:26

Hmm... interesting.... let me see if I can reproduce it.

sandreae avatar Jan 25 '23 10:01 sandreae

In this and the new branch I sadly had still the following failing test (as I've mentioned on the chat):

thread 'db::stores::entry::tests::paginated_log_entries::case_6' panicked at 'Send to store: Custom("Entry's claimed seq num of 1 does not match expected seq num of 23 for given public key and log")', /home/adz/.cargo/registry/src/github.com-1ecc6299db9ec823/p2panda-rs-0.6.0/src/test_utils/db/test_db.rs:246:26

Hmm... interesting.... let me see if I can reproduce it.

Tried reproducing this but so far never seen a fail, ran it 20x or so.... :disappointed:

sandreae avatar Jan 25 '23 11:01 sandreae

Let's move this conversation to the new PR though :+1:

sandreae avatar Jan 25 '23 11:01 sandreae

I've created two new issues to tackle the separate topics:

  • https://github.com/p2panda/aquadoggo/issues/271
  • https://github.com/p2panda/aquadoggo/issues/270

adzialocha avatar Jan 25 '23 11:01 adzialocha

... it's a shame we might not use a lot of the code I did here

This is definitely not going to waste, think we got some really good learnings from this which will help us to keep on working on it :+1:

adzialocha avatar Jan 25 '23 11:01 adzialocha

I've closed this PR for now (keeping the branch intact if we still need it). One reason is that it is stale, the other is that I think we can tackle this issue from another perspective: Rather than changing our code for better tests we can introduce new test utilities.

My suggestion would be to have a "timeout helper" which checks for a condition every x milliseconds and times out if that condition wasn't reached after y seconds. Like that we can test async conditions to arrive eventually. I've seen that pattern in other projects as well and think that this is the way to go for at least test-related problems.

adzialocha avatar Jun 20 '23 11:06 adzialocha