zebra icon indicating copy to clipboard operation
zebra copied to clipboard

feat(test): Check that the state service handles errors correctly when committing finalized blocks

Open upbqdn opened this issue 1 year ago • 1 comments

Motivation

We recently refactored the process of committing blocks to the finalized state. This PR tests if Zebra handles errors correctly.

Solution

In this PR, we trigger a failure when committing a block, and make sure the state service handles it as expected.

Reviewer Checklist

  • [ ] Will the PR name make sense to users?
    • [ ] Does it need extra CHANGELOG info? (new features, breaking changes, large changes)
  • [ ] Are the PR labels correct?
  • [ ] Does the code do what the ticket and PR says?
  • [ ] How do you know it works? Does it have tests?

Follow Up Work

We might need to do a similar test for the non-finalized state as well.

upbqdn avatar Oct 11 '22 00:10 upbqdn

The new test works well locally.

upbqdn avatar Oct 13 '22 23:10 upbqdn

The new test works well locally.

The new test also works ok on my machine.

But I discovered that it hangs when there is only 1 CPU available. Which should be OK, some tests require multiple cores.

But maybe it also hangs in CI when there are only 2 virtual CPU cores available? That might be a timing issue.

Here's how I tested it with just one core:

$ taskset 0x01 cargo test -- state_handles_error_correctly

I'm really not sure how to make progress on this one. We could update the ticket with the progress we made, and close the PR?

(Sorry, I realise this has happened a bit to your work lately, I wish I could be more helpful!)

teor2345 avatar Oct 19 '22 06:10 teor2345

We could also disable the test by default in CI using #[ignore], and merge it?

Then it will run locally when people ask for it, but it won't break CI or other weird test environments.

teor2345 avatar Oct 19 '22 06:10 teor2345

But maybe it also hangs in CI when there are only 2 virtual CPU cores available? That might be a timing issue.

I don't think it's a timing issue because it consistently it hangs in CI and passes locally.

We could create an issue for when larger runners become available, they're currently in public beta for enterprise: https://github.blog/changelog/2022-09-01-github-actions-larger-runners-are-now-in-public-beta

arya2 avatar Oct 19 '22 18:10 arya2

We could add some logging to each line in each thread, and see there it hangs in CI? But that could be a lot of work for a low-priority test.

@mpguerra what do you think? Should we work on this sync test more, or just close the PR and update the ticket with this draft branch?

teor2345 avatar Oct 19 '22 19:10 teor2345

@mpguerra what do you think? Should we work on this sync test more, or just close the PR and update the ticket with this draft branch?

Let's just close the PR and update with draft branch. Sorry @upbqdn :(

mpguerra avatar Oct 20 '22 09:10 mpguerra

I discovered that it hangs when there is only 1 CPU available.

You're right, running

taskset 0x1 cargo test -- state_handles_error_correctly

dead-locks since there' no CPU activity. However, running the test on two CPUs like this

taskset 0x3 cargo test -- state_handles_error_correctly

works fine. I was trying to find out where the dead-lock happens, but I didn't manage to figure it out.

upbqdn avatar Oct 20 '22 15:10 upbqdn

Let's just close the PR and update with draft branch.

Sure, I also referenced this PR in ticket #2654 so that it's easy to find the description and the comments. I also updated the description of this PR.

upbqdn avatar Oct 20 '22 15:10 upbqdn