
Investigate Flakey Specs On GitHub


Summary

We have periodic failures of system specs on GitHub that members of the team have been unable to reproduce locally. It appears to be happening only in system tests, and re-running the tests will generally fix the issue.

The team has a few hypotheses: GitHub uses a Linux environment to run the tests while the team is largely on Macs; GitHub's Capybara setup may differ; and we parallelize the tests using the knapsack gem.

Things to Consider

Team members used the rspec_n gem to run the flaky failing tests locally 50+ times and were unable to reproduce the failures.
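
For reference, a plain-shell equivalent of that kind of repeated run (the spec path and count here are placeholders; bin/rspec is the repo binstub that appears in the logs below):

$ for i in $(seq 1 50); do bin/rspec ./spec/system/purchase_system_spec.rb:99 || break; done

The || break stops the loop at the first failure, so a flake surfaces immediately instead of scrolling past.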

We added the rspec-retry gem to mitigate the problem, so to see the errors on GitHub you may want to remove the gem and push a branch to observe the periodic failures, in case it is an OS-dependent error that does not occur on a personal machine.

Here are some of the flaky tests from GitHub Actions. This incorporates work done as of March 23rd, and the list will be updated as successive runs happen on the test branch without rspec-retry on it:

./spec/system/purchase_system_spec.rb:99
./spec/system/donation_system_spec.rb:233
./spec/system/donation_system_spec.rb:259
./spec/system/request_system_spec.rb:140
./spec/system/request_system_spec.rb:150
./spec/system/partners/partner_dashboard_system_spec.rb:53

Criteria for Completion

  • [ ] Understand why these (and other) tests are periodically failing
  • [ ] Fix our GitHub Actions workflow so we no longer have these failures

seanmarcia avatar Jan 18 '22 16:01 seanmarcia

@seanmarcia have devs tried running these with the same seed that failed on GitHub? I know I've seen at least one flaky spec that seems to have been the result of hardcoding an ID, which would only fail if the specs ran in a particular order.
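
For example, re-running with the seed from a failed CI run (the seed below is a placeholder; use the value printed in that run's "Randomized with seed NNNN" line):

$ bin/rspec --seed 1234 ./spec/system/donation_system_spec.rb

Note that an ordering-dependent failure may only reproduce when the same set of files runs together under that seed, since the seed determines the shuffle of whatever examples are loaded.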

dorner avatar Jan 30 '22 00:01 dorner

I don't think so. This list is mostly from runs on GitHub Actions where someone says something like "The test failures are on tests unrelated to this PR." Then I grabbed the failing tests.

seanmarcia avatar Jan 30 '22 19:01 seanmarcia

@seanmarcia I feel like if we really want to fix these we're going to need some way of being notified when the rspec-retry gem encounters a failure. We can see it when it's printed, but that's a lot of logs to search through. There's no guarantee of us finding the error if we push a branch to GitHub.

The way I see it we have two options to have a hope of fixing this:

  1. Disable rspec-retry again and have a more active investigation for this.
  2. Put effort into having a notification sent when the retry happens - rspec-retry allows for callback code when it detects an error. We'd have to e.g. send a Slack notification via Ruby code (see the sketch below).

I'd lean more towards number 1 - it's annoying, but either we want to fix these or we don't. :)
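
A minimal sketch of option 2, using rspec-retry's retry_callback hook (which runs before each retry attempt); the Slack incoming-webhook URL is a placeholder:

# spec/spec_helper.rb (sketch) - notify Slack whenever rspec-retry retries an example
require "net/http"
require "json"

RSpec.configure do |config|
  config.verbose_retry = true
  config.display_try_failure_messages = true

  # rspec-retry invokes this proc before each retry attempt
  config.retry_callback = proc do |example|
    Net::HTTP.post(
      URI("https://hooks.slack.com/services/PLACEHOLDER"),
      { text: "Retrying flaky spec: #{example.location}" }.to_json,
      "Content-Type" => "application/json"
    )
  end
end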

dorner avatar Feb 11 '22 20:02 dorner

@seanmarcia what do you think?

dorner avatar Feb 20 '22 00:02 dorner

I'm hesitant to just remove the retry as it gives new contributors a bad signal. I think I'm going to push a branch with the gem removed during the next office hours and then just re-run the test suite 10 or so times and see what, if any, errors pop up on the branch.

seanmarcia avatar Feb 22 '22 04:02 seanmarcia

FWIW, running spec/system/distribution_system_spec.rb:375 (671f3cde) on my dev box today passed on 3rd try:

$ bin/rspec spec/system/distribution_system_spec.rb:375
Run options: include {:locations=>{"./spec/system/distribution_system_spec.rb"=>[375]}}

Randomized with seed 2712

Distributions
  via barcode entry
WARN: Screenshot could not be saved. `page.current_path` is empty.
WARN: Screenshot could not be saved. `page.current_path` is empty.
    allows users to add items via scanning them in by barcode
      
1st Try error in ./spec/system/distribution_system_spec.rb:375:

expected: "50"
     got: ""

(compared using ==)

      
RSpec::Retry: 2nd try ./spec/system/distribution_system_spec.rb:375
      
2nd Try error in ./spec/system/distribution_system_spec.rb:375:

expected: "50"
     got: ""

(compared using ==)

      RSpec::Retry: 3rd try ./spec/system/distribution_system_spec.rb:375

Finished in 20.3 seconds (files took 8.88 seconds to load)
1 example, 0 failures

Randomized with seed 2712

I cleared log/test.log before running; looking at it, I don't see any important differences in responses over the 3 cycles (some formatting for clarity):

Started GET "/db_1/distributions/new" for 127.0.0.1 at 2022-02-28 10:35:17 -0800
[...]
Completed 200 OK in 2169ms (Views: 2070.9ms | ActiveRecord: 63.2ms | Allocations: 1113665)

Started GET "/db_1/barcode_items/find.json?barcode_item[value]=501531936088" for 127.0.0.1 at 2022-02-28 10:35:20 -0800
[...]
Completed 200 OK in 26ms (Views: 0.1ms | ActiveRecord: 16.8ms | Allocations: 9901)

Started GET "/db_1/storage_locations//inventory.json?include_inactive_items=false&include_omitted_items=false" for 127.0.0.1 at 2022-02-28 10:35:20 -0800
[...]
Completed 404 Not Found in 7ms (Views: 1.9ms | ActiveRecord: 0.8ms | Allocations: 2553)


Started GET "/db_2/distributions/new" for 127.0.0.1 at 2022-02-28 10:35:21 -0800
[...]
Completed 200 OK in 52ms (Views: 27.8ms | ActiveRecord: 20.9ms | Allocations: 17727)

Started GET "/db_2/barcode_items/find.json?barcode_item[value]=251366020634" for 127.0.0.1 at 2022-02-28 10:35:21 -0800
[...]
Completed 200 OK in 6ms (Views: 0.1ms | ActiveRecord: 1.1ms | Allocations: 2560)

Started GET "/db_2/storage_locations//inventory.json?include_inactive_items=false&include_omitted_items=false" for 127.0.0.1 at 2022-02-28 10:35:21 -0800
[...]
Completed 404 Not Found in 5ms (Views: 0.4ms | ActiveRecord: 0.8ms | Allocations: 1737)


Started GET "/db_3/distributions/new" for 127.0.0.1 at 2022-02-28 10:35:23 -0800
[...]
Completed 200 OK in 53ms (Views: 29.2ms | ActiveRecord: 20.2ms | Allocations: 17364)

Started GET "/db_3/barcode_items/find.json?barcode_item[value]=604105417551" for 127.0.0.1 at 2022-02-28 10:35:23 -0800
[...]
Completed 200 OK in 9ms (Views: 0.1ms | ActiveRecord: 3.4ms | Allocations: 4740)

Started GET "/db_3/storage_locations//inventory.json?include_inactive_items=false&include_omitted_items=false" for 127.0.0.1 at 2022-02-28 10:35:23 -0800
[...]
Completed 404 Not Found in 6ms (Views: 0.3ms | ActiveRecord: 1.6ms | Allocations: 3603)

The timing on the 3rd (i.e., passing) cycle appears similar to the 2nd, i.e., all 3 requests start within the same second.

So I'm thinking there's some delay in the client (Capybara/Selenium?) side? 🤷‍♂️

bobmazanec avatar Feb 28 '22 19:02 bobmazanec

But... the same failure occurs after increasing the wait time:

using_wait_time(3) do
  qty = page.find [...]

  expect(qty). [...]
end

as well as after both increasing the wait time (again!) and making the expectation more explicit (hoping to get more/different Capybara waiting/retry behavior):

      using_wait_time(4) do
        expect { Barcode.boop(@existing_barcode.value) }
          .to change { page.find_by_id("distribution_line_items_attributes_0_quantity").value }
          .to @existing_barcode.quantity.to_s
      end
[...]
Failures:

  1) Distributions via barcode entry allows users to add items via scanning them in by barcode
     Failure/Error:
       expect { Barcode.boop(@existing_barcode.value) }
         .to(change { page.find_by_id("distribution_line_items_attributes_0_quantity").value })
         .to { @existing_barcode.quantity.to_s }
     
       expected `page.find_by_id("distribution_line_items_attributes_0_quantity").value` to have changed to "50" , but did not change

🤷‍♂️
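
One more Capybara-idiomatic variant that may be worth trying: the have_field matcher re-polls the field's value itself until it matches, whereas find only waits for the element to exist and the change matcher reads its block just once before and once after. A sketch reusing the field id from the failure above:

# have_field keeps retrying, up to Capybara.default_max_wait_time, until the
# input's value equals the expected string, rather than snapshotting it once.
Barcode.boop(@existing_barcode.value)
expect(page).to have_field(
  "distribution_line_items_attributes_0_quantity",
  with: @existing_barcode.quantity.to_s
)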

bobmazanec avatar Feb 28 '22 19:02 bobmazanec

It seems unlikely that the specs/examples listed above all share some common cause(s), especially the request spec. Maybe create issues for each "flaky" example, or at least each spec file, lest we have an Issue that can "never" be closed?

bobmazanec avatar Mar 01 '22 15:03 bobmazanec

We have a few different open issues related to flaky specs and I'm wondering if it is an issue not related to the code... or not "just" related to the code. I've noticed other folks I follow on Twitter complaining about similar things, e.g.: https://twitter.com/ryanbigg/status/1496971669113499682 Maybe the issues we've been having are related to WebMock or chromedriver, but it seems like they might be related to another issue somewhere.

Also wonder if there is a larger thread/discussion related to this somewhere where we can follow for a potential solution.

seanmarcia avatar Mar 23 '22 13:03 seanmarcia

Flakiness of system tests is nothing new. :) Whenever you start messing with "real" browsers you get randomized issues like this unfortunately. Some of it might be due to the CI environment, or some weird timing or communication issues. Honestly I think I'd be happy if we leave the rspec-retry in there but only apply it to system tests.
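
A minimal sketch of that scoping, using rspec-retry's run_with_retry around-hook API (the retry count of 3 is an assumption):

# spec/spec_helper.rb (sketch) - retry only system specs, fail everything else fast
require "rspec/retry"

RSpec.configure do |config|
  config.verbose_retry = true

  # only examples tagged type: :system get the retry behavior
  config.around :each, type: :system do |example|
    example.run_with_retry retry: 3
  end
end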

dorner avatar Mar 23 '22 14:03 dorner

Created a branch, #2836, that has rspec-retry disabled. The idea is to re-run the various test workflows multiple times to give us a chance to see and dig into failures and identify flakiness.

seanmarcia avatar Mar 23 '22 18:03 seanmarcia

I think we can safely close this; the only remaining flaky test is the one with the logo, which I'll create a more specific ticket for :)

edwinthinks avatar Sep 27 '22 14:09 edwinthinks