lightning icon indicating copy to clipboard operation
lightning copied to clipboard

ci: Flaky tests are slowing down the process of merging

Open cdecker opened this issue 1 year ago • 6 comments

It's time to name them and shame them, as the work of the maintainer, or release captain is suffering from having to restart test runs over and over again:

Test Name Runs Failures Flakyness
test_onchain_their_unilateral_out[True] 143 57 28.50%
test_wss_proxy 164 47 22.27%
test_rbf_reconnect_tx_construct 25 6 19.35%
test_penalty_htlc_tx_timeout[True] 120 25 17.24%
test_penalty_htlc_tx_fulfill[True] 123 25 16.89%
test_penalty_outhtlc[True] 141 25 15.06%
test_penalty_rbf_normal[True] 142 25 14.97%
test_penalty_inhtlc[True] 147 25 14.53%
test_onchain_middleman_their_unilateral_in[True] 150 25 14.29%
test_onchain_timeout[True] 150 25 14.29%
test_onchain_middleman_simple[True] 152 25 14.12%
test_anchorspend_using_to_remote[True] 142 22 13.41%

If your test is listed here, please go and stabilize it, the maintainers will be thankful. If not we might have to disable the tests temporarily until they are more stable.

cdecker avatar May 14 '24 10:05 cdecker

Also there are still tests failing because of port binding: Failed to bind socket for 127.0.0.1:44971: Address already in use unfortunately every fix i tried made it worse

daywalker90 avatar May 15 '24 12:05 daywalker90

So for the port-binding we usually use the ephemeral_port_reserve package (or similar) that pre-binds a random port, and we can then take control of it by using the SO_REUSEADDR (allowing us to grab the port despite it being in a WAIT state). This means we can use the OS to distribute unique ports. We usually run into port conflicts when not using the reserve method of the package and just use a random port (without giving the OS a chance to tell us that it is already being used). Another case is when we let too much time elapse between reservation and binding (as the port loses the WAIT status and may be re-assigned), so reserve briefly before using, and reserve a new port if you can't ensure short downtime on that bind.

cdecker avatar May 15 '24 14:05 cdecker

I think I fixed a test_penalty_htlc_tx_timeout[True] flake in https://github.com/ElementsProject/lightning/pull/7364...

rustyrussell avatar Jun 04 '24 07:06 rustyrussell

Excellent, looks quite good. Let's merge and I can update the table (strike the entry through) if we can see it settle in :+1:

cdecker avatar Jun 17 '24 10:06 cdecker

Here are the current outcomes for the branch of #7364:

select count(*), outcome from testruns where github_ref LIKE '%7364%' AND testname LIKE '%penalty_htlc_tx_time%' group by outcome;
22|2 # Success
2|3 # Failure
4|4 # Skipped

So it looks like it is partially fixed (~10% failure rate compared to the above 17%), but not stable.

cdecker avatar Jun 17 '24 11:06 cdecker

A quick update on the stats, some got better some got worse:

Test Name Runs Failures Flakyness
test_rbf_reconnect_tx_construct 10 5 33.33%
test_grpc_connect_notification 63 26 29.21%
test_wss_proxy 68 22 24.44%
test_anchorspend_using_to_remote[True] 54 17 23.94%
test_onchain_their_unilateral_out[True] 60 12 16.67%
test_penalty_htlc_tx_fulfill[True] 50 10 16.67%
test_penalty_htlc_tx_timeout[True] 50 10 16.67%
test_penalty_outhtlc[True] 59 10 14.49%
test_penalty_rbf_normal[True] 59 10 14.49%
test_onchain_middleman_simple[True] 60 10 14.29%
test_onchain_middleman_their_unilateral_in[True] 60 10 14.29%
test_onchain_timeout[True] 60 10 14.29%
test_penalty_inhtlc[True] 60 10 14.29%

Notice that we are not doing as many tests, likely because of the release feature freeze last month.

cdecker avatar Jun 17 '24 13:06 cdecker