ci: Flaky tests are slowing down the process of merging
It's time to name and shame them, as maintainers and release captains are suffering from having to restart test runs over and over again:
| Test Name | Successes | Failures | Flakiness |
|---|---|---|---|
| test_onchain_their_unilateral_out[True] | 143 | 57 | 28.50% |
| test_wss_proxy | 164 | 47 | 22.27% |
| test_rbf_reconnect_tx_construct | 25 | 6 | 19.35% |
| test_penalty_htlc_tx_timeout[True] | 120 | 25 | 17.24% |
| test_penalty_htlc_tx_fulfill[True] | 123 | 25 | 16.89% |
| test_penalty_outhtlc[True] | 141 | 25 | 15.06% |
| test_penalty_rbf_normal[True] | 142 | 25 | 14.97% |
| test_penalty_inhtlc[True] | 147 | 25 | 14.53% |
| test_onchain_middleman_their_unilateral_in[True] | 150 | 25 | 14.29% |
| test_onchain_timeout[True] | 150 | 25 | 14.29% |
| test_onchain_middleman_simple[True] | 152 | 25 | 14.12% |
| test_anchorspend_using_to_remote[True] | 142 | 22 | 13.41% |
If your test is listed here, please go and stabilize it; the maintainers will be thankful. Otherwise we might have to disable those tests temporarily until they are more stable.
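For illustration, one way to temporarily disable or tolerate a flaky test in a pytest-based suite (a sketch only; the test names here are made up and this is not necessarily how this repo handles it) is to mark it rather than delete it:

```python
import pytest

# Hypothetical sketch: temporarily disable a known-flaky test rather than
# deleting it, with a reason pointing at the tracking issue.
@pytest.mark.skip(reason="flaky, temporarily disabled; see the flakiness table")
def test_example_flaky():
    ...

# A softer alternative: keep running the test but don't let its failures
# block CI (a pass is reported as XPASS, not an error).
@pytest.mark.xfail(reason="flaky", strict=False)
def test_example_flaky_tolerated():
    ...
```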
Also there are still tests failing because of port binding:

```
Failed to bind socket for 127.0.0.1:44971: Address already in use
```
Unfortunately, every fix I tried made it worse.
For the port binding we usually use the ephemeral_port_reserve package (or similar), which pre-binds a random port chosen by the OS; we can then take the port over by setting SO_REUSEADDR, which lets us bind it even though it is still in a wait state (TIME_WAIT). In other words, we let the OS hand out unique ports instead of guessing. Port conflicts usually happen when a test skips the reserve step and just picks a random port, without giving the OS a chance to tell us it is already in use. The other failure mode is letting too much time pass between reservation and binding, since the port eventually leaves the wait state and may be handed out again. So reserve a port shortly before using it, and reserve a fresh one if you cannot guarantee a short gap before the bind. A minimal sketch of this pattern is shown below.
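Here is a minimal sketch of that reserve-then-rebind pattern, assuming the Python ephemeral-port-reserve package and a plain TCP socket; the surrounding test-harness details are omitted and the exact helper a given test uses may differ:

```python
# Minimal sketch of the reservation pattern described above.
# Assumes the ephemeral-port-reserve package (pip install ephemeral-port-reserve).
import socket

from ephemeral_port_reserve import reserve

# Let the OS pick a free port. The helper briefly binds and connects to it,
# leaving it in TIME_WAIT so concurrent callers won't be handed the same port.
port = reserve('127.0.0.1')

# Bind shortly afterwards. SO_REUSEADDR lets us take the port over even
# though it is still in TIME_WAIT from the reservation step.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(('127.0.0.1', port))
sock.listen(1)
```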
I think I fixed a test_penalty_htlc_tx_timeout[True] flake in https://github.com/ElementsProject/lightning/pull/7364...
Excellent, looks quite good. Let's merge, and I can update the table (strike the entry through) once we see it settle in :+1:
Here are the current outcomes for the #7364 branch:

```
select count(*), outcome from testruns where github_ref LIKE '%7364%' AND testname LIKE '%penalty_htlc_tx_time%' group by outcome;
22|2  # Success
2|3   # Failure
4|4   # Skipped
```
So it looks like it is partially fixed (2 failures in 24 completed runs, roughly 10%, compared to the ~17% above), but not yet stable.
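As an aside, the flakiness percentages in these tables are consistent with failures / (successes + failures). A minimal sketch of that calculation, assuming the same kind of testruns table and the outcome codes shown above (2 = success, 3 = failure), could look like this:

```python
# Hypothetical sketch: recompute per-test flakiness from a testruns table,
# assuming outcome codes 2 = success and 3 = failure as in the query above.
import sqlite3

con = sqlite3.connect("testruns.db")  # assumed database file name
rows = con.execute(
    """
    SELECT testname,
           SUM(outcome = 2) AS successes,
           SUM(outcome = 3) AS failures
      FROM testruns
     GROUP BY testname
    HAVING SUM(outcome = 3) > 0
    """
).fetchall()

# Sort by flakiness, most flaky first, and print each entry as a table row.
stats = sorted(
    (100.0 * failures / (successes + failures), name, successes, failures)
    for name, successes, failures in rows
)
for flakiness, name, successes, failures in reversed(stats):
    print(f"| {name} | {successes} | {failures} | {flakiness:.2f}% |")
```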
A quick update on the stats: some got better, some got worse:
| Test Name | Successes | Failures | Flakiness |
|---|---|---|---|
| test_rbf_reconnect_tx_construct | 10 | 5 | 33.33% |
| test_grpc_connect_notification | 63 | 26 | 29.21% |
| test_wss_proxy | 68 | 22 | 24.44% |
| test_anchorspend_using_to_remote[True] | 54 | 17 | 23.94% |
| test_onchain_their_unilateral_out[True] | 60 | 12 | 16.67% |
| test_penalty_htlc_tx_fulfill[True] | 50 | 10 | 16.67% |
| test_penalty_htlc_tx_timeout[True] | 50 | 10 | 16.67% |
| test_penalty_outhtlc[True] | 59 | 10 | 14.49% |
| test_penalty_rbf_normal[True] | 59 | 10 | 14.49% |
| test_onchain_middleman_simple[True] | 60 | 10 | 14.29% |
| test_onchain_middleman_their_unilateral_in[True] | 60 | 10 | 14.29% |
| test_onchain_timeout[True] | 60 | 10 | 14.29% |
| test_penalty_inhtlc[True] | 60 | 10 | 14.29% |
Notice that we are not running as many tests, likely because of the release feature freeze last month.