lightning ci: Flaky tests are slowing down the process of merging

It's time to name them and shame them, as the work of the maintainer, or release captain is suffering from having to restart test runs over and over again:

Test Name	Runs	Failures	Flakyness
test_onchain_their_unilateral_out[True]	143	57	28.50%
test_wss_proxy	164	47	22.27%
test_rbf_reconnect_tx_construct	25	6	19.35%
test_penalty_htlc_tx_timeout[True]	120	25	17.24%
test_penalty_htlc_tx_fulfill[True]	123	25	16.89%
test_penalty_outhtlc[True]	141	25	15.06%
test_penalty_rbf_normal[True]	142	25	14.97%
test_penalty_inhtlc[True]	147	25	14.53%
test_onchain_middleman_their_unilateral_in[True]	150	25	14.29%
test_onchain_timeout[True]	150	25	14.29%
test_onchain_middleman_simple[True]	152	25	14.12%
test_anchorspend_using_to_remote[True]	142	22	13.41%

If your test is listed here, please go and stabilize it, the maintainers will be thankful. If not we might have to disable the tests temporarily until they are more stable.

May 14 '24 10:05 cdecker

Also there are still tests failing because of port binding: Failed to bind socket for 127.0.0.1:44971: Address already in use unfortunately every fix i tried made it worse

May 15 '24 12:05 daywalker90

So for the port-binding we usually use the ephemeral_port_reserve package (or similar) that pre-binds a random port, and we can then take control of it by using the SO_REUSEADDR (allowing us to grab the port despite it being in a WAIT state). This means we can use the OS to distribute unique ports. We usually run into port conflicts when not using the reserve method of the package and just use a random port (without giving the OS a chance to tell us that it is already being used). Another case is when we let too much time elapse between reservation and binding (as the port loses the WAIT status and may be re-assigned), so reserve briefly before using, and reserve a new port if you can't ensure short downtime on that bind.

May 15 '24 14:05 cdecker

I think I fixed a test_penalty_htlc_tx_timeout[True] flake in https://github.com/ElementsProject/lightning/pull/7364...

Jun 04 '24 07:06 rustyrussell

Excellent, looks quite good. Let's merge and I can update the table (strike the entry through) if we can see it settle in :+1:

Jun 17 '24 10:06 cdecker

Here are the current outcomes for the branch of #7364:

select count(*), outcome from testruns where github_ref LIKE '%7364%' AND testname LIKE '%penalty_htlc_tx_time%' group by outcome;
22|2 # Success
2|3 # Failure
4|4 # Skipped

So it looks like it is partially fixed (~10% failure rate compared to the above 17%), but not stable.

Jun 17 '24 11:06 cdecker

A quick update on the stats, some got better some got worse:

Test Name	Runs	Failures	Flakyness
test_rbf_reconnect_tx_construct	10	5	33.33%
test_grpc_connect_notification	63	26	29.21%
test_wss_proxy	68	22	24.44%
test_anchorspend_using_to_remote[True]	54	17	23.94%
test_onchain_their_unilateral_out[True]	60	12	16.67%
test_penalty_htlc_tx_fulfill[True]	50	10	16.67%
test_penalty_htlc_tx_timeout[True]	50	10	16.67%
test_penalty_outhtlc[True]	59	10	14.49%
test_penalty_rbf_normal[True]	59	10	14.49%
test_onchain_middleman_simple[True]	60	10	14.29%
test_onchain_middleman_their_unilateral_in[True]	60	10	14.29%
test_onchain_timeout[True]	60	10	14.29%
test_penalty_inhtlc[True]	60	10	14.29%

Notice that we are not doing as many tests, likely because of the release feature freeze last month.

Jun 17 '24 13:06 cdecker

lightning lightning copied to clipboard

ci: Flaky tests are slowing down the process of merging

lightning
lightning copied to clipboard