matrix-rust-sdk icon indicating copy to clipboard operation
matrix-rust-sdk copied to clipboard

NSE test is ignored (because it's flaky)

Open andybalaam opened this issue 1 year ago • 6 comments

We disabled the NSE test in https://github.com/matrix-org/matrix-rust-sdk/pull/3409 because it was freezing up and timing out.

This issue is to cover the work to de-flake and re-enable it.

andybalaam avatar May 14 '24 11:05 andybalaam

I ran it 100 times and the 48th time it froze after:

alice_main has created and enabled encryption in the room

and before:

alice_main and bob are both aware of each other in the e2ee room

Trying with more logging.

andybalaam avatar May 14 '24 11:05 andybalaam

It appears to freeze at this line:

assert!(bob.room_is_encrypted(&room_id).await);

andybalaam avatar May 14 '24 11:05 andybalaam

It seems sync_until in that same file tries to implement async timeouts, but incorrectly does so (if f() never returns, then it's never timing out, since this is actively polling on it). One first step could be to use the tokio::time::timeout() method there instead, which will make the test fail with an actual timeout error.

bnjbvr avatar May 14 '24 12:05 bnjbvr

I implemented a timeout using tokio::time::timeout() as described, but I still got a freeze after:

alice_nse received msg1 from bob

I can't tell where it's freezing. I guess more debug logging is needed.

andybalaam avatar May 14 '24 14:05 andybalaam

I added tokio::time::timeout blocks around each step of the test, and now the test is consistently passing for me (I ran it 200 times).

andybalaam avatar May 15 '24 07:05 andybalaam

I'm afraid I have run out of time to try and fix this, so I have to give up :-(

I don't see any obvious problems with the test code - it instantiates 3 clients, one of which is a NotificationClient clone of another, adds the users to an encrypted room, then sends messages and expects them to be received.

It freezes up about 1 time in 100 (on my machine), and I don't know how to debug further. Adding timeout blocks prevents the freezes, and seems to make the problem less likely to happen, but I did see a timeout trigger on my machine just now.

We have a complement-crypto test for the bug this is testing, which will be run on every CI run, so that offers some coverage.

I would dearly love to figure out why this is not working, but I've just run out of time.

andybalaam avatar May 16 '24 12:05 andybalaam