go-dqlite
Tests remain sensitive to timing when on slower hardware
While working on the Debian packaging for this library, I've found that builds on several architectures fail semi-randomly because some tests are sensitive to timing. The recent pull request https://github.com/canonical/go-dqlite/pull/167 makes things much better, but there are still occasional failures when building and running on slower hardware such as an ARM box.
A reliable way I've found to expose this issue is to build the library and run its tests on a Raspberry Pi 3B that I have available locally (arm64, running Debian bullseye off a micro-SD card). Building v1.10.1 plus that cherry-picked pull request, I typically see one or maybe two test failures. It's not always the same test that fails, nor do the tests seem to fail with equal probability. I haven't taken rigorous notes on the failing tests, but some of the more frequent ones are:
- TestHandover_TransferLeadership
- TestRolesAdjustment_ReplaceVoter
- TestRolesAdjustment_ReplaceVoterHonorFailureDomain
- TestRolesAdjustment_ReplaceVoterHonorWeight
- TestRolesAdjustment_ReplaceStandByHonorFailureDomains
If there's other information that I can provide to help resolve this issue, just let me know!
I would expect https://github.com/canonical/go-dqlite/pull/170 and https://github.com/canonical/go-dqlite/pull/168 to help a lot in that case too, especially the first one.
#170 does indeed help, although I still get random test failures on my Raspberry Pi. I ran 20 builds using sbuild (so each run is in a clean, fresh environment), and only 2 runs passed all tests. The others all had at least one test failure. On my normal build server (amd64), there are absolutely no issues: the tests pass run after run.
I think it has to do with the TLS implementation in older Go versions. If you have time, can you experiment with Go 1.17 and check whether you see the same behaviour? I'm not quite sure how to go forward: maybe I won't use TLS on armhf in the tests and accept that it's going to be slow, or I'll search for a faster TLS library.
This weekend I built v1.10.2 of this library on an arm64 system (Raspberry Pi 3B) running Debian unstable and Go 1.17.5. I performed 25 builds using sbuild and observed the following tests failing. (None of the builds had all tests pass.)
- TestClient_Dump (11x)
- TestClient_Transfer (9x)
- TestClient_Transfer (1x)
- TestHandover_TransferLeadership (2x)
- TestIntegration_ExecBindError (2x)
- TestIntegration_LeadershipTransfer (2x)
- TestMembership (9x)
- TestNew_ClusteredKvReadWrite (11x)
- TestNew_ClusteredTimeout (11x)
- TestNew_Default (1x)
- TestNew_KvReadWrite (1x)
- TestProtocol_RequestWithDynamicBuffer (21x)
- TestRolesAdjustment_ReplaceVoterHonorFailureDomain (1x)
- TestRolesAdjustment_ReplaceVoterHonorWeight (1x)