go-libp2p-pubsub
go-libp2p-pubsub copied to clipboard
Fix flaky tests
Some tests are flaky & do not reliably pass locally or on CI (#141 for example). As discussed with @aschmahmann, the two we know to be flaky are:
- TestGossipsubGossip
- TestGossipsubGossipPiggyback
Please feel free to point out more in the comments.
Also, please can someone assign this issue to me ?
TestGossipsubControlPiggyback I suspect is also flaky given that it is skipped with t.Skip("travis regularly fails on this test").
I can try and do some more looking into this, but I suspect that most of the tests that utilize time.Sleep are flaky to some degree, but most are not currently causing major issues and will therefore be difficult to detect. If these sleeps are actually the problem then replacing them with either the new PubSub events or polling internal data structures (e.g. something like for condition{condition=checkPrivateState(); time.Sleep(100ms)}) should help.
@vyzo @aschmahmann @Stebalien
The first PR makes the flaky test wait for mesh construction & subscription processing rather than sleeping & hoping it has finished. However, that didn't do much.
The tests fail because Travis has a timeout of 10 minutes per test & not all subscribers receive the message before that. I'm not being able to think of how to dig into this issue. What more should we do/investigate for these tests ?
The Travis 10 minute timeout shouldn't be an issue. However, it does look like not all the subscribers are receiving the messages.
My guess is that this is happening because messages are getting dropped here: https://github.com/libp2p/go-libp2p-pubsub/blob/d28851004cfcbccceea98f8eb409c3c01f882097/gossipsub.go#L345-L348
If we can confirm this is true (this can be a simple as throwing a panic instead of logging an error, or adding some internal variable to gossipsub that we can check) that would be great. At that point we can figure out what to do next that might include:
- Diving more deeply into the events that cause the messages to drop (maybe there's a bug somewhere to expose here)
- Changing the tests to test different things. For example, we're testing that a gossipsub network with 20 nodes will faithfully propagate 200 messages if we just slow them down to 2 at a time. However, it's possible that due to the construction of the mesh a single peer could be really slow and end up with a long queue of messages and so some end up dropped. Maybe then we should test with a different mesh topology, expect only some fraction of the messages to propagate, or use some alternative strategy to test the gossip and piggybacking functionalities.
Note: it's interesting if the issue is that our test is bad, because if even the gossipsub authors are accidentally writing tests that make the same incorrect assumptions as in #197 we probably need to be very upfront in our documentation about what will and won't work. Additionally, if it turns out people really want a version of pubsub that has stronger delivery guarantees that's something we may want to consider, as mentioned briefly in libp2p/go-libp2p#721