fix(e2e): flakiness when upgrading containers for `chainB`
Background
The e2e suite has recently gone through several refactors. The latest change, #1999, has likely introduced test flakiness.
Info from @czarcas7ic:
- Failing run: https://github.com/osmosis-labs/osmosis/actions/runs/2645267194/attempts/1
- Failing line: https://github.com/osmosis-labs/osmosis/blob/938f9bdb4ce05e178340b63c25452505ff7c6a3d/tests/e2e/configurer/upgrade.go#L250
- It always fails when upgrading containers for `chainB`
Info from @p0mvn:
- Could not reproduce locally; tried re-running 10 times
- #1999 is the most likely cause
Acceptance Criteria
- Investigate and fix e2e test flakiness
- Re-running CI 10 times does not cause an issue (trigger runs by making redundant changes)
Added more logs here: https://github.com/p0mvn/osmosis/pull/14
Trying to manually trigger e2e in CI multiple times to reproduce this.
I have 2 updates on this:
- I was not able to reproduce in CI with extra logs. Tried running 10 times on my fork: https://github.com/p0mvn/osmosis/runs/7311008352?check_suite_focus=true
- The first update should not be a big problem because #2040 refactors the logic for waiting for a certain height. Instead of using the CLI, it now uses the Tendermint RPC, which should be more reliable and easier to debug (see the sketch below). We need this refactor for the next step in state-sync, so it might address two problems at once.
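For context, a minimal sketch of what waiting for a target height over Tendermint RPC could look like. The package alias, helper name, endpoint, and retry budget here are assumptions for illustration, not the actual code in #2040:

```go
package e2e

import (
	"context"
	"fmt"
	"time"

	rpchttp "github.com/tendermint/tendermint/rpc/client/http"
)

// waitUntilHeight polls the node's Tendermint RPC endpoint until the chain
// reaches targetHeight or the retry budget is exhausted.
// (Hypothetical helper; rpcAddress and the 60-attempt budget are placeholders.)
func waitUntilHeight(rpcAddress string, targetHeight int64) error {
	client, err := rpchttp.New(rpcAddress, "/websocket")
	if err != nil {
		return err
	}

	for i := 0; i < 60; i++ {
		status, err := client.Status(context.Background())
		if err == nil && status.SyncInfo.LatestBlockHeight >= targetHeight {
			return nil
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("node %s did not reach height %d", rpcAddress, targetHeight)
}
```

Polling the RPC status endpoint avoids shelling out to the CLI, so failures surface as explicit errors rather than parsed command output.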
Recent e2e flakiness: https://github.com/osmosis-labs/osmosis/runs/7351687619?check_suite_focus=true
https://github.com/osmosis-labs/osmosis/pull/2078
Another recent instance: https://github.com/osmosis-labs/osmosis/runs/7373637913?check_suite_focus=true
This one was at initialization, though.
Another instance: https://github.com/osmosis-labs/osmosis/runs/7458951335?check_suite_focus=true
Had a chat with @nikever about this. The plan is to set up self-hosted runners and retain container logs to improve the debugging experience.
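As a rough idea of what retaining container logs might look like, assuming the e2e suite can shell out to the docker CLI on test failure (the helper name and output directory are hypothetical):

```go
package e2e

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// dumpContainerLogs writes `docker logs` output for the given container to a
// local directory so CI can upload it as an artifact after a failed run.
// (Hypothetical helper; containerName and outDir are placeholders.)
func dumpContainerLogs(containerName, outDir string) error {
	if err := os.MkdirAll(outDir, 0o755); err != nil {
		return err
	}

	out, err := exec.Command("docker", "logs", containerName).CombinedOutput()
	if err != nil {
		return fmt.Errorf("docker logs %s: %w", containerName, err)
	}
	return os.WriteFile(filepath.Join(outDir, containerName+".log"), out, 0o644)
}
```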
Awesome, self-hosted runners are for sure a game changer.
Probably fixed by #2556. I am going to close this for now; if we see another instance of this happening, we can reopen and look into it deeper.