Tracking issue: Bad RTT during connection startup
At the beginning of a connection we have observed that it takes a while before we get a reasonable and stable Round-Trip Time (RTT). We need to be better at this:
- #2176 is the main issue in which we've been exploring how to improve.
- There is a branch, https://github.com/n0-computer/iroh/tree/flub/messing-around-with-rtt, which explores how to improve on this. It concentrates on resetting the congestion-controller state when the network path changes.
- #2169 is closely related; it results in too many connection type changes.
- We should have metrics that show the RTT of a long-lived connection. Due to how metric collection works, this would not let us observe the problem at the start of a connection very well, but it would still be worthwhile to expose (a minimal sampling sketch follows this list).
- We should be able to properly quantify this, ideally integrated into CI perf tests so we do not regress. One option is to keep a 30s window of RTT values, sampled in small (e.g. 200ms) time buckets, for connections in a magic socket, and to expose it via the doctor and/or the perf binary (see the ring-buffer sketch after this list). We could also make this a sliding window of the last 30s, but that is likely too expensive to keep for every connection in a production setup.
- We need a way of ensuring this does not regress in CI.
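
To make the metrics bullet above concrete, here is a minimal sketch of sampling a connection's RTT estimate every 200ms and exposing the latest value in milliseconds through an atomic that a metrics exporter could read. It relies on quinn's `Connection::rtt()` and `Connection::close_reason()`; the `RTT_MS` static, the 200ms interval, and the function name are illustrative assumptions, not iroh's actual metrics plumbing.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

/// Latest observed RTT in milliseconds; a metrics exporter could publish
/// this as a gauge. Name and mechanism are illustrative only.
pub static RTT_MS: AtomicU64 = AtomicU64::new(0);

/// Periodically sample the connection's RTT estimate and store the most
/// recent value. Returns the task handle so the caller can abort it.
pub fn spawn_rtt_sampler(conn: quinn::Connection) -> tokio::task::JoinHandle<()> {
    tokio::spawn(async move {
        let mut interval = tokio::time::interval(Duration::from_millis(200));
        loop {
            interval.tick().await;
            // Stop sampling once the connection has closed.
            if conn.close_reason().is_some() {
                break;
            }
            RTT_MS.store(conn.rtt().as_millis() as u64, Ordering::Relaxed);
        }
    })
}
```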
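
For the quantification bullet, here is a sketch of the proposed fixed 30s window of RTT values in 200ms buckets (150 slots), which the doctor and/or the perf binary could summarise. The `RttWindow` type and its methods are hypothetical names, not existing iroh API.

```rust
use std::collections::VecDeque;
use std::time::Duration;

/// Fixed-size window of RTT samples: 30s of history in 200ms buckets,
/// i.e. 150 slots. Sizing and naming are illustrative.
pub struct RttWindow {
    samples: VecDeque<Duration>,
    capacity: usize,
}

impl RttWindow {
    pub fn new() -> Self {
        let capacity = 150; // 30s / 200ms
        Self {
            samples: VecDeque::with_capacity(capacity),
            capacity,
        }
    }

    /// Record one RTT sample, dropping the oldest once the window is full.
    pub fn push(&mut self, rtt: Duration) {
        if self.samples.len() == self.capacity {
            self.samples.pop_front();
        }
        self.samples.push_back(rtt);
    }

    /// Summarise the window as (min, max, mean), e.g. for doctor output.
    pub fn summary(&self) -> Option<(Duration, Duration, Duration)> {
        if self.samples.is_empty() {
            return None;
        }
        let min = *self.samples.iter().min().unwrap();
        let max = *self.samples.iter().max().unwrap();
        let sum: Duration = self.samples.iter().sum();
        let mean = sum / self.samples.len() as u32;
        Some((min, max, mean))
    }
}
```

Feeding such a window from the sampler above and printing the summary from the doctor could give a first rough signal; CI perf tests could then assert bounds on the early part of the window.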