Expose stats for slow start phase
For #2858 it would make sense to introduce some metrics for the slow start phase that can be tracked in Firefox telemetry.
I'm thinking of the following:
- slow_start_exit_ratio: ratio of connections that exit slow start at all - algorithm improvements mostly only apply to those, because the exit point is the biggest differentiator between algorithms
  - I think this is also generally useful to inform the impact of changes to congestion avoidance
- slow_start_exit_reason: loss vs. non-loss - we want to avoid exit by loss, though for now 100% of exits will be due to loss, because that's the only thing we implement right now
  - modern slow start algorithms like HyStart++ or SEARCH will change that
- slow_start_distance_to_ideal_exit: measure the `ssthresh` when exiting slow start and compare it to the `cwnd` value right before the first congestion event after slow start - this works on the assumption that an ideal slow start exit would be without packet loss, but as close to the actual congestion point as possible, i.e. spending as much time in slow start as possible without inducing loss
  - being very close means we exited close to the potential congestion point, i.e. didn't undershoot
  - this is kind of best effort, because of course the first congestion event after slow start isn't necessarily close to the actual optimal exit point, but I think it is the best indicator we have for slow start efficiency in metrics
  - I very much welcome alternative design approaches for this metric specifically; a rough sketch of how these metrics could be tracked follows below
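To make the proposal more concrete, here is a minimal sketch of how the three metrics could be tracked. All type names, fields, and hooks are made up for illustration; this is not neqo's actual stats API.

```rust
// Hypothetical counters for the three proposed metrics; names, types, and
// hooks are illustrative assumptions, not neqo's actual stats structures.
#[derive(Default, Debug)]
pub struct SlowStartStats {
    /// Connections that exited slow start at all; slow_start_exit_ratio is
    /// this divided by the total number of connections.
    pub exits: u64,
    /// slow_start_exit_reason, split by cause.
    pub exits_on_loss: u64,
    pub exits_on_heuristic: u64,
    /// slow_start_distance_to_ideal_exit samples: ssthresh at exit compared
    /// to the cwnd right before the first post-slow-start congestion event.
    pub distance_to_ideal_exit: Vec<u64>,
}

impl SlowStartStats {
    pub fn on_slow_start_exit(&mut self, caused_by_loss: bool) {
        self.exits += 1;
        if caused_by_loss {
            self.exits_on_loss += 1;
        } else {
            self.exits_on_heuristic += 1;
        }
    }

    /// Called once, at the first congestion event after slow start.
    pub fn on_first_congestion_after_exit(&mut self, ssthresh_at_exit: u64, cwnd_before_event: u64) {
        self.distance_to_ideal_exit
            .push(cwnd_before_event.saturating_sub(ssthresh_at_exit));
    }
}
```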
I'll try to write a patch soon. I think it would make sense if this lands closely after #3086 so we can get a bunch of stats into Firefox at once without having to do multiple releases.
> I think it would make sense if this lands closely after #3086 so we can get a bunch of stats into Firefox at once without having to do multiple releases.
Whatever works best for you. Doing many small releases is fine as well.
> this is kind of best effort, because of course the first congestion event after slow start isn't necessarily close to the actual optimal exit point, but I think it is the best indicator we have for slow start efficiency in metrics
I cannot think of a better indicator. That said, I'm not sure it is a good indicator. @mwelzl mentioned at IETF 124 that the cwnd after slow start (i.e. half the slow start exit cwnd) is usually still too high. See the example on slide 5 below.
https://datatracker.ietf.org/meeting/120/materials/slides-120-iccrg-ssthresh-after-slow-start-overshoot-01.pdf
Maybe the number of RTTs until the first loss after slow start is a better indicator. I have to give this more thought. What does the literature say?
About "usually": it depends on the queue length (the longer, the more likely), and pacing will also make it more likely to be too high. I suspect it often is, but I don't have much data yet.
Which brings me to the indicator - no, I don't think it's good, because: how exactly do you define the loss event? And, as my slides show, you're not going to get close to that value upon loss anyway (especially not with Cubic, which uses a beta of 0.7). The number of RTTs until the first loss after slow start sounds interesting - an indication of how long we were able to stay in slow start. Perhaps also the number of RTTs of the loss recovery period following it?
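A minimal sketch of counting those rounds, under the assumption that we get a per-RTT-round callback after slow start exit; the hooks and names are hypothetical, not an existing neqo API.

```rust
// Counts RTT rounds from slow start exit to the first loss, and the length
// of the recovery period that follows. All hooks are hypothetical.
#[derive(Default, Debug)]
pub struct PostSlowStartRounds {
    rounds_since_exit: u64,
    pub rounds_to_first_loss: Option<u64>,
    pub recovery_rounds: u64,
    in_recovery: bool,
}

impl PostSlowStartRounds {
    /// Call once per RTT round after slow start has been exited.
    pub fn on_rtt_round(&mut self) {
        self.rounds_since_exit += 1;
        if self.in_recovery {
            self.recovery_rounds += 1;
        }
    }

    /// Call when the first loss after slow start is detected.
    pub fn on_first_loss(&mut self) {
        if self.rounds_to_first_loss.is_none() {
            self.rounds_to_first_loss = Some(self.rounds_since_exit);
            self.in_recovery = true;
        }
    }

    /// Call when the loss recovery period following that first loss ends.
    pub fn on_recovery_end(&mut self) {
        self.in_recovery = false;
    }
}
```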
Thanks for the input and shared slide deck, this is very interesting. I will also have to give this more thought and take some time to go over the material.
I have opened #3162 for further discussion on the point of loss based slow start exit having multiple losses with some initial experimentation data on tuning the reduction factor in our simulator.
Concerning the metrics
If I understand correctly, the above concerns are mainly about slow start exit upon loss. IMO, the described metric comparing the cwnd at slow start exit to the cwnd before the first non-slow-start loss should work fine as an approximation for slow start exit on a heuristic, i.e. for measuring undershoot.
My intuition is that in the case of overshoot (and a loss exit) we would always overshoot by a lot due to the nature of exponential growth. So that would always be inefficient.
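To make that intuition concrete, a rough bound, assuming the cwnd doubles every RTT and the path (BDP plus bottleneck queue) can absorb at most C bytes:

```latex
% Round k is the last round without loss, so the round that triggers the
% loss-based exit satisfies:
cwnd_k \le C, \qquad cwnd_{k+1} = 2\, cwnd_k \quad\Rightarrow\quad C < cwnd_{k+1} \le 2C
% i.e. a loss-based exit can overshoot the congestion point by up to ~100%,
% before the extra RTT needed to actually detect the loss is accounted for.
```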
How good a heuristic is at exiting without loss could be looked at through the slow_start_exit_reason metric above.
What are your thoughts on the slow_start_distance_to_ideal_exit metric idea using the cwnd diff under the assumption that it is only measured for connections that exited slow start in a non-loss way?
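A minimal sketch of that restriction, assuming a hypothetical exit reason enum; not tied to the actual congestion controller code:

```rust
// Only record the distance metric for heuristic (non-loss) exits; the enum
// and function are hypothetical, for illustration only.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum SlowStartExitReason {
    Loss,
    Heuristic,
}

pub fn maybe_record_distance_to_ideal_exit(
    reason: SlowStartExitReason,
    ssthresh_at_exit: u64,
    cwnd_before_first_loss: u64,
    samples: &mut Vec<u64>,
) {
    // For a loss-based exit, the comparison point is the very loss that
    // caused the exit, so the sample would not measure undershoot.
    if reason == SlowStartExitReason::Heuristic {
        samples.push(cwnd_before_first_loss.saturating_sub(ssthresh_at_exit));
    }
}
```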
Big picture
I'd like to land on some kind of metric-suite that enables us to compare different slow start algos on real users. Right now that would be Reno, HyStart++ and SEARCH, as outlined in #2858.
@mwelzl could you expand on the idea of measuring the number of RTTs of the loss recovery period following it (following what exactly - the slow start exit, or the first loss after it)? I don't quite understand what signal that would give.
If you look at the plots in the slides, you'll see that the chance of getting a "we have saturated the network and hence will see a double loss with beta >= 0.5" type of loss grows with the queue length. The larger the queue, the more likely it is. If there is a queue length threshold Q for this, then with pacing, Q would be smaller.
So what I thought is that we could use heuristics that indicate the size of the queue. For instance, if the queue is a BDP, and we're alone, when the queue is full the measured RTT will have doubled. From this kind of relationship, we could make an estimate...
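A minimal sketch of that kind of RTT-based estimate, assuming a single flow at a FIFO bottleneck; the function and its inputs are illustrative, not an existing neqo API:

```rust
// Rough queue occupancy estimate from RTT inflation, assuming a single flow
// at a FIFO bottleneck: queued_bytes ~= (rtt - min_rtt) * bottleneck_rate,
// with bottleneck_rate ~= bdp_bytes / min_rtt.
pub fn estimated_queue_bytes(min_rtt_ms: f64, current_rtt_ms: f64, bdp_bytes: f64) -> f64 {
    let inflation = (current_rtt_ms - min_rtt_ms).max(0.0) / min_rtt_ms;
    // If the measured RTT has doubled (inflation == 1.0), roughly one BDP
    // worth of data is sitting in the queue, matching the example above.
    inflation * bdp_bytes
}
```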
But maybe your slow_start_distance_to_ideal_exit metric is more direct? I don't know exactly how it's calculated, and I'm not quite sure how to use it, but I have a nagging feeling that using the sequence number that DupACKs tell us about could also be good somehow?
This is all stuff that I have in my "think about it and do some research when you have time" list...