Early Stopping and test pass/fail
From the Early Stopping proposal:
Practical Impact: The proposed change affects three types of systems. First, systems with overlatency percentiles less than the target tail latency will be able to stop early. Second, systems with overlatency percentiles very close to the target tail latency may require longer runs. Finally, systems with overlatency percentiles in the range (p - d, p) will be accepted if their run is sufficiently long, where previously they would have been rejected. The ranges affected depend on the choice of parameters.
Currently we set d (tolerance) to zero in our LoadGen. With this, I don't think we can have the 'third case'. If it's the second case, is it possible that Early Stopping wants the test to run forever?
In loadgen.cc below, at line 755 together with line 936 for example, we don't really prioritize min_query_count being satisfied (ES might not care about min_duration) over early stopping being satisfied; and I wonder whether we can see false failures because ES somehow wants more queries to be seen. In other words, if min_query_count and min_duration are all satisfied, is there any reason ES's necessary condition should affect whether the test passes?
https://github.com/mlcommons/inference/blob/master/loadgen/loadgen.cc#L755
https://github.com/mlcommons/inference/blob/master/loadgen/loadgen.cc#L936
@ckstanton @tjablin I'd appreciate any thoughts - thank you!
Currently we set d (tolerance) to zero in our LoadGen. With this, I don't think we can have the 'third case'. If it's the second case, is it possible that Early Stopping wants the test to run forever?
Correct. In setting d = 0, we no longer have this 3rd case. In the 2nd case, early stopping should only want to run forever if the overlatency percentile of the system is identical to the target percentile. As soon as the overlatency percentile is even just slightly higher than the target, it will eventually terminate with success if the number of processed queries is large enough.
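To get a feel for "large enough", here is a rough back-of-the-envelope sketch. It is not LoadGen code (all names here are mine), and it assumes the early stopping check is essentially a one-sided binomial test at 99% confidence against the target percentile; under a normal approximation it estimates how many queries a system needs before it can pass, as a function of its true overlatency percentile. The required count diverges as that percentile approaches the target.

```cpp
#include <cmath>
#include <cstdio>

// Rough normal-approximation estimate (NOT actual LoadGen code) of how many
// queries are needed before a one-sided 99%-confidence binomial test can
// conclude that the underlying percentile beats the target. It uses the
// expected overlatency count and ignores its run-to-run variance.
// q_true   = the system's true overlatency rate (must be < q_target to pass)
// q_target = target overlatency rate, e.g. 0.01 for a 99% target percentile
double EstimateQueriesNeeded(double q_true, double q_target) {
  const double z99 = 2.326;  // one-sided 99% normal quantile
  // Passing roughly requires:
  //   n * q_true <= n * q_target - z99 * sqrt(n * q_target * (1 - q_target))
  // Solving for n:
  double margin = q_target - q_true;
  return std::pow(z99 * std::sqrt(q_target * (1.0 - q_target)) / margin, 2.0);
}

int main() {
  // Example: 99% target percentile. As the system's true percentile
  // approaches 99%, the required query count blows up.
  for (double true_percentile : {0.9950, 0.9920, 0.9910, 0.9905, 0.9901}) {
    double n = EstimateQueriesNeeded(1.0 - true_percentile, 0.01);
    std::printf("true percentile %.2f%% -> ~%.0f queries\n",
                100.0 * true_percentile, n);
  }
  return 0;
}
```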
In loadgen.cc below, at line 755 together with line 936 for example, we don't really prioritize min_query_count being satisfied (ES might not care about min_duration) over early stopping being satisfied; and I wonder whether we can see false failures because ES somehow wants more queries to be seen. In other words, if min_query_count and min_duration are all satisfied, is there any reason ES's necessary condition should affect whether the test passes?
We could see “false failures” in the sense that a run that might have passed with the previous rules (i.e., using min_query_count, min_duration, and observed tail latency) might now fail with early stopping. For systems with overlatency percentiles close to (but still higher than) the target, a larger min_query_count should be selected, to allow for more queries to be processed and early stopping to pass.
Thanks @ckstanton
For systems with overlatency percentiles close to (but still higher than) the target, a larger min_query_count should be selected, to allow for more queries to be processed and early stopping to pass.
For this, since we don't know the required min_query_count that the policy effectively wants (I believe we don't expose this info anywhere in LoadGen), I don't think you can really use settings.min_query_count to get the test past the early stopping passing criterion. I see this as a limitation of the current setup.
I still feel it's a hard ask to run more than the query count and duration requirements that the policy defines, just to make sure the early stopping criteria are met. I think the SingleStream/MultiStream scenarios are fine, since early stopping won't ask for more than the statically chosen query count requirement. In the Server scenario, however, the query count required for early stopping moves dynamically, and it's possible we just have to run really, really long, which is completely against the purpose of the early stopping idea. I'm worried about the case where queries are added because early stopping asks for more, the criterion is still not met, and early stopping asks for even more queries. This could go on forever.
In other words, the uncertainty about how many queries early stopping requires to pass in the Server scenario is troublesome.
The effective required min_query_count depends on the underlying overlatency percentile of the system. Based on the observed overlatency percentile of a run, it would be possible to have loadgen estimate the number of samples needed for early stopping to terminate successfully on a future run. It sounds like it would be useful to implement this, but I think it’s very close to the submission deadline to be adding more loadgen changes.
In the meantime, I could add a table showing some examples of number of queries run and the effective overlatency percentiles, so that if a run is unsuccessful, submitters might have a better idea of how many queries they should try. I believe both overlatency and total queries are now logged, so one could take a look at those values, compare the overlatency percentile to the table, and choose a new min_query_count accordingly.
Just to give a sense of how far off the target percentile you’d need to be for this to come up, if you were to run 215,524 queries, then the effective percentile used by early stopping would be 99.05% instead of 99%. The previous min_query_count for server scenario was higher than this (270,366), so early stopping would only cause longer runs if the system’s overlatency percentile fell between 99% and 99.05%.
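For anyone who wants to reproduce the kind of table mentioned above, here is a hedged sketch. It is my own reconstruction of the criterion, not the LoadGen implementation: with d = 0, I read the pass condition as P(Binomial(n, 1 - target_percentile) <= t) <= 1 - confidence, where t is the observed overlatency count. The program below computes, for a given query count, the largest overlatency count that still passes and the resulting "effective percentile"; for n = 215,524 it lands at roughly 99.05%, matching the figure above.

```cpp
#include <cmath>
#include <cstdio>

// Sketch (my reconstruction, not the actual LoadGen implementation) of the
// one-sided binomial test I believe early stopping performs with d = 0.
double LogBinomPmf(long long n, long long k, double q) {
  // log of the Binomial(n, q) probability mass at k; lgamma keeps this
  // numerically stable for large n.
  return std::lgamma(n + 1.0) - std::lgamma(k + 1.0) - std::lgamma(n - k + 1.0) +
         k * std::log(q) + (n - k) * std::log1p(-q);
}

// Largest overlatency count t that still satisfies
// P(Binomial(n, q_target) <= t) <= alpha for a run of n queries.
// Returns -1 if even zero overlatency queries cannot pass (n too small).
long long MaxPassingOverlatencyCount(long long n, double q_target, double alpha) {
  double cdf = 0.0;
  for (long long t = 0; t <= n; ++t) {
    cdf += std::exp(LogBinomPmf(n, t, q_target));
    if (cdf > alpha) return t - 1;
  }
  return n;
}

int main() {
  const double q_target = 0.01;  // 99% target percentile
  const double alpha = 0.01;     // 1 - confidence, for 99% confidence
  for (long long n : {100000LL, 215524LL, 400000LL}) {
    long long t_max = MaxPassingOverlatencyCount(n, q_target, alpha);
    double effective_percentile = 100.0 * (1.0 - static_cast<double>(t_max) / n);
    // e.g. n = 215524 gives an effective percentile of roughly 99.05%.
    std::printf("n = %7lld  max passing overlatency = %5lld  effective %%ile = %.4f%%\n",
                n, t_max, effective_percentile);
  }
  return 0;
}
```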
Thank you @ckstanton - I think we cannot do anything about this for 2.0, but for 2.1, I think we need to bring the policy requirements of query count and duration into LoadGen. Then we can 'disable' the Early Stopping once we meet those requirements.
Also, I'm curious whether it makes sense for Early Stopping to really halt the test 'early' once it has seen enough samples. I know it's all about statistics, but this would make the runs much easier for many people. LoadGen handles the histogram update off the critical path (at least it tries to), and we might be able to compare the real 90% latency against the ES estimate to see whether it crosses a certain threshold, and eventually 'early stop' the test. Does this make sense to you?
Thank you @ckstanton - I think we cannot do anything about this for 2.0, but for 2.1, I think we need to bring the policy requirements of query count and duration into LoadGen. Then we can 'disable' the Early Stopping once we meet those requirements.
There are some cons to having a min_query threshold below which we report the early stopping estimate, and above which we report the seen overlatency percentile:
- The description of the metric becomes messy; you’re no longer comparing apples to apples. When the run uses early stopping, the metric is “we are 99% confident that the underlying latency percentile is at least this good”. For a longer run past the min_query threshold, it would be reporting “seen overlatency percentile” instead.
- There will be a noticeable jump between running, say, (min_query - 1) queries and running min_query queries.
Though it’s certainly reasonable to see these cons and decide that they’re worth dealing with in order to stop submitters from feeling like they need to run many queries in order to get a reasonable percentile estimate. I’m curious what others in the WG think about this.
Also, I'm curious whether it makes sense for Early Stopping to really halt the test 'early' once it has seen enough samples. I know it's all about statistics, but this would make the runs much easier for many people. LoadGen handles the histogram update off the critical path (at least it tries to), and we might be able to compare the real 90% latency against the ES estimate to see whether it crosses a certain threshold, and eventually 'early stop' the test. Does this make sense to you?
This would be great to implement for the future. For server, the ideal behaviour would be pretty straightforward… just keep processing queries until early stopping passes. For single stream, is your suggestion that we set some kind of tolerance threshold, and once the early stopping estimate and the seen 90%ile latency are close enough, loadgen would stop the test?
The description of the metric becomes messy; you’re no longer comparing apples to apples. When the run uses early stopping, the metric is “we are 99% confident that the underlying latency percentile is at least this good”. For a longer run past the min_query threshold, it would be reporting “seen overlatency percentile” instead.
Thanks @ckstanton. I (think I) understand what you meant by this. But at the same time, we don't give a rigorous meaning to our current 90%-tile and/or 99%-tile latency (or at least we don't advertise one). When we compare the current metrics to the ES metrics, I think we end up saying something similar -- both are statistically meaningful metrics, and given our model, both are still approximations. I don't intend to say 'since it's simplified/approximate, it's irrelevant or less important'. I would love to do the opposite, i.e. strengthen the link and be rigorous about what we say. :) But since this is benchmarking, it's also important to be less confusing. My question would be: "If the test passes the policy's duration and query count requirements, what does the Early Stopping estimate really say?" I get what it says, but to many submitters it's just a confusing metric they won't care much about in this case (at least I think so). Is there any reason to report the ES estimate instead of the 90%-tile latency if the duration/query-count requirements are met? I really cannot find one.
And today we already allow extrapolating one metric from another, like MultiStream from SingleStream, etc. (and we don't advertise what that really means, other than that it's acceptable because it's pessimistic). I think jumping from the ES estimate to, say, the 90%-tile latency would be a similar situation.
Or, as you said, we might need to be more precise about what the metric is, by annotating that it's derived from the ES estimate or converted from MultiStream, etc.
For single stream, is your suggestion that we set some kind of tolerance threshold, and once the early stopping estimate and the seen 90%ile latency are close enough, loadgen would stop the test?
Yeah, I didn't think deeply about what the best mechanism would be, but I imagine something like that. For Server, when ES sees its own requirement met, we can halt the test. Or we can wait until that requirement has stayed met for a certain period and then halt, to avoid stopping right on the edge. For Single/MultiStream, since we have a predetermined minimal query count, and since we know how the estimate approaches the 90%-tile, we might be able to find a 'halting' query count threshold.
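To make the Server part of this a bit more concrete, here is a hypothetical sketch; all of the names are invented and nothing like this exists in LoadGen today. The idea is to re-evaluate the early stopping condition as queries complete and only halt once it has held for a grace window of queries, which addresses the "avoid being on the edge" point.

```cpp
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical sketch (invented names, not actual LoadGen APIs) of the
// "let early stopping halt the Server test early" idea: after each completed
// query, re-evaluate the early stopping pass condition and only stop once it
// has held for a grace window of queries, so we don't stop right on the edge.
class EarlyHaltMonitor {
 public:
  // early_stopping_passes: stand-in for the existing pass/fail evaluation,
  // taking (queries_processed, overlatency_count).
  EarlyHaltMonitor(std::function<bool(int64_t, int64_t)> early_stopping_passes,
                   int64_t grace_queries)
      : early_stopping_passes_(std::move(early_stopping_passes)),
        grace_queries_(grace_queries) {}

  // Call after each completed query, once min_duration and min_query_count
  // are already satisfied. Returns true when the test may stop early.
  bool ShouldHalt(int64_t queries_processed, int64_t overlatency_count) {
    if (early_stopping_passes_(queries_processed, overlatency_count)) {
      ++consecutive_passes_;
    } else {
      consecutive_passes_ = 0;
    }
    return consecutive_passes_ >= grace_queries_;
  }

 private:
  std::function<bool(int64_t, int64_t)> early_stopping_passes_;
  int64_t grace_queries_;
  int64_t consecutive_passes_ = 0;
};
```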
BTW, I really hit this last night: when I tune the Server scenario to the extreme, i.e. a 99%-tile latency of 99.999 ms against the 100 ms requirement, then regardless of the run meeting both the required duration and query count, early stopping trips the test to fail and asks for more queries to run. I had to lower the target QPS to avoid this, and this really hurts the usefulness of Early Stopping, in my opinion (in its current form).
I think it's worth summarizing what we have been discussing, together with my claims and my proposals. :)
Claims:
a. Scenarios have required (minimum) durations and query counts. They are set so that the samples collected during the run represent the true population with a certain confidence (e.g., the number of collected samples is large enough to represent the latency distribution we would see with an unlimited number of samples, passing a hypothesis test at a 99% confidence level; and/or the duration has to be long enough that we can see the machine in a hot state).
b. The requirements from a. can be really painful for some very slow machines, especially the query count. Early Stopping estimates can help here, reducing the number of queries to run by providing an estimate that is guaranteed to be 'over' the bar with, again, a certain statistical confidence.
c. We allow estimating a metric of one scenario from a metric of a different scenario, i.e. one is allowed to estimate MultiStream latency using SingleStream latency. This also eases the effort arising from a., in the same sense as b.
Observations:
i. The model working behind the scenes should be as rigorous and scientific as possible.
ii. The published results should be as simple, precise, and to the point as possible.
iii. Because of ii., we generally don't say that c. was used when we report the result. Submitters know what happened behind the scenes, but advertising all of it causes too much confusion.
iv. Many submitters understand what happens behind the scenes, while it is also possible that some submitters don't care about these details -- they just want to know which metrics to tune and submit.
v. Today, if we collect metrics using the Early Stopping estimate, we still submit that number as if it were the original target metric.
Proposals:
- Since we unify the metric estimated from c. into one single result key/value pair, why don't we do the same for b., i.e. early stopping? (A rough sketch of this reporting logic follows the proposals list below.)
  - 1-1. When the test terminates below the requirements from a., but early stopping can estimate a valid metric, let us report that metric instead of what was collected. We can be 'verbose' in showing what happened, but the final string we see at the bottom of the run would be the early stopping estimate and/or the QPS validated by early stopping.
  - 1-2. If the test runs long enough to fulfill the requirements from a., we can take early stopping out of the picture. It's of course recommended to print out some useful info, but the final results -- metric and pass/fail -- are decided solely in the original way.
  - 1-3. It's important to note that these metrics are not from the same definition -- the early stopping estimate, the original target percentile metric, and converted metrics. Whether to annotate this somewhere can be discussed further.
- Let us have a configuration parameter like allow_earlystop_to_halt_test_early. When this is set by the user, early stopping can try halting the test based on some 'threshold'.
  - 2-1. We can discuss this further: 1) if early stopping sees a valid outcome, 2) if that valid outcome has held true for a certain period, and/or 3) if the early stopping estimate is within 2% above the target metric, etc.
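To make proposal 1 concrete, here is a hypothetical sketch of the reporting logic in 1-1 and 1-2. All structs and names are invented for illustration, not existing LoadGen code: once the policy's minimum duration and query count are met, the seen percentile latency decides the result; below them, the early stopping estimate is reported instead, and a flag records which one was used (one possible way to handle the annotation question in 1-3).

```cpp
#include <cstdint>

// Hypothetical sketch of proposal 1 (invented names, not existing LoadGen
// code): choose which latency metric to report and how to decide pass/fail.
struct RunStats {
  double seen_percentile_latency_ns;   // observed target-percentile latency
  double early_stopping_estimate_ns;   // early stopping bound on the same percentile
  int64_t queries_processed;
  double duration_ns;
};

struct Requirements {
  int64_t min_query_count;
  double min_duration_ns;
  double target_latency_ns;
};

struct ReportedResult {
  double latency_ns;
  bool from_early_stopping;  // annotation for 1-3
  bool passed;
};

ReportedResult ChooseReportedLatency(const RunStats& run, const Requirements& req) {
  const bool met_policy_minimums =
      run.queries_processed >= req.min_query_count &&
      run.duration_ns >= req.min_duration_ns;
  ReportedResult result;
  if (met_policy_minimums) {
    // 1-2: the run is long enough, so early stopping is out of the picture.
    result.latency_ns = run.seen_percentile_latency_ns;
    result.from_early_stopping = false;
  } else {
    // 1-1: short run, so report the early stopping estimate instead.
    result.latency_ns = run.early_stopping_estimate_ns;
    result.from_early_stopping = true;
  }
  result.passed = result.latency_ns <= req.target_latency_ns;
  return result;
}
```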
Thanks, @nv-jinhosuh, for putting this together!
The proposal to use early stopping estimates for shorter runs, and to not use them (i.e. report seen percentiles instead) beyond the minimum duration and query count sounds reasonable to me. We’d have to think about how to communicate the meaning of this metric… but submitters who are already running the minimum duration wouldn’t see a change, so they wouldn’t need to worry about these details.
It would also be useful to get feedback from submitters this round on early stopping pain points.
I think regardless of how we decide on proposal 1, proposal 2 would be very useful, and something I’d like to look into. It’s clear what the halting criteria should be for server (if early stopping is valid, we stop early). But I still need to think a bit about how this might work for single / multi stream.