Early Stopping implementation might have a bug

Open nv-jinhosuh opened this issue 3 years ago • 9 comments

The Early Stopping implementation in LoadGen seems to have a bug: the overlatency count collapses to 1 as the total query count increases. A very fast accelerator is likely to process far more queries than min_query_count in order to meet min_duration, and Early Stopping may still come in handy for any number of reasons.

It is also worth double-checking why the above behavior happens, and verifying that the counts currently in use (meaningful counts > 1) are sound.

I have ported the C++ code to Python, attached as earlystop.zip (run it with python earlystop.py), to demonstrate what is going on. I hope this helps identify the issue.

Attached figures: Picture1, Picture2.
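
For a rough sanity check (my own sketch, not the attached earlystop.py and not the exact LoadGen formula; the function name, the 90th percentile target, and the 0.99 confidence are my assumptions), the number of overlatency queries a one-sided binomial bound can tolerate should keep growing with the total query count, reaching the thousands on fast systems, so a count that collapses to 1 cannot be right:

# Assumptions: 90th percentile target, 0.99 confidence, one-sided binomial
# bound. Illustrative only; not the exact LoadGen computation.
from scipy.stats import binom

def tolerable_overlatency(n, percentile=0.90, confidence=0.99):
    # Largest h such that the (n - h)-th order statistic still upper-bounds
    # the true percentile with the given confidence, i.e. we need
    # P(Binomial(n, percentile) <= n - h - 1) >= confidence.
    k = int(binom.ppf(confidence, n, percentile))
    return max(0, n - 1 - k)

for n in (1_000, 10_000, 100_000, 1_000_000):
    print(n, tolerable_overlatency(n))  # grows roughly like 0.1 * n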

nv-jinhosuh avatar Feb 08 '22 19:02 nv-jinhosuh

Hi @nv-jinhosuh. Thanks for catching this. It turns out the fix is simple; there is no complicated overflow issue going on. There is currently an overlatency query bound set to (1 << 10), so when the overlatency count gets beyond 1024, the early stopping loop terminates early and we have the bug.

There's no need for this overlatency bound, and removing it seems to resolve the issue. So I'll put in a PR to fix this.
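
In case it helps others follow along, here is a schematic of that failure mode in Python (illustrative only; the names and structure are made up, and this is not the actual C++ LoadGen code):

OVERLATENCY_BOUND = 1 << 10  # the bound being removed by the fix

def overlatency_count(total_queries, bound_passes):
    # Grow the candidate overlatency count t while the statistical bound
    # still passes for total_queries samples.
    t = 0
    # Buggy: the extra "t < OVERLATENCY_BOUND" condition stops the loop at
    # the cap on fast systems, before the bound actually fails.
    while t < OVERLATENCY_BOUND and bound_passes(t + 1, total_queries):
        t += 1
    # Fix: drop the cap and stop only when the bound no longer passes:
    #     while bound_passes(t + 1, total_queries): t += 1
    return t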

ckstanton avatar Feb 08 '22 20:02 ckstanton

@ckstanton ah, I see that. :) Thank you!

nv-jinhosuh avatar Feb 08 '22 20:02 nv-jinhosuh

@ckstanton Does this look like a manifestation of the same bug to you?

================================================
MLPerf Results Summary
================================================
SUT name : ArmNN_TFLite_SUT
Scenario : SingleStream
Mode     : PerformanceOnly
90th percentile latency (ns) : 15443391
Result is : VALID
  Min duration satisfied : Yes
  Min queries satisfied : Yes
  Early stopping satisfied: Yes
Early Stopping Result:
 * Processed at least 64 queries (45317).
 * Would discard 0 highest latency queries.
 * Early stopping 90th percentile estimate: 41232125
 * Early stopping 99th percentile estimate: 22679287

================================================
Additional Stats
================================================
QPS w/ loadgen overhead         : 75.53
QPS w/o loadgen overhead        : 75.59

Min latency (ns)                : 9204559
Max latency (ns)                : 41232125
Mean latency (ns)               : 13228655
50.00 percentile latency (ns)   : 14008000
90.00 percentile latency (ns)   : 15443391
95.00 percentile latency (ns)   : 16327907
97.00 percentile latency (ns)   : 17329774
99.00 percentile latency (ns)   : 22200163
99.90 percentile latency (ns)   : 27176409

psyhtest avatar Feb 09 '22 02:02 psyhtest

A question to the WG. We have already collected lots of results like above, which would be rather onerous to rerun before the submission deadline. I'm wondering if we can submit them as they are, and rerun during the review period.

psyhtest avatar Feb 09 '22 02:02 psyhtest

That looks like the same bug to me.

tjablin avatar Feb 09 '22 02:02 tjablin

That looks like the same bug to me.

Agreed. There should be on the order of ~4500 acceptable overlatency queries for this run, which is bigger than the threshold of 1024 where we're seeing the issue.
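
For reference, a plain one-sided binomial bound (90th percentile, 0.99 confidence; not necessarily LoadGen's exact formula) gives a figure of that order for the 45317-query run above:

from scipy.stats import binom

n, percentile, confidence = 45_317, 0.90, 0.99
# Largest h with P(Binomial(n, percentile) <= n - h - 1) >= confidence.
h = n - 1 - int(binom.ppf(confidence, n, percentile))
print(h)  # a few thousand, well above the 1 << 10 = 1024 cap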

ckstanton avatar Feb 09 '22 02:02 ckstanton

@psyhtest BTW, since the log shows the test met the min_query/min_duration criteria, I don't think there's any need for you to use the early stopping estimate. So I believe that, regardless of whether the bug is manifested, you can submit the results using the normal measurement (just avoid using the early stopping estimate). It may need a word in the WG to confirm this.

nv-jinhosuh avatar Feb 09 '22 14:02 nv-jinhosuh

@nv-jinhosuh That's my feeling too. But IIRC we decided that the early estimate would be used as the metric regardless. And it is what the submission checker appears to do. Right, @ckstanton @pgmpablo157321?

psyhtest avatar Feb 09 '22 18:02 psyhtest

@nv-jinhosuh That's my feeling too. But IIRC we decided that the early estimate would be used as the metric regardless. And it is what the submission checker appears to do. Right, @ckstanton @pgmpablo157321?

Correct... the early stopping estimate is used as the metric, regardless of how many queries are processed.

ckstanton avatar Feb 09 '22 19:02 ckstanton

outdated

mrasquinha-g avatar May 23 '23 11:05 mrasquinha-g