inference
Setting `min_query_count` for GPTJ
Running GPTJ can be quite demanding even on accelerated systems, as the Server latency constraint of 20 seconds suggests. For systems close to this threshold, meeting the minimum run duration of 10 minutes would only require processing around 30 samples.
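As a rough sanity check (the ~20 s per sample and 10-minute figures are the ones quoted above), the number of samples needed to satisfy the minimum duration can be estimated as:

```python
import math

# Assumed figures from the discussion above: roughly 20 s per sample on a
# system close to the Server latency constraint, and a 10-minute (600 s)
# minimum run duration.
seconds_per_sample = 20
min_duration_s = 10 * 60

# Samples needed so that the total runtime covers the minimum duration.
min_samples = math.ceil(min_duration_s / seconds_per_sample)
print(min_samples)  # -> 30
```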
However, when trying to set `min_query_count` in `user.conf` (or indeed in `mlperf.conf` proper), e.g.:
```
gptj.SingleStream.min_query_count = 100
gptj.SingleStream.max_query_count = 100
gptj.SingleStream.performance_sample_count_override = 13368
gptj.SingleStream.target_latency = 19000
```
I still see in `mlperf_log_summary.txt`:
```
min_query_count : 13368
max_query_count : 100
```
with the following experiment summary:
```
================================================
MLPerf Results Summary
================================================
SUT name : KILT_SERVER
Scenario : SingleStream
Mode     : PerformanceOnly
90th percentile latency (ns) : xxxxxxxxxx
Result is : INVALID
  Min duration satisfied : Yes
  Min queries satisfied : NO
  Early stopping satisfied: Yes
Recommendations:
 * The test exited early, before enough queries were issued.
   See the detailed log for why this may have occurred.
Early Stopping Result:
 * Processed at least 64 queries (100).
 * Would discard 2 highest latency queries.
 * Early stopping 90th percentile estimate: yyyyyyyyyy
 * Not enough queries processed for 99th percentile early stopping
   estimate (would need to process at least 662 total queries).
```
Is there any reason why LoadGen enforces this? I know we agreed that the minimum number of queries for Offline should cover the whole dataset, i.e. `min_query_count == performance_sample_count_override == 13368` for GPTJ. That may be OK for Offline and Server, but for GPTJ SingleStream at ~20 seconds per sample we would be looking at a run of over 3 days (and double that for a power run!)
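For reference, the 3-day figure above is a back-of-the-envelope estimate, assuming ~20 s per sample over the full 13368-sample dataset:

```python
# Hypothetical estimate: forcing min_query_count to cover the whole
# GPTJ dataset (13368 samples) at roughly 20 s per sample.
samples = 13368
seconds_per_sample = 20

total_s = samples * seconds_per_sample  # 267360 s
days = total_s / 86400                  # seconds per day

print(f"{days:.1f} days")      # -> 3.1 days for a performance run
print(f"{2 * days:.1f} days")  # -> 6.2 days including the power run
```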
@mrmhodak @pgmpablo157321