[Llama3-8b-edge] Singlestream performance mode samples

Open · attafosu opened this issue 5 months ago • 4 comments

@arjunsuresh @pgmpablo157321 While running the singlestream performance mode, I noticed that the number of samples being run is much higher than the 5k designated by the taskforce. I'm tracking the number of samples being received, and it has already gone beyond 10k.

The log details also show some unexpected values for the generated and effective min query counts (I only used the default mlperf.conf parameters):

:::MLLOG {"key": "qsl_reported_total_count", "value": 5000, "time_ms": 0.027256, "namespace": "mlperf::logging", "event_type": "POINT_IN_TIME", "metadata": {"is_error": false, "is_warning": false, "file": "loadgen.cc", "line_no": 1198, "pid": 3313545, "tid": 3313545}}
:::MLLOG {"key": "qsl_reported_performance_count", "value": 5000, "time_ms": 0.027256, "namespace": "mlperf::logging", "event_type": "POINT_IN_TIME", "metadata": {"is_error": false, "is_warning": false, "file": "loadgen.cc", "line_no": 1199, "pid": 3313545, "tid": 3313545}}
:::MLLOG {"key": "requested_min_query_count", "value": 100, "time_ms": 0.035443, "namespace": "mlperf::logging", "event_type": "POINT_IN_TIME", "metadata": {"is_error": false, "is_warning": false, "file": "test_settings_internal.cc", "line_no": 318, "pid": 3313545, "tid": 3313545}}
:::MLLOG {"key": "requested_max_query_count", "value": 0, "time_ms": 0.035443, "namespace": "mlperf::logging", "event_type": "POINT_IN_TIME", "metadata": {"is_error": false, "is_warning": false, "file": "test_settings_internal.cc", "line_no": 319, "pid": 3313545, "tid": 3313545}}
:::MLLOG {"key": "requested_performance_sample_count_override", "value": 5000, "time_ms": 0.035443, "namespace": "mlperf::logging", "event_type": "POINT_IN_TIME", "metadata": {"is_error": false, "is_warning": false, "file": "test_settings_internal.cc", "line_no": 337, "pid": 3313545, "tid": 3313545}}
:::MLLOG {"key": "effective_min_query_count", "value": 125000, "time_ms": 0.036073, "namespace": "mlperf::logging", "event_type": "POINT_IN_TIME", "metadata": {"is_error": false, "is_warning": false, "file": "test_settings_internal.cc", "line_no": 431, "pid": 3313545, "tid": 3313545}}
:::MLLOG {"key": "effective_max_query_count", "value": 0, "time_ms": 0.036073, "namespace": "mlperf::logging", "event_type": "POINT_IN_TIME", "metadata": {"is_error": false, "is_warning": false, "file": "test_settings_internal.cc", "line_no": 432, "pid": 3313545, "tid": 3313545}}
:::MLLOG {"key": "effective_min_sample_count", "value": 125000, "time_ms": 0.036073, "namespace": "mlperf::logging", "event_type": "POINT_IN_TIME", "metadata": {"is_error": false, "is_warning": false, "file": "test_settings_internal.cc", "line_no": 433, "pid": 3313545, "tid": 3313545}}
:::MLLOG {"key": "effective_performance_sample_count", "value": 5000, "time_ms": 0.036073, "namespace": "mlperf::logging", "event_type": "POINT_IN_TIME", "metadata": {"is_error": false, "is_warning": false, "file": "test_settings_internal.cc", "line_no": 451, "pid": 3313545, "tid": 3313545}}
:::MLLOG {"key": "generated_query_count", "value": 125000, "time_ms": 17.024346, "namespace": "mlperf::logging", "event_type": "POINT_IN_TIME", "metadata": {"is_error": false, "is_warning": false, "file": "loadgen.cc", "line_no": 427, "pid": 3313545, "tid": 3313545}}

This is unexpected behavior. Below are the latest processed-query counts I've tracked so far:

Processed: 10100 queries
Processed: 10150 queries
Processed: 10200 queries
Processed: 10250 queries
Processed: 10300 queries
Processed: 10350 queries
Processed: 10400 queries
Processed: 10450 queries
Processed: 10500 queries
Processed: 10550 queries

attafosu avatar Jul 24 '25 16:07 attafosu

By default, loadgen never stops at the dataset size for performance runs. We have to use the max_duration or target_latency values to control the run length. The default mlperf.conf has 10 milliseconds as the target latency, so to cover the 660s minimum duration it will need 660000 samples.
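For anyone who wants to see where those numbers come from, here is a minimal sketch using the loadgen Python bindings (the concrete values are only illustrative defaults, not the exact settings the reference harness passes):

import mlperf_loadgen as lg

settings = lg.TestSettings()
settings.scenario = lg.TestScenario.SingleStream
settings.mode = lg.TestMode.PerformanceOnly

# Loadgen keeps issuing queries until both the minimum duration and the
# minimum query count are satisfied, so the 5k dataset size alone never
# ends a SingleStream performance run.
settings.min_duration_ms = 600000        # illustrative minimum duration
settings.min_query_count = 100           # matches "requested_min_query_count" in the log

# The target latency (single_stream_expected_latency_ns in the bindings) is
# what loadgen uses to derive the effective minimum query count, which is
# why the detail log reports a value far above the 5k performance samples.
settings.single_stream_expected_latency_ns = 10 * 1000 * 1000   # 10 ms

# For a bounded local experiment the run can also be capped explicitly
# (not something to rely on for a compliant submission):
settings.max_duration_ms = 660000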

arjunsuresh avatar Jul 24 '25 16:07 arjunsuresh

Thanks @arjunsuresh. After specifying the target_latency, the counts are all 5k now. Since this may not be obvious to first-time users, should it be stated in the instructions for the singlestream scenario?
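For context, the override I mean looks roughly like this in user.conf (the wildcard prefix and the value are placeholders; check the reference README for the exact key and a sensible latency):

# user.conf -- illustrative only
*.SingleStream.target_latency = <expected per-sample latency in ms>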

attafosu avatar Jul 24 '25 19:07 attafosu

With the latest changes for llama3-8b-edge, our runs are getting stuck at 477 samples for the singlestream. Not sure why.

Adding requests: 100%|██████████| 1/1 [00:00<00:00, 4228.13it/s]
Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.34s/it, est. speed input: 206.01 toks/s, output: 95.89 toks/s]
INFO:Llama-8B-SUT:Samples run: 466
INFO:Llama-8B-SUT:      BatchMaker time: 2.86102294921875e-06
INFO:Llama-8B-SUT:      Inference time: 1.3355934619903564
INFO:Llama-8B-SUT:      Postprocess time: 3.8623809814453125e-05
INFO:Llama-8B-SUT:      ==== Total time: 1.3356349468231201
INFO:Llama-8B-SUT:IssueQuery started with 1 samples
INFO:Llama-8B-SUT:IssueQuery done
Adding requests: 100%|██████████| 1/1 [00:00<00:00, 3711.77it/s]
Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.46s/it, est. speed input: 877.52 toks/s, output: 87.48 toks/s]
INFO:Llama-8B-SUT:Samples run: 467
INFO:Llama-8B-SUT:      BatchMaker time: 2.1457672119140625e-06
INFO:Llama-8B-SUT:      Inference time: 1.4645195007324219
INFO:Llama-8B-SUT:      Postprocess time: 9.846687316894531e-05
INFO:Llama-8B-SUT:      ==== Total time: 1.4646201133728027

arjunsuresh avatar Jul 30 '25 19:07 arjunsuresh

@arjunsuresh I was able to finish a run last week without issue. Is this from the reference implementation?

attafosu avatar Jul 30 '25 21:07 attafosu