What to use in the performance model
(contents copied from https://github.com/mlcommons/mobile_app_open/pull/1040#issuecomment-3454731672)
Max input tokens: 1,024; max output tokens: 1,024/2,048.
Candidate (input tokens, output tokens) pairs: (1024, 2048) and (1024, 512).
Here, query == sample == prompt.
@farook-edev: please verify that, given a list of 10 samples, all 10 samples are used in a single round when randomization is enabled.
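One way to sanity-check the property being asked about (a minimal sketch only; loadgen uses its own seeded RNG in C++, so this just illustrates that a randomized round is a permutation, not a resampling):

```python
import random

def one_round_indices(num_samples: int, seed: int) -> list[int]:
    """Return a randomized visiting order over all sample indices for one round."""
    indices = list(range(num_samples))
    rng = random.Random(seed)
    rng.shuffle(indices)  # a permutation: every index appears exactly once
    return indices

order = one_round_indices(10, seed=42)
# A shuffled round still covers every sample exactly once.
assert sorted(order) == list(range(10))
```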
The only scenario I could find where not all samples are used is when max_duration_ms is specified: on early termination the run is marked INVALID, unless at least 64 queries have been processed. See below:
```
================================================
MLPerf Results Summary
================================================
SUT name : TFLite
Scenario : SingleStream
Mode : PerformanceOnly
90.0th percentile latency (ns) : 1670079764
90.0th first token percentile latency (ns) : 1626379315
Result is : INVALID
Min duration satisfied : Yes
Min queries satisfied : Skipped
Early stopping satisfied: NO
Recommendations:
* The test exited early, before enough queries were issued.
See the detailed log for why this may have occurred.
TTFT Early Stopping Result:
TPOT Early Stopping Result:
* Only processed 2 queries.
* Need to process at least 64 queries for early stopping.
================================================
Additional Stats
================================================
QPS w/ loadgen overhead : 0.96
QPS w/o loadgen overhead : 0.74
Min latency (ns) : 1042107776
Max latency (ns) : 1670079764
Mean latency (ns) : 1356093770
50.00 percentile latency (ns) : 1670079764
90.00 percentile latency (ns) : 1670079764
95.00 percentile latency (ns) : 1670079764
97.00 percentile latency (ns) : 1670079764
99.00 percentile latency (ns) : 1670079764
99.90 percentile latency (ns) : 1670079764
TPS w/ loadgen overhead : 3.84
TPS w/o loadgen overhead : 1.47
Min First Token latency (ns) : 996649161
Max First Token latency (ns) : 1626379315
Mean First Token latency (ns) : 1311514238
50.00 percentile first token latency (ns) : 1626379315
90.00 percentile first token latency (ns) : 1626379315
95.00 percentile first token latency (ns) : 1626379315
97.00 percentile first token latency (ns) : 1626379315
99.00 percentile first token latency (ns) : 1626379315
99.90 percentile first token latency (ns) : 1626379315
Min Time per Output Token (ns) : 43700449
Max Time per Output Token (ns) : 45458615
Mean Time per Output Token (ns) : 44579532
50.00 percentile time to output token (ns) : 45458615
90.00 percentile time to output token (ns) : 45458615
95.00 percentile time to output token (ns) : 45458615
97.00 percentile time to output token (ns) : 45458615
99.00 percentile time to output token (ns) : 45458615
99.90 percentile time to output token (ns) : 45458615
================================================
Test Parameters Used
================================================
samples_per_query : 1
target_qps : 1000
ttft_latency (ns): 100000000
tpot_latency (ns): 100000000
max_async_queries : 1
min_duration (ms): 100
max_duration (ms): 500
min_query_count : 100
max_query_count : 0
qsl_rng_seed : 3066443479025735752
sample_index_rng_seed : 10688027786191513374
schedule_rng_seed : 14962580496156340209
accuracy_log_rng_seed : 0
accuracy_log_probability : 0
accuracy_log_sampling_target : 0
print_timestamps : 0
performance_issue_unique : 0
performance_issue_same : 0
performance_issue_same_index : 0
performance_sample_count : 100
```
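The validity rule this log exhibits can be sketched as follows (an illustrative helper, not loadgen's actual implementation, which lives in C++; the 64-query minimum and the duration bounds are taken from the log above):

```python
# Hypothetical helper mirroring the validity rule seen in the log above.
EARLY_STOPPING_MIN_QUERIES = 64  # "Need to process at least 64 queries"

def run_is_valid(queries_processed: int,
                 duration_ms: float,
                 min_duration_ms: float,
                 max_duration_ms: float) -> bool:
    if duration_ms < min_duration_ms:
        return False  # min duration not satisfied
    if max_duration_ms and duration_ms >= max_duration_ms:
        # Terminated early on max_duration: only valid if enough
        # queries were processed for early stopping to apply.
        return queries_processed >= EARLY_STOPPING_MIN_QUERIES
    return True

# The run in the log: 2 queries, stopped at the 500 ms cap -> INVALID.
assert run_is_valid(2, 500, 100, 500) is False
assert run_is_valid(64, 500, 100, 500) is True
```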
I'm not sure where this 64-query figure comes from, but I suspect it should be adjustable.
I should mention that if this minimum isn't satisfied (e.g. the dataset doesn't provide enough samples), some samples may be reused. The exception is when loadgen's performance_issue_unique flag is set: then at most the available samples are used, and the run will most likely be marked INVALID for having too few queries.
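The reuse behavior can be sketched like this (illustrative only; the function name is hypothetical and not loadgen's API, but unique_only mirrors the spirit of the performance_issue_unique flag):

```python
import itertools

def build_query_indices(dataset_size: int,
                        required_queries: int,
                        unique_only: bool) -> list[int]:
    """Pick sample indices for a run (sketch of the reuse behavior)."""
    if unique_only:
        # At most one use per sample; the run may then fall short of the
        # required query count and be marked INVALID.
        return list(range(min(dataset_size, required_queries)))
    # Otherwise cycle through the dataset, reusing samples as needed.
    return list(itertools.islice(itertools.cycle(range(dataset_size)),
                                 required_queries))

assert len(build_query_indices(10, 100, unique_only=False)) == 100
assert len(build_query_indices(10, 100, unique_only=True)) == 10
```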
Due to the small input/output sizes of IFEval and zero-shot Tiny-MMLU, I suggest we use few-shot Tiny-MMLU instead, for both performance and accuracy. Here are the few-shot Tiny-MMLU input/output sizes for float Llama 3.1 8B, with a max output size of 1024 and a max input size of 2048:
| Index | Prompt Length (tokens) | Generation Length (tokens) |
|---|---|---|
| 0 | 675 | 138 |
| 1 | 674 | 137 |
| 2 | 1664 | 714 |
| 3 | 1616 | 309 |
| 4 | 1726 | 274 |
| 5 | 358 | 3072 |
| 6 | 351 | 57 |
| 7 | 372 | 190 |
| 8 | 348 | 73 |
| 9 | 638 | 141 |
| 10 | 621 | 86 |
| 11 | 428 | 45 |
| 12 | 471 | 53 |
| 13 | 460 | 85 |
| 14 | 849 | 759 |
| 15 | 930 | 373 |
| 16 | 609 | 3072 |
| 17 | 451 | 55 |
| 18 | 367 | 137 |
| 19 | 344 | 88 |
| 20 | 313 | 71 |
| 21 | 492 | 3072 |
| 22 | 581 | 10 |
| 23 | 572 | 315 |
| 24 | 559 | 82 |
| 25 | 599 | 348 |
| 26 | 584 | 859 |
| 27 | 551 | 152 |
| 28 | 491 | 25 |
| 29 | 529 | 322 |
| 30 | 573 | 409 |
| 31 | 2954 | 2445 |
| 32 | 383 | 21 |
| 33 | 380 | 115 |
| 34 | 481 | 99 |
| 35 | 495 | 38 |
| 36 | 374 | 136 |
| 37 | 553 | 195 |
| 38 | 597 | 816 |
| 39 | 576 | 3072 |
| 40 | 540 | 3072 |
| 41 | 566 | 1240 |
| 42 | 517 | 28 |
| 43 | 497 | 71 |
| 44 | 864 | 799 |
| 45 | 759 | 235 |
| 46 | 1325 | 86 |
| 47 | 1334 | 111 |
| 48 | 342 | 30 |
| 49 | 347 | 87 |
| 50 | 336 | 18 |
| 51 | 672 | 288 |
| 52 | 679 | 225 |
| 53 | 660 | 115 |
| 54 | 387 | 22 |
| 55 | 433 | 154 |
| 56 | 421 | 43 |
| 57 | 344 | 42 |
| 58 | 324 | 31 |
| 59 | 294 | 7 |
| 60 | 331 | 27 |
| 61 | 512 | 119 |
| 62 | 673 | 20 |
| 63 | 671 | 15 |
| 64 | 668 | 21 |
| 65 | 678 | 17 |
| 66 | 665 | 48 |
| 67 | 639 | 89 |
| 68 | 648 | 130 |
| 69 | 620 | 674 |
| 70 | 584 | 67 |
| 71 | 329 | 57 |
| 72 | 329 | 28 |
| 73 | 342 | 55 |
| 74 | 548 | 22 |
| 75 | 573 | 129 |
| 76 | 545 | 67 |
| 77 | 565 | 132 |
| 78 | 512 | 13 |
| 79 | 731 | 301 |
| 80 | 632 | 109 |
| 81 | 1695 | 145 |
| 82 | 1592 | 205 |
| 83 | 1474 | 134 |
| 84 | 1835 | 181 |
| 85 | 1769 | 278 |
| 86 | 1200 | 701 |
| 87 | 1062 | 1092 |
| 88 | 1064 | 1116 |
| 89 | 1148 | 298 |
| 90 | 1029 | 1089 |
| 91 | 570 | 77 |
| 92 | 575 | 499 |
| 93 | 608 | 120 |
| 94 | 400 | 6 |
| 95 | 1280 | 88 |
| 96 | 474 | 98 |
| 97 | 449 | 88 |
| 98 | 354 | 115 |
| 99 | 280 | 52 |
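To compare candidate output-token limits against the distribution above, a quick tally helps (a sketch; only a handful of rows from the table are inlined here for illustration):

```python
# gen_len values for a few rows of the table above (index: gen_len).
gen_lens = {0: 138, 5: 3072, 22: 10, 31: 2445, 59: 7, 87: 1092}

def prompts_over_limit(gen_lens: dict[int, int], limit: int) -> list[int]:
    """Indices whose generation would be cut off by the given limit."""
    return sorted(i for i, n in gen_lens.items() if n > limit)

assert prompts_over_limit(gen_lens, 1024) == [5, 31, 87]
assert prompts_over_limit(gen_lens, 128) == [0, 5, 31, 87]
```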
@farook-edev: please provide a JSON or TXT file with the 100 few-shot Tiny-MMLU prompts implemented in the app.
When using a template that elicits the first letter as the answer:
| Index | Prompt Length (tokens) | Generation Length (tokens) |
|---|---|---|
| 0 | 641 | 1024 |
| 1 | 640 | 1024 |
| 2 | 1630 | 1024 |
| 3 | 1582 | 1024 |
| 4 | 1692 | 1024 |
| 5 | 324 | 1024 |
| 6 | 317 | 1024 |
| 7 | 338 | 1024 |
| 8 | 314 | 1024 |
| 9 | 604 | 1024 |
| 10 | 587 | 1024 |
| 11 | 394 | 1024 |
| 12 | 437 | 1024 |
| 13 | 426 | 1024 |
| 14 | 815 | 1024 |
| 15 | 896 | 1024 |
| 16 | 575 | 1024 |
| 17 | 417 | 1024 |
| 18 | 333 | 1024 |
| 19 | 310 | 1024 |
| 20 | 279 | 1024 |
| 21 | 458 | 1024 |
| 22 | 547 | 1024 |
| 23 | 538 | 1024 |
| 24 | 525 | 1024 |
| 25 | 565 | 1024 |
| 26 | 550 | 1024 |
| 27 | 517 | 1024 |
| 28 | 457 | 1024 |
| 29 | 495 | 1024 |
| 30 | 539 | 1024 |
| 31 | 2920 | 1024 |
| 32 | 349 | 1024 |
| 33 | 346 | 1024 |
| 34 | 447 | 1024 |
| 35 | 461 | 1024 |
| 36 | 340 | 1024 |
| 37 | 519 | 1024 |
| 38 | 563 | 1024 |
| 39 | 542 | 1024 |
| 40 | 506 | 1024 |
| 41 | 532 | 1024 |
| 42 | 483 | 1024 |
| 43 | 463 | 1024 |
| 44 | 830 | 1024 |
| 45 | 725 | 1024 |
| 46 | 1291 | 1024 |
| 47 | 1300 | 1024 |
| 48 | 308 | 1024 |
| 49 | 313 | 1024 |
| 50 | 302 | 1024 |
| 51 | 638 | 1024 |
| 52 | 645 | 1024 |
| 53 | 626 | 1024 |
| 54 | 353 | 1024 |
| 55 | 399 | 1024 |
| 56 | 387 | 1024 |
| 57 | 310 | 1024 |
| 58 | 290 | 1024 |
| 59 | 260 | 1024 |
| 60 | 297 | 1024 |
| 61 | 478 | 1024 |
| 62 | 639 | 1024 |
| 63 | 637 | 1024 |
| 64 | 634 | 1024 |
| 65 | 644 | 1024 |
| 66 | 631 | 1024 |
| 67 | 605 | 1024 |
| 68 | 614 | 1024 |
| 69 | 586 | 1024 |
| 70 | 550 | 1024 |
| 71 | 295 | 1024 |
| 72 | 295 | 1024 |
| 73 | 308 | 1024 |
| 74 | 514 | 1024 |
| 75 | 539 | 1024 |
| 76 | 511 | 1024 |
| 77 | 531 | 1024 |
| 78 | 478 | 1024 |
| 79 | 697 | 1024 |
| 80 | 598 | 1024 |
| 81 | 1661 | 1024 |
| 82 | 1558 | 1024 |
| 83 | 1440 | 1024 |
| 84 | 1801 | 1024 |
| 85 | 1735 | 1024 |
| 86 | 1166 | 1024 |
| 87 | 1028 | 1024 |
| 88 | 1030 | 1024 |
| 89 | 1114 | 1024 |
| 90 | 995 | 1024 |
| 91 | 536 | 1024 |
| 92 | 541 | 1024 |
| 93 | 574 | 1024 |
| 94 | 366 | 1024 |
| 95 | 1246 | 1024 |
| 96 | 440 | 1024 |
| 97 | 415 | 1024 |
| 98 | 320 | 1024 |
| 99 | 246 | 1024 |
@farook-edev: is it possible to truncate prompt 31 to 2048 tokens (from the start, i.e. remove one or two few-shot input examples)?
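Front-truncation can be sketched like this (illustrative only; the real fix would drop whole few-shot examples at the tokenizer level rather than cutting raw tokens mid-example):

```python
def truncate_front(tokens: list[int], max_len: int) -> list[int]:
    """Keep the last max_len tokens, dropping the oldest context first.

    Dropping from the front preserves the actual question at the end of
    the prompt; ideally whole few-shot examples are removed instead of
    cutting one in half.
    """
    if len(tokens) <= max_len:
        return tokens
    return tokens[-max_len:]

prompt = list(range(2920))           # e.g. prompt 31 (2920 tokens in the templated table)
truncated = truncate_front(prompt, 2048)
assert len(truncated) == 2048
assert truncated[-1] == prompt[-1]   # the tail (the question) is kept
```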
Loadgen's minimum sample count is 64 -> the app will show red if fewer than 64 queries are run, even with early stopping (5-minute limit).
Choices for the max output-token limit, for few-shot Tiny-MMLU:
- 1024 -> 3 prompts -> eliminate this one
- 128 -> 30 prompts (for performance)
- 64 -> 60 prompts
Collect enough statistics for both time to first token (input prompt) and tokens/sec (decode). @farook-edev: let us pick 128 max output tokens as the default value, and make it configurable from the LLM app's CLI.
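Making the default configurable could look like this (a sketch; the flag name is hypothetical and not an existing option of the app):

```python
import argparse

# Hypothetical CLI flag; the actual option name in the LLM app may differ.
parser = argparse.ArgumentParser(description="LLM benchmark runner (sketch)")
parser.add_argument("--max-output-tokens", type=int, default=128,
                    help="maximum tokens to generate per prompt")

args = parser.parse_args([])                               # default value
assert args.max_output_tokens == 128
args = parser.parse_args(["--max-output-tokens", "1024"])  # CLI override
assert args.max_output_tokens == 1024
```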