
What to use in the performance model

freedomtan opened this issue 3 months ago • 6 comments

Copying contents from https://github.com/mlcommons/mobile_app_open/pull/1040#issuecomment-3454731672:

freedomtan avatar Nov 11 '25 06:11 freedomtan

max input tokens: 1,024; max output tokens: 1,024/2,048

(input tokens, output tokens) = ((1024, 2048), (1024, 512))

query == sample == prompt, here

@farook-edev: please verify that, given a list of 10 samples, we can get all 10 samples in one round when randomization is used.
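A quick way to sanity-check this offline (a sketch only, not loadgen's actual sample-selection logic):

```python
import random

def shuffled_query_order(num_samples: int, num_queries: int, seed: int) -> list[int]:
    """Hypothetical sketch of a loadgen-style randomized order: shuffle the
    full index list with a seeded RNG, then cycle through it. This models
    sampling without replacement within one round."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i % num_samples] for i in range(num_queries)]

# With 10 samples and at least 10 queries in a round, every sample
# should appear at least once despite the randomization.
order = shuffled_query_order(num_samples=10, num_queries=10, seed=12345)
assert set(order) == set(range(10))
```

If loadgen instead sampled with replacement, this assertion could fail, which is exactly the property worth verifying.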

freedomtan avatar Nov 11 '25 07:11 freedomtan

> max input tokens: 1,024; max output tokens: 1,024/2,048
>
> (input tokens, output tokens) = ((1024, 2048), (1024, 512))
>
> query == sample == prompt, here
>
> @farook-edev: please verify that, given a list of 10 samples, we can get all 10 samples in one round when randomization is used.

The only scenario I could find where not all samples are used is when max_duration_ms is specified. In that case of early termination, the run will be marked INVALID unless at least 64 queries are processed; see below:

================================================
MLPerf Results Summary
================================================
SUT name : TFLite
Scenario : SingleStream
Mode     : PerformanceOnly
90.0th percentile latency (ns) : 1670079764
90.0th first token percentile latency (ns) : 1626379315
Result is : INVALID
  Min duration satisfied : Yes
 Min queries satisfied : Skipped
  Early stopping satisfied: NO
Recommendations:
 * The test exited early, before enough queries were issued.
   See the detailed log for why this may have occurred.
TTFT Early Stopping Result:

TPOT Early Stopping Result:
 * Only processed 2 queries.
 * Need to process at least 64 queries for early stopping.

================================================
Additional Stats
================================================
QPS w/ loadgen overhead         : 0.96
QPS w/o loadgen overhead        : 0.74

Min latency (ns)                : 1042107776
Max latency (ns)                : 1670079764
Mean latency (ns)               : 1356093770
50.00 percentile latency (ns)   : 1670079764
90.00 percentile latency (ns)   : 1670079764
95.00 percentile latency (ns)   : 1670079764
97.00 percentile latency (ns)   : 1670079764
99.00 percentile latency (ns)   : 1670079764
99.90 percentile latency (ns)   : 1670079764

TPS w/ loadgen overhead         : 3.84
TPS w/o loadgen overhead        : 1.47
Min First Token latency (ns)                : 996649161
Max First Token latency (ns)                : 1626379315
Mean First Token latency (ns)               : 1311514238
50.00 percentile first token latency (ns)   : 1626379315
90.00 percentile first token latency (ns)   : 1626379315
95.00 percentile first token latency (ns)   : 1626379315
97.00 percentile first token latency (ns)   : 1626379315
99.00 percentile first token latency (ns)   : 1626379315
99.90 percentile first token latency (ns)   : 1626379315

Min Time per Output Token (ns)                : 43700449
Max Time per Output Token (ns)                : 45458615
Mean Time per Output Token (ns)               : 44579532
50.00 percentile time to output token (ns)   : 45458615
90.00 percentile time to output token (ns)   : 45458615
95.00 percentile time to output token (ns)   : 45458615
97.00 percentile time to output token (ns)   : 45458615
99.00 percentile time to output token (ns)   : 45458615
99.90 percentile time to output token (ns)   : 45458615

================================================
Test Parameters Used
================================================
samples_per_query : 1
target_qps : 1000
ttft_latency (ns): 100000000
tpot_latency (ns): 100000000
max_async_queries : 1
min_duration (ms): 100
max_duration (ms): 500
min_query_count : 100
max_query_count : 0
qsl_rng_seed : 3066443479025735752
sample_index_rng_seed : 10688027786191513374
schedule_rng_seed : 14962580496156340209
accuracy_log_rng_seed : 0
accuracy_log_probability : 0
accuracy_log_sampling_target : 0
print_timestamps : 0
performance_issue_unique : 0
performance_issue_same : 0
performance_issue_same_index : 0
performance_sample_count : 100

I'm not sure where this 64-query figure comes from, but I suspect it should be adjustable.

I should mention that if this value isn't satisfied (i.e., the dataset doesn't provide enough samples), some samples may be reused. The exception is when loadgen's performance_issue_unique flag is set: then at most the available samples are used, and the run will most likely be marked invalid for having too few queries.

farook-edev avatar Nov 18 '25 01:11 farook-edev

Due to the small input/output sizes of IFEval and zero-shot Tiny-MMLU, I suggest we use few-shot Tiny-MMLU instead, for both performance and accuracy. Here are the few-shot Tiny-MMLU input/output sizes for float Llama 3.1 8B, with a max output size of 1024 and a max input size of 2048:

idx prompt_len gen_len
0 675 138
1 674 137
2 1664 714
3 1616 309
4 1726 274
5 358 3072
6 351 57
7 372 190
8 348 73
9 638 141
10 621 86
11 428 45
12 471 53
13 460 85
14 849 759
15 930 373
16 609 3072
17 451 55
18 367 137
19 344 88
20 313 71
21 492 3072
22 581 10
23 572 315
24 559 82
25 599 348
26 584 859
27 551 152
28 491 25
29 529 322
30 573 409
31 2954 2445
32 383 21
33 380 115
34 481 99
35 495 38
36 374 136
37 553 195
38 597 816
39 576 3072
40 540 3072
41 566 1240
42 517 28
43 497 71
44 864 799
45 759 235
46 1325 86
47 1334 111
48 342 30
49 347 87
50 336 18
51 672 288
52 679 225
53 660 115
54 387 22
55 433 154
56 421 43
57 344 42
58 324 31
59 294 7
60 331 27
61 512 119
62 673 20
63 671 15
64 668 21
65 678 17
66 665 48
67 639 89
68 648 130
69 620 674
70 584 67
71 329 57
72 329 28
73 342 55
74 548 22
75 573 129
76 545 67
77 565 132
78 512 13
79 731 301
80 632 109
81 1695 145
82 1592 205
83 1474 134
84 1835 181
85 1769 278
86 1200 701
87 1062 1092
88 1064 1116
89 1148 298
90 1029 1089
91 570 77
92 575 499
93 608 120
94 400 6
95 1280 88
96 474 98
97 449 88
98 354 115
99 280 52
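Counts like "how many prompts finish under a given output cap" can be derived directly from this table. A small helper (hypothetical, shown here on a four-row excerpt) that parses rows of `idx prompt_len gen_len` and tallies prompts whose generation fits under each candidate cap:

```python
def summarize(table: str, caps=(64, 128, 1024)):
    """Parse rows of 'idx prompt_len gen_len' and count, for each output
    cap, how many prompts finish naturally (gen_len <= cap)."""
    rows = [tuple(map(int, line.split())) for line in table.strip().splitlines()]
    gen_lens = [gen_len for _, _, gen_len in rows]
    return {cap: sum(g <= cap for g in gen_lens) for cap in caps}

# Four rows excerpted from the table above:
sample = """\
0 675 138
5 358 3072
22 581 10
59 294 7
"""
print(summarize(sample))  # {64: 2, 128: 2, 1024: 3}
```

Running it over the full 100-row table would give the per-cap prompt counts discussed later in this thread.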

Mostelk avatar Jan 13 '26 06:01 Mostelk

@farook-edev: please provide a JSON or txt file of the 100 few-shot Tiny-MMLU prompts implemented in the app.

Mostelk avatar Jan 14 '26 18:01 Mostelk

When using a template that elicits the first letter as the answer:

Index Prompt Length Generation Length
0 641 1024
1 640 1024
2 1630 1024
3 1582 1024
4 1692 1024
5 324 1024
6 317 1024
7 338 1024
8 314 1024
9 604 1024
10 587 1024
11 394 1024
12 437 1024
13 426 1024
14 815 1024
15 896 1024
16 575 1024
17 417 1024
18 333 1024
19 310 1024
20 279 1024
21 458 1024
22 547 1024
23 538 1024
24 525 1024
25 565 1024
26 550 1024
27 517 1024
28 457 1024
29 495 1024
30 539 1024
31 2920 1024
32 349 1024
33 346 1024
34 447 1024
35 461 1024
36 340 1024
37 519 1024
38 563 1024
39 542 1024
40 506 1024
41 532 1024
42 483 1024
43 463 1024
44 830 1024
45 725 1024
46 1291 1024
47 1300 1024
48 308 1024
49 313 1024
50 302 1024
51 638 1024
52 645 1024
53 626 1024
54 353 1024
55 399 1024
56 387 1024
57 310 1024
58 290 1024
59 260 1024
60 297 1024
61 478 1024
62 639 1024
63 637 1024
64 634 1024
65 644 1024
66 631 1024
67 605 1024
68 614 1024
69 586 1024
70 550 1024
71 295 1024
72 295 1024
73 308 1024
74 514 1024
75 539 1024
76 511 1024
77 531 1024
78 478 1024
79 697 1024
80 598 1024
81 1661 1024
82 1558 1024
83 1440 1024
84 1801 1024
85 1735 1024
86 1166 1024
87 1028 1024
88 1030 1024
89 1114 1024
90 995 1024
91 536 1024
92 541 1024
93 574 1024
94 366 1024
95 1246 1024
96 440 1024
97 415 1024
98 320 1024
99 246 1024

Mostelk avatar Jan 14 '26 23:01 Mostelk

@farook-edev: is it possible to truncate prompt 31 to 2048 (from the start, removing one or two input examples)?
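One way this could be done (a sketch under two assumptions: few-shot examples are separated by blank lines, and `count_tokens` is the tokenizer's length function):

```python
def truncate_few_shot(prompt: str, max_tokens: int, count_tokens) -> str:
    """Drop few-shot examples from the START of the prompt until it fits
    within max_tokens, keeping the final question block intact."""
    blocks = prompt.split("\n\n")
    while len(blocks) > 1 and count_tokens("\n\n".join(blocks)) > max_tokens:
        blocks.pop(0)  # remove the oldest in-context example first
    return "\n\n".join(blocks)

# Toy check with a whitespace "tokenizer" standing in for the real one:
toy_count = lambda s: len(s.split())
p = "ex1 a b\n\nex2 c d\n\nquestion e f"
assert truncate_few_shot(p, 7, toy_count) == "ex2 c d\n\nquestion e f"
```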

Mostelk avatar Jan 14 '26 23:01 Mostelk

Loadgen's minimum sample count is 64 -> the app will show a red result if you run fewer than 64 queries, even with early stopping (5-minute limit).

Choices for the max output limit, for few-shot Tiny-MMLU: 1024 -> 3 prompts -> eliminate this one; for performance, 128 -> 30 prompts and 64 -> 60 prompts.

We should collect enough statistics for both time to first token (input prompt) and tokens/sec (decode). @farook-edev: let us pick 128 max output tokens as the default value and make it configurable from the LLM app's CLI.
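A minimal sketch of such a CLI flag (the flag name and default are proposals from this thread, not the app's current interface):

```python
import argparse

# Hypothetical CLI sketch for the LLM app; 128 is the proposed default.
parser = argparse.ArgumentParser(description="LLM benchmark runner (sketch)")
parser.add_argument("--max-output-tokens", type=int, default=128,
                    help="Cap on generated tokens per query (default: 128)")

# Overriding the cap from the command line:
args = parser.parse_args(["--max-output-tokens", "64"])
assert args.max_output_tokens == 64

# With no flag given, the proposed default of 128 applies:
assert parser.parse_args([]).max_output_tokens == 128
```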

Mostelk avatar Feb 04 '26 23:02 Mostelk