What to use in the performance model
(contents copied from https://github.com/mlcommons/mobile_app_open/pull/1040#issuecomment-3454731672)
Max input tokens: 1,024; max output tokens: 1,024/2,048.
Candidate (input tokens, output tokens) pairs: (1024, 2048) and (1024, 512).
Here, query == sample == prompt.
@farook-edev: please verify that, given a list of 10 samples, all 10 samples are used in a single round when randomization is enabled.
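One way to sanity-check the property being asked about (a minimal sketch only; loadgen uses its own seeded RNG in C++, so this just illustrates that a randomized round is a permutation, not a resampling):

```python
import random

def one_round_indices(num_samples: int, seed: int) -> list[int]:
    """Return a randomized visiting order over all sample indices for one round."""
    indices = list(range(num_samples))
    rng = random.Random(seed)
    rng.shuffle(indices)  # a permutation: every index appears exactly once
    return indices

order = one_round_indices(10, seed=42)
# A shuffled round still covers every sample exactly once.
assert sorted(order) == list(range(10))
```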
The only scenario I could find where not all samples are used is when max_duration_ms is specified: on early termination the run is marked INVALID, unless at least 64 queries have been processed. See below:
```
================================================
MLPerf Results Summary
================================================
SUT name : TFLite
Scenario : SingleStream
Mode : PerformanceOnly
90.0th percentile latency (ns) : 1670079764
90.0th first token percentile latency (ns) : 1626379315
Result is : INVALID
Min duration satisfied : Yes
Min queries satisfied : Skipped
Early stopping satisfied: NO
Recommendations:
* The test exited early, before enough queries were issued.
See the detailed log for why this may have occurred.
TTFT Early Stopping Result:
TPOT Early Stopping Result:
* Only processed 2 queries.
* Need to process at least 64 queries for early stopping.
================================================
Additional Stats
================================================
QPS w/ loadgen overhead : 0.96
QPS w/o loadgen overhead : 0.74
Min latency (ns) : 1042107776
Max latency (ns) : 1670079764
Mean latency (ns) : 1356093770
50.00 percentile latency (ns) : 1670079764
90.00 percentile latency (ns) : 1670079764
95.00 percentile latency (ns) : 1670079764
97.00 percentile latency (ns) : 1670079764
99.00 percentile latency (ns) : 1670079764
99.90 percentile latency (ns) : 1670079764
TPS w/ loadgen overhead : 3.84
TPS w/o loadgen overhead : 1.47
Min First Token latency (ns) : 996649161
Max First Token latency (ns) : 1626379315
Mean First Token latency (ns) : 1311514238
50.00 percentile first token latency (ns) : 1626379315
90.00 percentile first token latency (ns) : 1626379315
95.00 percentile first token latency (ns) : 1626379315
97.00 percentile first token latency (ns) : 1626379315
99.00 percentile first token latency (ns) : 1626379315
99.90 percentile first token latency (ns) : 1626379315
Min Time per Output Token (ns) : 43700449
Max Time per Output Token (ns) : 45458615
Mean Time per Output Token (ns) : 44579532
50.00 percentile time to output token (ns) : 45458615
90.00 percentile time to output token (ns) : 45458615
95.00 percentile time to output token (ns) : 45458615
97.00 percentile time to output token (ns) : 45458615
99.00 percentile time to output token (ns) : 45458615
99.90 percentile time to output token (ns) : 45458615
================================================
Test Parameters Used
================================================
samples_per_query : 1
target_qps : 1000
ttft_latency (ns): 100000000
tpot_latency (ns): 100000000
max_async_queries : 1
min_duration (ms): 100
max_duration (ms): 500
min_query_count : 100
max_query_count : 0
qsl_rng_seed : 3066443479025735752
sample_index_rng_seed : 10688027786191513374
schedule_rng_seed : 14962580496156340209
accuracy_log_rng_seed : 0
accuracy_log_probability : 0
accuracy_log_sampling_target : 0
print_timestamps : 0
performance_issue_unique : 0
performance_issue_same : 0
performance_issue_same_index : 0
performance_sample_count : 100
```
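The validity rule this log exhibits can be sketched as follows (an illustrative helper, not loadgen's actual implementation, which lives in C++; the 64-query minimum and the duration bounds are taken from the log above):

```python
# Hypothetical helper mirroring the validity rule seen in the log above.
EARLY_STOPPING_MIN_QUERIES = 64  # "Need to process at least 64 queries"

def run_is_valid(queries_processed: int,
                 duration_ms: float,
                 min_duration_ms: float,
                 max_duration_ms: float) -> bool:
    if duration_ms < min_duration_ms:
        return False  # min duration not satisfied
    if max_duration_ms and duration_ms >= max_duration_ms:
        # Terminated early on max_duration: only valid if enough
        # queries were processed for early stopping to apply.
        return queries_processed >= EARLY_STOPPING_MIN_QUERIES
    return True

# The run in the log: 2 queries, stopped at the 500 ms cap -> INVALID.
assert run_is_valid(2, 500, 100, 500) is False
assert run_is_valid(64, 500, 100, 500) is True
```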
I'm not sure where this 64-query figure comes from, but I suspect it should be adjustable.
I should mention that if this minimum isn't satisfied (e.g. the dataset doesn't provide enough samples), some samples may be reused. The exception is when loadgen's performance_issue_unique flag is set: then at most the available samples are used, and the run will most likely be marked INVALID for having too few queries.
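The reuse behavior can be sketched like this (illustrative only; the function name is hypothetical and not loadgen's API, but unique_only mirrors the spirit of the performance_issue_unique flag):

```python
import itertools

def build_query_indices(dataset_size: int,
                        required_queries: int,
                        unique_only: bool) -> list[int]:
    """Pick sample indices for a run (sketch of the reuse behavior)."""
    if unique_only:
        # At most one use per sample; the run may then fall short of the
        # required query count and be marked INVALID.
        return list(range(min(dataset_size, required_queries)))
    # Otherwise cycle through the dataset, reusing samples as needed.
    return list(itertools.islice(itertools.cycle(range(dataset_size)),
                                 required_queries))

assert len(build_query_indices(10, 100, unique_only=False)) == 100
assert len(build_query_indices(10, 100, unique_only=True)) == 10
```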
Due to the small input/output sizes of IFEval and zero-shot Tiny-MMLU, I suggest we use few-shot Tiny-MMLU instead, for both performance and accuracy. Here are the few-shot Tiny-MMLU input/output sizes for float Llama 3.1 8B, with a max output size of 1024 and a max input size of 2048:
| Index | Prompt Length (tokens) | Generation Length (tokens) |
|---|---|---|
| 0 | 675 | 138 |
| 1 | 674 | 137 |
| 2 | 1664 | 714 |
| 3 | 1616 | 309 |
| 4 | 1726 | 274 |
| 5 | 358 | 3072 |
| 6 | 351 | 57 |
| 7 | 372 | 190 |
| 8 | 348 | 73 |
| 9 | 638 | 141 |
| 10 | 621 | 86 |
| 11 | 428 | 45 |
| 12 | 471 | 53 |
| 13 | 460 | 85 |
| 14 | 849 | 759 |
| 15 | 930 | 373 |
| 16 | 609 | 3072 |
| 17 | 451 | 55 |
| 18 | 367 | 137 |
| 19 | 344 | 88 |
| 20 | 313 | 71 |
| 21 | 492 | 3072 |
| 22 | 581 | 10 |
| 23 | 572 | 315 |
| 24 | 559 | 82 |
| 25 | 599 | 348 |
| 26 | 584 | 859 |
| 27 | 551 | 152 |
| 28 | 491 | 25 |
| 29 | 529 | 322 |
| 30 | 573 | 409 |
| 31 | 2954 | 2445 |
| 32 | 383 | 21 |
| 33 | 380 | 115 |
| 34 | 481 | 99 |
| 35 | 495 | 38 |
| 36 | 374 | 136 |
| 37 | 553 | 195 |
| 38 | 597 | 816 |
| 39 | 576 | 3072 |
| 40 | 540 | 3072 |
| 41 | 566 | 1240 |
| 42 | 517 | 28 |
| 43 | 497 | 71 |
| 44 | 864 | 799 |
| 45 | 759 | 235 |
| 46 | 1325 | 86 |
| 47 | 1334 | 111 |
| 48 | 342 | 30 |
| 49 | 347 | 87 |
| 50 | 336 | 18 |
| 51 | 672 | 288 |
| 52 | 679 | 225 |
| 53 | 660 | 115 |
| 54 | 387 | 22 |
| 55 | 433 | 154 |
| 56 | 421 | 43 |
| 57 | 344 | 42 |
| 58 | 324 | 31 |
| 59 | 294 | 7 |
| 60 | 331 | 27 |
| 61 | 512 | 119 |
| 62 | 673 | 20 |
| 63 | 671 | 15 |
| 64 | 668 | 21 |
| 65 | 678 | 17 |
| 66 | 665 | 48 |
| 67 | 639 | 89 |
| 68 | 648 | 130 |
| 69 | 620 | 674 |
| 70 | 584 | 67 |
| 71 | 329 | 57 |
| 72 | 329 | 28 |
| 73 | 342 | 55 |
| 74 | 548 | 22 |
| 75 | 573 | 129 |
| 76 | 545 | 67 |
| 77 | 565 | 132 |
| 78 | 512 | 13 |
| 79 | 731 | 301 |
| 80 | 632 | 109 |
| 81 | 1695 | 145 |
| 82 | 1592 | 205 |
| 83 | 1474 | 134 |
| 84 | 1835 | 181 |
| 85 | 1769 | 278 |
| 86 | 1200 | 701 |
| 87 | 1062 | 1092 |
| 88 | 1064 | 1116 |
| 89 | 1148 | 298 |
| 90 | 1029 | 1089 |
| 91 | 570 | 77 |
| 92 | 575 | 499 |
| 93 | 608 | 120 |
| 94 | 400 | 6 |
| 95 | 1280 | 88 |
| 96 | 474 | 98 |
| 97 | 449 | 88 |
| 98 | 354 | 115 |
| 99 | 280 | 52 |
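To compare candidate output-token limits against the distribution above, a quick tally helps (a sketch; only a handful of rows from the table are inlined here for illustration):

```python
# gen_len values for a few rows of the table above (index: gen_len).
gen_lens = {0: 138, 5: 3072, 22: 10, 31: 2445, 59: 7, 87: 1092}

def prompts_over_limit(gen_lens: dict[int, int], limit: int) -> list[int]:
    """Indices whose generation would be cut off by the given limit."""
    return sorted(i for i, n in gen_lens.items() if n > limit)

assert prompts_over_limit(gen_lens, 1024) == [5, 31, 87]
assert prompts_over_limit(gen_lens, 128) == [0, 5, 31, 87]
```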
@farook-edev: please provide a JSON or TXT file with the 100 few-shot Tiny-MMLU prompts implemented in the app.
When using a template that elicits the first letter as the answer:
| Index | Prompt Length (tokens) | Generation Length (tokens) |
|---|---|---|
| 0 | 641 | 1024 |
| 1 | 640 | 1024 |
| 2 | 1630 | 1024 |
| 3 | 1582 | 1024 |
| 4 | 1692 | 1024 |
| 5 | 324 | 1024 |
| 6 | 317 | 1024 |
| 7 | 338 | 1024 |
| 8 | 314 | 1024 |
| 9 | 604 | 1024 |
| 10 | 587 | 1024 |
| 11 | 394 | 1024 |
| 12 | 437 | 1024 |
| 13 | 426 | 1024 |
| 14 | 815 | 1024 |
| 15 | 896 | 1024 |
| 16 | 575 | 1024 |
| 17 | 417 | 1024 |
| 18 | 333 | 1024 |
| 19 | 310 | 1024 |
| 20 | 279 | 1024 |
| 21 | 458 | 1024 |
| 22 | 547 | 1024 |
| 23 | 538 | 1024 |
| 24 | 525 | 1024 |
| 25 | 565 | 1024 |
| 26 | 550 | 1024 |
| 27 | 517 | 1024 |
| 28 | 457 | 1024 |
| 29 | 495 | 1024 |
| 30 | 539 | 1024 |
| 31 | 2920 | 1024 |
| 32 | 349 | 1024 |
| 33 | 346 | 1024 |
| 34 | 447 | 1024 |
| 35 | 461 | 1024 |
| 36 | 340 | 1024 |
| 37 | 519 | 1024 |
| 38 | 563 | 1024 |
| 39 | 542 | 1024 |
| 40 | 506 | 1024 |
| 41 | 532 | 1024 |
| 42 | 483 | 1024 |
| 43 | 463 | 1024 |
| 44 | 830 | 1024 |
| 45 | 725 | 1024 |
| 46 | 1291 | 1024 |
| 47 | 1300 | 1024 |
| 48 | 308 | 1024 |
| 49 | 313 | 1024 |
| 50 | 302 | 1024 |
| 51 | 638 | 1024 |
| 52 | 645 | 1024 |
| 53 | 626 | 1024 |
| 54 | 353 | 1024 |
| 55 | 399 | 1024 |
| 56 | 387 | 1024 |
| 57 | 310 | 1024 |
| 58 | 290 | 1024 |
| 59 | 260 | 1024 |
| 60 | 297 | 1024 |
| 61 | 478 | 1024 |
| 62 | 639 | 1024 |
| 63 | 637 | 1024 |
| 64 | 634 | 1024 |
| 65 | 644 | 1024 |
| 66 | 631 | 1024 |
| 67 | 605 | 1024 |
| 68 | 614 | 1024 |
| 69 | 586 | 1024 |
| 70 | 550 | 1024 |
| 71 | 295 | 1024 |
| 72 | 295 | 1024 |
| 73 | 308 | 1024 |
| 74 | 514 | 1024 |
| 75 | 539 | 1024 |
| 76 | 511 | 1024 |
| 77 | 531 | 1024 |
| 78 | 478 | 1024 |
| 79 | 697 | 1024 |
| 80 | 598 | 1024 |
| 81 | 1661 | 1024 |
| 82 | 1558 | 1024 |
| 83 | 1440 | 1024 |
| 84 | 1801 | 1024 |
| 85 | 1735 | 1024 |
| 86 | 1166 | 1024 |
| 87 | 1028 | 1024 |
| 88 | 1030 | 1024 |
| 89 | 1114 | 1024 |
| 90 | 995 | 1024 |
| 91 | 536 | 1024 |
| 92 | 541 | 1024 |
| 93 | 574 | 1024 |
| 94 | 366 | 1024 |
| 95 | 1246 | 1024 |
| 96 | 440 | 1024 |
| 97 | 415 | 1024 |
| 98 | 320 | 1024 |
| 99 | 246 | 1024 |
@farook-edev: is it possible to truncate prompt 31 to 2048 tokens (from the start, i.e. remove one or two few-shot input examples)?
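Front-truncation can be sketched like this (illustrative only; the real fix would drop whole few-shot examples at the tokenizer level rather than cutting raw tokens mid-example):

```python
def truncate_front(tokens: list[int], max_len: int) -> list[int]:
    """Keep the last max_len tokens, dropping the oldest context first.

    Dropping from the front preserves the actual question at the end of
    the prompt; ideally whole few-shot examples are removed instead of
    cutting one in half.
    """
    if len(tokens) <= max_len:
        return tokens
    return tokens[-max_len:]

prompt = list(range(2920))           # e.g. prompt 31 (2920 tokens in the templated table)
truncated = truncate_front(prompt, 2048)
assert len(truncated) == 2048
assert truncated[-1] == prompt[-1]   # the tail (the question) is kept
```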
Loadgen's minimum sample count is 64 -> the app will show red if fewer than 64 queries are run, even with early stopping (5-minute limit).
Choices for the max output-token limit, for few-shot Tiny-MMLU:
- 1024 -> 3 prompts -> eliminate this one
- 128 -> 30 prompts (for performance)
- 64 -> 60 prompts
Collect enough statistics for both time to first token (input prompt) and tokens/sec (decode). @farook-edev: let us pick 128 max output tokens as the default value, and make it configurable from the LLM app's CLI.
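Making the default configurable could look like this (a sketch; the flag name is hypothetical and not an existing option of the app):

```python
import argparse

# Hypothetical CLI flag; the actual option name in the LLM app may differ.
parser = argparse.ArgumentParser(description="LLM benchmark runner (sketch)")
parser.add_argument("--max-output-tokens", type=int, default=128,
                    help="maximum tokens to generate per prompt")

args = parser.parse_args([])                               # default value
assert args.max_output_tokens == 128
args = parser.parse_args(["--max-output-tokens", "1024"])  # CLI override
assert args.max_output_tokens == 1024
```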