Hybrid execution time predictor to solve extrapolation issues
The current default random forest predictor overfits the training set and cannot generalize batch execution time predictions to unseen token counts or batch sizes. For example, for any input beyond the profiled range it simply returns the time observed for the largest `num_tokens`, i.e. the prediction saturates instead of extrapolating.
This makes the end-to-end latency estimates inaccurate, especially for long-context sequences.
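A minimal sketch of this failure mode, assuming scikit-learn (illustrative only, not Vidur's actual predictor code):

```python
# Illustrative only: shows how a random forest saturates outside its
# training range instead of extrapolating.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic profiling data: execution time up to ~4k tokens.
num_tokens = np.arange(64, 4096, 64, dtype=float).reshape(-1, 1)
exec_time = 0.5 * num_tokens.ravel() ** 1.5

rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(num_tokens, exec_time)

print(rf.predict([[4000.0]]))   # inside the profiled range: reasonable
print(rf.predict([[32768.0]]))  # outside: ~same as the 4k prediction, flat
```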
Prefill Latency (TTFT):
| Predictor | p50 | p90 | p95 | p99 |
|---|---|---|---|---|
| random-forest | 2.09 | 8.59 | 10.49 | 12.81 |
| hybrid-forest | 10.11 | 37.01 | 43.33 | 47.36 |
E2E Latency:
| Predictor | p50 | p90 | p95 | p99 |
|---|---|---|---|---|
| random-forest | 10.97 | 28.08 | 30.50 | 49.27 |
| hybrid-forest | 47.04 | 90.13 | 100.81 | 111.52 |
We verified that using a polynomial fit for extrapolation together with a random forest for interpolation yields the best MAPE (mean absolute percentage error) and brings the predictions closest to the real system.
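For concreteness, here is a minimal sketch of such a hybrid (scikit-learn and numpy assumed; `HybridPredictor` and the polynomial `degree` are illustrative names, not the exact code in this PR):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

class HybridPredictor:
    """Random forest inside the profiled range, polynomial fit beyond it."""

    def __init__(self, degree: int = 2):
        self.rf = RandomForestRegressor(n_estimators=100, random_state=0)
        self.degree = degree

    def fit(self, x: np.ndarray, y: np.ndarray) -> "HybridPredictor":
        # The cutover point is the largest profiled value, so the forest
        # is never asked to extrapolate.
        self.x_max = x.max()
        self.rf.fit(x.reshape(-1, 1), y)
        # Polynomial fit over the full profiled range, used only beyond it.
        self.poly = np.polynomial.Polynomial.fit(x, y, deg=self.degree)
        return self

    def predict(self, x: np.ndarray) -> np.ndarray:
        x = np.asarray(x, dtype=float)
        out = np.empty_like(x)
        interp = x <= self.x_max
        if interp.any():
            out[interp] = self.rf.predict(x[interp].reshape(-1, 1))
        if (~interp).any():
            out[~interp] = self.poly(x[~interp])
        return out
```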
@nitinkedia7 please review this PR; it introduces an important fix.
Hi @yl3469, I spent quite a bit of time with this PR because the core issue it aims to solve (extrapolating to, say, higher context lengths) would make Vidur usable at points it hasn't been profiled. However, I found several critical issues with the implementation:
- `GridSearchCV` is used to train multiple predictors for the same operation (e.g. `mlp_up_proj`) and the best one is chosen. Each model uses a different hyperparameter set for, say, the random forest. Now say we have three predictors (`lower`, `middle`, `upper`) for the same operation, with L unique hyperparameter sets for `lower`, M for `middle`, and U for `upper`. The grid search space in the current implementation then blows up to L * M * U. Cache preparation didn't even finish in an hour when I tried to create predictors for the 128k / 256k prefill attention operation. We need to drastically reduce this cost. Note that the performance problem is not fundamental: we can find the best `lower`, `middle`, and `upper` models separately and combine them only at inference, making the search space L + M + U instead of L * M * U.
- We do not need a separate `lower` predictor, because all profiling starts from the minimum (e.g. batch size of 1, context length of 2, etc.). There is a significant performance cost to adding each predictor (explained above), so if we can do without one, we should.
- We cache the prediction results and use them as lookup tables at runtime. Say we have data up to batch size 128 and want to predict up to 512. A simpler implementation would be: the training function trains both an RF and an LR model on the full profiling data (i.e. up to 128); when predicting at or below 128 it uses the RF, and above 128 the LR. This also removes the additional hyperparameter (the "top 20%" of the data) for the upper model. A minimal sketch of this scheme follows the list.
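A hedged sketch of the suggested simplification (scikit-learn assumed; `train_hybrid` and the parameter grid are hypothetical, not code from this PR). Each sub-model is tuned with its own `GridSearchCV`, so the search cost is additive rather than multiplicative, and the RF/LR split happens only at inference:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

def train_hybrid(x: np.ndarray, y: np.ndarray):
    """Train RF and LR separately on the full profiling data; split at inference."""
    X = x.reshape(-1, 1)
    # Tune the interpolating forest on its own (search cost stays additive).
    rf = GridSearchCV(
        RandomForestRegressor(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [None, 8]},
    ).fit(X, y).best_estimator_
    # LR has no hyperparameters to search; fit it on the same full data,
    # dropping the extra "top 20%" knob from the original implementation.
    lr = LinearRegression().fit(X, y)
    boundary = x.max()  # largest profiled value, e.g. batch size 128

    def predict(q: np.ndarray) -> np.ndarray:
        q = np.asarray(q, dtype=float)
        Q = q.reshape(-1, 1)
        out = np.empty_like(q)
        below = q <= boundary
        if below.any():
            out[below] = rf.predict(Q[below])   # interpolate with RF
        if (~below).any():
            out[~below] = lr.predict(Q[~below])  # extrapolate with LR
        return out

    return predict
```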
This PR remains compatible with the codebase even after the recent big PR #56. However, the implementation needs to be made much more performant, and a few of the new parameters should be removed.