Hybrid execution time predictor to solve extrapolation issues
The current default random forest predictor overfits the training set and cannot generalize batch execution time predictions to unseen token counts or batch sizes. For example, for any input beyond the profiled range it simply returns the time observed for the largest `num_tokens`, i.e. the prediction saturates instead of extrapolating.
This makes the end-to-end latency estimates inaccurate, especially for long-context sequences.
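A minimal sketch of this failure mode, assuming scikit-learn (illustrative only, not Vidur's actual predictor code):

```python
# Illustrative only: shows how a random forest saturates outside its
# training range instead of extrapolating.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic profiling data: execution time up to ~4k tokens.
num_tokens = np.arange(64, 4096, 64, dtype=float).reshape(-1, 1)
exec_time = 0.5 * num_tokens.ravel() ** 1.5

rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(num_tokens, exec_time)

print(rf.predict([[4000.0]]))   # inside the profiled range: reasonable
print(rf.predict([[32768.0]]))  # outside: ~same as the 4k prediction, flat
```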
Prefill Latency (TTFT):
| Predictor | p50 | p90 | p95 | p99 |
|---|---|---|---|---|
| random-forest | 2.09 | 8.59 | 10.49 | 12.81 |
| hybrid-forest | 10.11 | 37.01 | 43.33 | 47.36 |
E2E Latency:
| Predictor | p50 | p90 | p95 | p99 |
|---|---|---|---|---|
| random-forest | 10.97 | 28.08 | 30.50 | 49.27 |
| hybrid-forest | 47.04 | 90.13 | 100.81 | 111.52 |
We verified that using a polynomial fit for extrapolation together with a random forest for interpolation yields the best MAPE (mean absolute percentage error) and brings the predictions closest to the real system.
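For concreteness, here is a minimal sketch of such a hybrid (scikit-learn and numpy assumed; `HybridPredictor` and the polynomial `degree` are illustrative names, not the exact code in this PR):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

class HybridPredictor:
    """Random forest inside the profiled range, polynomial fit beyond it."""

    def __init__(self, degree: int = 2):
        self.rf = RandomForestRegressor(n_estimators=100, random_state=0)
        self.degree = degree

    def fit(self, x: np.ndarray, y: np.ndarray) -> "HybridPredictor":
        # The cutover point is the largest profiled value, so the forest
        # is never asked to extrapolate.
        self.x_max = x.max()
        self.rf.fit(x.reshape(-1, 1), y)
        # Polynomial fit over the full profiled range, used only beyond it.
        self.poly = np.polynomial.Polynomial.fit(x, y, deg=self.degree)
        return self

    def predict(self, x: np.ndarray) -> np.ndarray:
        x = np.asarray(x, dtype=float)
        out = np.empty_like(x)
        interp = x <= self.x_max
        if interp.any():
            out[interp] = self.rf.predict(x[interp].reshape(-1, 1))
        if (~interp).any():
            out[~interp] = self.poly(x[~interp])
        return out
```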
@nitinkedia7 please review this PR; it introduces an important fix.
Hi @yl3469, I spent quite a bit of time with this PR because the core issue it aims to solve (extrapolating to, say, higher context lengths) would make Vidur usable at points it hasn't been profiled. However, I found several critical issues with the implementation:
- `GridSearchCV` is used to train multiple predictors for the same operation (e.g. `mlp_up_proj`) and the best one is chosen. Each model uses a different hyperparameter set for, say, the random forest. Now say we have three predictors (`lower`, `middle`, `upper`) for the same operation, with L unique hyperparameter sets for `lower`, M for `middle`, and U for `upper`. The grid search space in the current implementation then blows up to L * M * U. Cache preparation didn't even finish in an hour when I tried to create predictors for the 128k / 256k prefill attention operation. We need to drastically reduce this cost. Note that the performance problem is not fundamental: we can find the best `lower`, `middle`, and `upper` models separately and combine them only at inference, making the search space L + M + U instead of L * M * U.
- We do not need a separate `lower` predictor, because all profiling starts from the minimum (e.g. batch size of 1, context length of 2, etc.). There is a significant performance cost to adding each predictor (explained above), so if we can do without one, we should.
- We cache the prediction results and use them as lookup tables at runtime. Say we have data up to batch size 128 and want to predict up to 512. A simpler implementation would be: the training function trains both an RF and an LR model on the full profiling data (i.e. up to 128); when predicting at or below 128 it uses the RF, and above 128 the LR. This also removes the additional hyperparameter (the "top 20%" of the data) for the upper model. A minimal sketch of this scheme follows the list.
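A hedged sketch of the suggested simplification (scikit-learn assumed; `train_hybrid` and the parameter grid are hypothetical, not code from this PR). Each sub-model is tuned with its own `GridSearchCV`, so the search cost is additive rather than multiplicative, and the RF/LR split happens only at inference:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

def train_hybrid(x: np.ndarray, y: np.ndarray):
    """Train RF and LR separately on the full profiling data; split at inference."""
    X = x.reshape(-1, 1)
    # Tune the interpolating forest on its own (search cost stays additive).
    rf = GridSearchCV(
        RandomForestRegressor(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [None, 8]},
    ).fit(X, y).best_estimator_
    # LR has no hyperparameters to search; fit it on the same full data,
    # dropping the extra "top 20%" knob from the original implementation.
    lr = LinearRegression().fit(X, y)
    boundary = x.max()  # largest profiled value, e.g. batch size 128

    def predict(q: np.ndarray) -> np.ndarray:
        q = np.asarray(q, dtype=float)
        Q = q.reshape(-1, 1)
        out = np.empty_like(q)
        below = q <= boundary
        if below.any():
            out[below] = rf.predict(Q[below])   # interpolate with RF
        if (~below).any():
            out[~below] = lr.predict(Q[~below])  # extrapolate with LR
        return out

    return predict
```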
This PR remains compatible with the codebase even after the recent big PR #56. However, the implementation needs to be made much more performant, and a few of the new parameters should be removed.