Update Loadgen version
Currently we use the commit 238d035 from Feb 29, 2024.
We should update Loadgen to the latest version: https://github.com/mlcommons/inference/commits/master/loadgen
We currently use the mobile_update branch, which contains commits related to enforcing a maximum duration. This is not in the master branch. Therefore, we have 3 options:
1. Merge `master` into the `mobile_update` branch and maintain our own branch for loadgen.
2. Use the `master` branch but apply our commits as a patch.
3. Open a PR in the inference repo to merge the changes into `master`.
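For reference, option 2 could look roughly like the following. This is a sketch on a throwaway repo, not the real inference checkout; the file name, commit messages, and patch directory are illustrative, but the `format-patch` / `am` flow is the same (with the range `master..mobile_update` in the actual repo):

```shell
# Sketch of option 2: keep loadgen at upstream master and carry the
# mobile_update commits as patch files. Demonstrated on a throwaway repo;
# in the real inference checkout the range would be master..mobile_update.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q -b master repo && cd repo
git config user.email demo@example.com
git config user.name demo
echo base > loadgen.cc
git add . && git commit -qm "upstream master"
git checkout -qb mobile_update
echo "enforce max duration" >> loadgen.cc
git commit -qam "mobile: enforce max duration"
# Export the branch-only commits as patch files ...
git format-patch master..mobile_update -o patches/ >/dev/null
# ... then re-apply them on top of a clean master checkout.
git checkout -q master
git am -q patches/*.patch
grep -q "max duration" loadgen.cc && echo "patch applied"
```

The patch files would live in this mobile_app_open repo and be applied at build time, so our delta stays visible in one place.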
@freedomtan, @Mostelk, @mohitmundhragithub, @farook-edev, I would like your advice on which path we should take. I personally prefer option 2 (apply a patch), since it keeps our modifications centralised in this mobile_app_open repo. Option 3 may be the cleanest option, but I'm not sure whether it is feasible (otherwise we would probably have gone that way already).
Looking at loadgen's documentation, they seem to require any changes to be upstreamed as a rule. Does having our own branch satisfy that rule? If not, then I don't think we can go with either option 1 or 2.
The above issue aside, I agree that option 3 is the cleanest, but if it proves too difficult or bureaucratic, then option 1 (optionally maintaining a fork rather than a branch) would be the next best thing IMO, because it keeps loadgen as a dependency the app can simply consume, which is cleaner than applying a patch.
@anhappdev The current loadgen version we use already has token latencies, so we should be able to complete the LLM Android implementation without an upgrade.
In the same vein, the API should be similar enough that we'd be able to upgrade later without rewriting code.
IMHO, to maintain consistency with our previous approach (before we started using the mobile_update branch), we should pin the mobile_update branch to a specific “stable” tag, say v5.1, rather than to the master branch.
https://github.com/mlcommons/inference/tags
Let's try to rebase the mobile_update branch onto master and send a PR. Then we can use future versions like v6.0. @anhappdev
- https://github.com/mlcommons/inference/pull/2394/
I'm not sure where to report this, but a few weeks back there was a question about whether or not LoadGen excludes TTFT from the TPS count. I found this line, which confirms that it does in fact only count the duration after the first output token is generated (`latency` is the variable holding the time until the first output token is generated). @freedomtan @Mostelk
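A toy illustration of that accounting (these numbers are made up, and this is not LoadGen's actual code): the first token, and the time spent waiting for it, are subtracted before the rate is computed:

```shell
# Hypothetical numbers showing a TPS figure that excludes time-to-first-token:
# only the decode phase after the first output token counts toward throughput.
awk 'BEGIN {
  n_tokens = 101   # hypothetical total output tokens
  total    = 12.0  # hypothetical seconds from query issue to completion
  ttft     = 2.0   # hypothetical seconds until the first output token
  printf "%.1f tokens/s\n", (n_tokens - 1) / (total - ttft)
}'
```

With these numbers, counting the full duration would give ~8.4 tokens/s, while excluding TTFT gives 10.0 tokens/s, so the distinction matters for reported throughput.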