[WIP] Include GPT Fast in torch.compile nightly benchmark workflow
Description
The objective of this PR is to add the GPT Fast model, with weights corresponding to Llama 7B using int4 quantization, to the torch.compile nightly benchmark workflow.
Steps to download the Llama 7B weights on the benchmark host:
Ran a temporary workflow that downloads the weights using the HUGGING_FACE_HUB_TOKEN secret, in commit 1e6088e
Results of the successful run: https://github.com/pytorch/serve/actions/runs/7271384883/job/19811851851?pr=2857
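A temporary workflow step along these lines can perform the download (the step name, model repo, target path, and `huggingface-cli` invocation below are illustrative assumptions, not the exact contents of commit 1e6088e):

```yaml
# Hypothetical GitHub Actions step: download Llama 7B weights on the
# benchmark runner using the HUGGING_FACE_HUB_TOKEN repository secret.
- name: Download Llama 7B weights
  env:
    HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
  run: |
    pip install huggingface_hub
    huggingface-cli login --token "$HUGGING_FACE_HUB_TOKEN"
    huggingface-cli download meta-llama/Llama-2-7b-chat-hf \
      --local-dir ./model/llama-2-7b-chat-hf
```

Because the weights are gated, the token must belong to an account that has been granted access to the model repository.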
Testing:
Ran the benchmark workflow in commit 31936c9
Results of the successful run: https://github.com/pytorch/serve/actions/runs/7272224847/job/19813999840?pr=2857
Benchmark report file: report.md
Updates in the benchmark-ab.py script:
Also updated the benchmark-ab.py script to pass the `-l` flag to the `ab` command, so that variable-length responses are not counted as errors (https://httpd.apache.org/docs/2.4/programs/ab.html).
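For illustration, the command assembly might look like the sketch below (the `build_ab_command` helper and its parameters are hypothetical, not the actual code in benchmark-ab.py; only `-n`, `-c`, and `-l` are documented `ab` options):

```python
def build_ab_command(requests: int, concurrency: int, url: str,
                     variable_length: bool = True) -> str:
    """Assemble an Apache Bench command line (illustrative sketch)."""
    cmd = ["ab", "-n", str(requests), "-c", str(concurrency)]
    if variable_length:
        # -l: accept responses of varying length without counting
        # them as failures (needed for generative-model endpoints).
        cmd.append("-l")
    cmd.append(url)
    return " ".join(cmd)

print(build_ab_command(100, 10, "http://127.0.0.1:8080/predictions/gpt_fast"))
# → ab -n 100 -c 10 -l http://127.0.0.1:8080/predictions/gpt_fast
```

Without `-l`, `ab` treats any response whose length differs from the first response as a failed request, which would misreport errors for LLM endpoints that emit variable-length generations.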
@namannandan @lxning What work remains for this PR?