lmdeploy
Optimize Mixtral
python3 benchmark/profile_throughput.py \
    ShareGPT_V3_unfiltered_cleaned_split.json \
    Mixtral-8x22B-v0.1 \
    --backend pytorch \
    --cache-max-entry-count 0.65 \
    --num-prompts 3000 \
    --concurrency 256 \
    --tp 4
--------------------------------------------------
concurrency: 256
elapsed_time: 736.060s
first token latency(s)(min, max, ave): 1.400, 32.566, 9.528
per-token latency(s) percentile(50, 75, 95, 99): [0.157, 0.165, 0.463, 0.549]
number of prompt tokens: 741804
number of completion tokens: 712850
token throughput (completion token): 968.468 token/s
token throughput (prompt + completion token): 1976.272 token/s
RPS (request per second): 4.076 req/s
RPM (request per minute): 244.545 req/min
--------------------------------------------------
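The derived metrics in the report follow directly from the raw counts. As a quick sanity check, here is a sketch of the arithmetic using the numbers from the run above (variable names are illustrative, not part of the profiler's output):

```python
# Raw values reported by profile_throughput.py above
elapsed_s = 736.060
prompt_tokens = 741_804
completion_tokens = 712_850
num_prompts = 3000

# Derived metrics, computed the same way the report does
completion_tps = completion_tokens / elapsed_s                    # completion token throughput
total_tps = (prompt_tokens + completion_tokens) / elapsed_s       # prompt + completion throughput
rps = num_prompts / elapsed_s                                     # requests per second
rpm = rps * 60                                                    # requests per minute

print(f"completion: {completion_tps:.3f} token/s")  # ~968.468
print(f"total:      {total_tps:.3f} token/s")       # ~1976.271
print(f"RPS:        {rps:.3f} req/s")               # ~4.076
print(f"RPM:        {rpm:.3f} req/min")             # ~244.545
```

The tiny differences in the last printed digit versus the report are rounding artifacts; the ratios themselves match the table.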