llmperf
Add Hugging Face client
What does this PR do?
This PR adds a dedicated Hugging Face client, which lets llmperf
users benchmark Hugging Face models served with TGI (Text Generation Inference) through the Inference API, Inference Endpoints, or any other URL, such as a local server.
Below is a simple example.
Run TGI:
docker run --gpus all -ti -p 8080:80 -e MODEL_ID=HuggingFaceH4/zephyr-7b-beta ghcr.io/huggingface/text-generation-inference:latest
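Before starting the benchmark, it can help to confirm that the container is actually serving requests. The following is a minimal sketch, not part of this PR: it assumes the server is reachable at `http://localhost:8080` (the port mapping used above) and uses TGI's `/generate` route. The helper names `build_generate_request` and `smoke_test` are hypothetical and exist only for illustration.

```python
import json
import urllib.request


def build_generate_request(base_url, prompt, max_new_tokens=20):
    """Build a POST request for TGI's /generate route (hypothetical helper)."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(base_url="http://localhost:8080", prompt="Hello"):
    """Send one short generation request and return the generated text.

    Assumes the TGI container from the docker command is already running.
    """
    request = build_generate_request(base_url, prompt)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read())["generated_text"]


# Uncomment once the container has finished loading the model:
# print(smoke_test())
```

If the call raises a connection error, the model is most likely still loading or the port mapping differs from the one assumed here.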
Run the benchmark:
export HUGGINGFACE_API_BASE="http://localhost:8080"
export MODEL_ID="HuggingFaceH4/zephyr-7b-beta"
python token_benchmark_ray.py \
--model "$MODEL_ID" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api huggingface \
--additional-sampling-params '{}'
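Once the run finishes, the output written to `result_outputs` can be inspected. The sketch below is an illustration only: it assumes the benchmark writes its results as `.json` files into the directory given by `--results-dir`, and it makes no assumption about file names or the keys inside them. `load_json_results` and `describe` are hypothetical helpers, not part of llmperf.

```python
import json
from pathlib import Path


def load_json_results(results_dir):
    """Return {file name: parsed JSON} for every .json file in results_dir."""
    results = {}
    for path in sorted(Path(results_dir).glob("*.json")):
        with path.open() as handle:
            results[path.name] = json.load(handle)
    return results


def describe(results):
    """Summarize the top-level shape of each loaded file, one line per file."""
    lines = []
    for name, data in results.items():
        if isinstance(data, dict):
            lines.append(f"{name}: dict with {len(data)} keys")
        elif isinstance(data, list):
            lines.append(f"{name}: list with {len(data)} entries")
        else:
            lines.append(f"{name}: {type(data).__name__}")
    return lines


# Example usage, assuming the directory name passed to --results-dir above:
# for line in describe(load_json_results("result_outputs")):
#     print(line)
```

Listing only the top-level shape keeps the sketch independent of the exact metrics the benchmark records.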