Add Hugging Face client

Open philschmid opened this issue 10 months ago • 2 comments

What does this PR do?

This PR adds a dedicated Hugging Face client, which lets llmperf users benchmark Hugging Face models served with TGI (Text Generation Inference) through the Inference API, Inference Endpoints, or a local deployment or any other URL.

Below is a simple example.

Run TGI:

docker run --gpus all -ti -p 8080:80 -e MODEL_ID=HuggingFaceH4/zephyr-7b-beta ghcr.io/huggingface/text-generation-inference:latest
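
Before starting the benchmark, it can help to confirm that the container is actually answering requests. The following is a minimal sketch, not part of this PR: it assumes the server exposes TGI's `/generate` route on the port mapped above and returns a JSON object with a `generated_text` field. The helper names `build_generate_request` and `smoke_test` are hypothetical and exist only for illustration.

```python
import json
import urllib.request


def build_generate_request(base_url, prompt, max_new_tokens=20):
    """Build a POST request for the (assumed) TGI /generate route."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(base_url="http://localhost:8080", prompt="Hello", timeout=30.0):
    """Send one short generation request and return the generated text.

    Raises urllib.error.URLError if the server is not reachable yet.
    """
    request = build_generate_request(base_url, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read())
    # Assumption: the response body carries the text under "generated_text".
    return body["generated_text"]
```

Calling `smoke_test()` once the model has finished loading should return a short completion; a connection error usually means the container is still starting or the port mapping differs from `8080:80`.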

Run the benchmark:

export HUGGINGFACE_API_BASE="http://localhost:8080"
export MODEL_ID="HuggingFaceH4/zephyr-7b-beta"

python token_benchmark_ray.py \
--model $MODEL_ID \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api huggingface \
--additional-sampling-params '{}'

Mar 28 '24 12:03 philschmid
