Performance differences between Ollama and GPUStack when running an embedding model
I have configured both Ollama and GPUStack to run the bge-m3 model. The bge-m3 model running on GPUStack was also downloaded from Ollama and is executed using vLLM or llama-box. However, I noticed that when calling the bge-m3 model on GPUStack, the GPU compute resources are not fully utilized (utilization stays below 20%), whereas when calling the bge-m3 model on Ollama, GPU utilization reaches over 80%.
I have confirmed that I am calling the same model in both cases. I also tried different embedding models and specified parameters such as quantization, max-num-batched-tokens, kv-cache-dtype, and max-num-seqs when launching the model, but none of them helped.
Has anyone encountered this issue before? Could you share your thoughts and solutions? I would really appreciate it.
Although vLLM supports the GGUF format, GPUStack only supports GGUF models via llama-box. Besides the GPU utilization, did you see any other difference (e.g., total time cost for the same embedding request) between Ollama and GPUStack?
Test script
import openai
import random
import nltk
import time
from nltk.corpus import words

nltk.download('words')

client = openai.OpenAI(
    api_key="fake",
)

model_name = "bge-m3"

# Get a list of n random English words from the NLTK corpus
def get_random_words_from_dictionary(n=2000):
    word_list = words.words()
    return random.sample(word_list, n)

# Request embeddings from the model and measure time taken
def get_embeddings(text_list, model=model_name):
    input_text = " ".join(text_list)
    start_time = time.time()
    response = client.embeddings.create(model=model, input=input_text)
    end_time = time.time()
    duration = end_time - start_time
    embeddings = [item.embedding for item in response.data]
    return embeddings, duration, response.usage.total_tokens

# Run multiple embedding requests and calculate average duration and tokens
def run_benchmark(base_url, runs=10):
    print(f"\nTesting model from {base_url}")
    client.base_url = base_url
    durations = []
    tokens_used = []
    for i in range(runs):
        print(f"Run {i + 1}/{runs}")
        words_list = get_random_words_from_dictionary(2000)
        _, duration, tokens = get_embeddings(words_list)
        print(f"Duration: {duration:.2f}s, Tokens used: {tokens}")
        durations.append(duration)
        tokens_used.append(tokens)
    avg_time = sum(durations) / len(durations)
    avg_tokens = sum(tokens_used) / len(tokens_used)
    print(f"\nAverage duration: {avg_time:.2f}s over {runs} runs")
    print(f"Average tokens used: {avg_tokens:.0f}")
    return avg_time, avg_tokens

if __name__ == "__main__":
    base_url = "http://192.168.1.100:11434/v1"
    run_benchmark(base_url, runs=100)
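To benchmark the GPUStack / llama-box side with the same script, swap the base_url in the __main__ block for the llama-box OpenAI-compatible endpoint. The URL below is only an example and should be adjusted to your deployment:

    # Example only: point the same benchmark at a llama-box / GPUStack
    # OpenAI-compatible endpoint instead of Ollama (adjust host and port).
    base_url = "http://127.0.0.1:8080/v1"
    run_benchmark(base_url, runs=100)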
Configurations are aligned on both sides: np=1, ctx-size=8192.
Results
# 1-concurrency, 1-batch
Ollama (0.9.1)
Average duration: 0.30s over 100 runs
Average tokens used: 5960
llama-box (v0.0.154)
Average duration: 0.63s over 100 runs
Average tokens used: 5965
llama-server (b5686)
Average duration: 0.75s over 100 runs
Average tokens used: 5966
1. Preparation
Adjust the script provided by @gitlawr with the following changes:
- Add a warmup round to eliminate cold-start effects
- Parameterize base URL, model name, and words size
import openai
import random
import nltk
import time
import argparse
from nltk.corpus import words

nltk.download('words')

client = openai.OpenAI(
    api_key="fake",
)

# Get a list of n random English words from the NLTK corpus
def get_random_words_from_dictionary(n):
    word_list = words.words()
    return random.sample(word_list, n)

# Request embeddings from the model and measure time taken
def get_embeddings(word_list, model):
    input_text = " ".join(word_list)
    start_time = time.time()
    response = client.embeddings.create(model=model, input=input_text)
    end_time = time.time()
    duration = end_time - start_time
    embeddings = [item.embedding for item in response.data]
    return embeddings, duration, response.usage.total_tokens

# Run multiple embedding requests and calculate average duration and tokens
def run_benchmark(base_url, runs, model_name, words_size):
    print(f"\nTesting {model_name} model from {base_url}")
    client.base_url = base_url
    durations = []
    tokens_used = []
    # Warm up to eliminate cold-start effects
    warmup_rounds = 3
    for i in range(warmup_rounds):
        print(f"Warmup {model_name} {i + 1}/{warmup_rounds}")
        word_list = get_random_words_from_dictionary(words_size)
        get_embeddings(word_list, model=model_name)
    for i in range(runs):
        print(f"Run {model_name} {i + 1}/{runs}")
        word_list = get_random_words_from_dictionary(words_size)
        _, duration, tokens = get_embeddings(word_list, model=model_name)
        print(f"Duration: {duration:.2f}s, Tokens used: {tokens}")
        durations.append(duration)
        tokens_used.append(tokens)
    avg_time = sum(durations) / len(durations)
    avg_tokens = sum(tokens_used) / len(tokens_used)
    print(f"\nAverage duration: {avg_time:.2f}s over {runs} runs")
    print(f"Average tokens used: {avg_tokens:.0f}")
    return avg_time, avg_tokens

if __name__ == "__main__":
    parser = argparse.ArgumentParser(exit_on_error=False, allow_abbrev=False)
    parser.set_defaults(base_url='http://127.0.0.1:8080/v1', model_name='bge-m3', words_size=2000)
    parser.add_argument('--base-url', type=str)
    parser.add_argument('--model-name', type=str)
    parser.add_argument('--words-size', type=int)
    args = parser.parse_args()
    run_benchmark(args.base_url, runs=100, model_name=args.model_name, words_size=args.words_size)
2. Launch Ollama / LLaMA Box
Ollama (pinned to one NVIDIA GPU, with a context size of 8192 and a parallel size of 1)
$ CUDA_VISIBLE_DEVICES=0 \
OLLAMA_CONTEXT_LENGTH=8192 \
OLLAMA_NUM_PARALLEL=1 \
ollama serve
$ # wake up model
$ curl localhost:11434/v1/embeddings -d '{"model":"bge-m3","input":"why is the sky blue?","encoding_format":"float"}'
LLaMA Box (pinned to one NVIDIA GPU, with a context size of 8192 and a parallel size of 1)
$ CUDA_VISIBLE_DEVICES=1 \
llama-box --embeddings -ngl 99 --host 0.0.0.0 --port 8080 \
-np 1 -c 8192 \
-m gpustack/bge-m3-GGUF/bge-m3-FP16.gguf
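Optionally, a quick sanity check of the llama-box endpoint from Python, mirroring the Ollama wake-up call above (a sketch reusing the same OpenAI client as the test script; host and port follow the launch command above, and llama-box may ignore the model name):

import openai

# Sketch: confirm the llama-box OpenAI-compatible embeddings endpoint responds.
client = openai.OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="fake")
resp = client.embeddings.create(model="bge-m3", input="why is the sky blue?")
print(len(resp.data[0].embedding), resp.usage.total_tokens)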
3. Run the Testing Script
$ ./test.py --base-url http://127.0.0.1:11434/v1
# Ollama
Average duration: 0.22s over 100 runs
Average tokens used: 5963
$ ./test.py
# LLaMA Box
Average duration: 0.58s over 100 runs
Average tokens used: 5967
A list of 2000 random words produces roughly 5800-6100 tokens per request.
Ollama takes roughly 1/3 of the time of LLaMA Box.
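The ~5800-6100 token figure can be cross-checked offline (a sketch assuming the Hugging Face transformers package and the BAAI/bge-m3 tokenizer, which may differ slightly from the server-side tokenization):

import random

import nltk
from nltk.corpus import words
from transformers import AutoTokenizer

nltk.download('words')

# Sketch: estimate how many tokens 2000 random English words produce for bge-m3.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
text = " ".join(random.sample(words.words(), 2000))
print(len(tokenizer(text)["input_ids"]))  # typically lands in the ~5800-6100 range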
4. Investigate
Both Ollama and LLaMA Box are based on llama.cpp, so what causes the more than 2x gap?
Let's look at the launch logs of Ollama and LLaMA Box.
llama.cpp splits token decoding/encoding work into batches, governed by two batch sizes: n_batch (the logical batch size) and n_ubatch (the physical micro-batch size). The following pseudo-code shows how they coordinate:
for (i = 0; i < n_batch; i += n_ubatch):
    n_tokens = min(n_ubatch, n_batch - i)
    <submit n_tokens tokens to process>
Ollama does only 1/4 of the work of LLaMA Box: it takes only 512 tokens out of the 5800-6100 request tokens here, resulting in an incomplete embedding; see https://github.com/gpustack/gpustack/issues/950#issuecomment-2598210455.
LLaMA Box, by contrast, forces n_batch = n_ubatch = n_ctx for embedding models, so the full request fits in a single batch; see https://github.com/ggml-org/llama.cpp/pull/13076.
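To make the effect concrete, here is a small Python sketch of the described behavior (not llama.cpp's actual code); the 512-token cap for Ollama and the n_batch = n_ubatch = n_ctx = 8192 lock for LLaMA Box are taken from the two points above, and the ~6000-token request matches the benchmark:

def tokens_processed(n_prompt, n_batch, n_ubatch):
    # Simulate how many of the prompt tokens are actually submitted when the
    # request is capped at n_batch and split into n_ubatch-sized chunks.
    capped = min(n_prompt, n_batch)  # tokens beyond n_batch are dropped
    submitted = 0
    for i in range(0, capped, n_ubatch):
        submitted += min(n_ubatch, capped - i)
    return submitted

print(tokens_processed(6000, n_batch=512, n_ubatch=512))    # Ollama here: 512 -> incomplete embedding
print(tokens_processed(6000, n_batch=8192, n_ubatch=8192))  # LLaMA Box: 6000 -> full request processed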
5. The True Comparison
So, for a fair comparison, keep the request under 512 tokens, which is roughly 150 words.
$ ./test.py --base-url http://127.0.0.1:11434/v1 --words-size 150
# Ollama
Average duration: 0.06s over 100 runs
Average tokens used: 448
$ ./test.py --words-size 150
# LLaMA Box
Average duration: 0.01s over 100 runs
Average tokens used: 448
Now you can see that LLaMA Box takes about 1/6 of the time of Ollama.