
Performance differences between Ollama and GPUStack when running an embedding model

Open wyanghu opened this issue 10 months ago • 3 comments

I have configured both Ollama and GPUStack to run the bge-m3 model. The bge-m3 model running on GPUStack was also downloaded from Ollama and is executed with vLLM or llama-box. However, when calling bge-m3 on GPUStack, the GPU is not fully utilized (utilization stays below 20%), whereas when calling it on Ollama, GPU utilization reaches over 80%.

I have confirmed that I am calling the same model in both cases. I also tried different embedding models and specified parameters such as quantization, max-num-batched-tokens, kv-cache-dtype, and max-num-seqs when launching the model, but none of this helped.

Has anyone encountered this issue before? Could you share your thoughts and solutions with me? I would really appreciate it.
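
(For reference, a minimal sketch of how such utilization numbers can be reproduced: it samples nvidia-smi while embedding requests are in flight against an OpenAI-compatible endpoint. The base URL, model name, and input text below are placeholders.)

import subprocess
import threading
import time

import openai

# Placeholder endpoint and model; point these at the deployment under test.
client = openai.OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="fake")
samples = []
stop = threading.Event()

def sample_gpu_util(interval=0.5):
    # Poll nvidia-smi for the GPU utilization percentage until asked to stop.
    while not stop.is_set():
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True,
        )
        samples.append(int(out.stdout.split()[0]))
        time.sleep(interval)

sampler = threading.Thread(target=sample_gpu_util, daemon=True)
sampler.start()
for _ in range(20):
    client.embeddings.create(model="bge-m3", input="why is the sky blue? " * 200)
stop.set()
sampler.join()
print(f"avg GPU util: {sum(samples) / len(samples):.1f}%, peak: {max(samples)}%")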

wyanghu avatar Mar 02 '25 15:03 wyanghu

Although vLLM supports the GGUF format, GPUStack only supports GGUF models via llama-box. Besides GPU utilization, did you see any other difference (e.g., total time cost for the same embedding) between Ollama and GPUStack?

pengjiang80 avatar Mar 03 '25 08:03 pengjiang80

Test script

import openai
import random
import nltk
import time
from nltk.corpus import words

nltk.download('words')

client = openai.OpenAI(
    api_key="fake",
)

model_name = "bge-m3"

# Get a list of n random English words from the NLTK corpus
def get_random_words_from_dictionary(n=2000):
    word_list = words.words()
    return random.sample(word_list, n)

# Request embeddings from the model and measure time taken
def get_embeddings(text_list, model=model_name):
    input_text = " ".join(text_list)
    start_time = time.time()
    response = client.embeddings.create(model=model, input=input_text)
    end_time = time.time()
    duration = end_time - start_time
    embeddings = [item.embedding for item in response.data]
    return embeddings, duration, response.usage.total_tokens

# Run multiple embedding requests and calculate average duration and tokens
def run_benchmark(base_url, runs=10):
    print(f"\nšŸš€ Testing model from {base_url}")
    client.base_url = base_url
    durations = []
    tokens_used = []

    for i in range(runs):
        print(f"šŸ” Run {i + 1}/{runs}")
        words_list = get_random_words_from_dictionary(2000)
        _, duration, tokens = get_embeddings(words_list)
        print(f"ā±ļø Duration: {duration:.2f}s, Tokens used: {tokens}")
        durations.append(duration)
        tokens_used.append(tokens)

    avg_time = sum(durations) / len(durations)
    avg_tokens = sum(tokens_used) / len(tokens_used)
    print(f"\nšŸ“Š Average duration: {avg_time:.2f}s over {runs} runs")
    print(f"šŸ“‰ Average tokens used: {avg_tokens:.0f}")
    return avg_time, avg_tokens

if __name__ == "__main__":
    base_url = "http://192.168.1.100:11434/v1"

    run_benchmark(base_url, runs=100)

Configurations are aligned for both backends: np=1, ctx-size=8192

Results

# 1-concurrency,1-batch

Ollama(0.9.1)
šŸ“Š Average duration: 0.30s over 100 runs
šŸ“‰ Average tokens used: 5960

llama-box(v0.0.154)
šŸ“Š Average duration: 0.63s over 100 runs
šŸ“‰ Average tokens used: 5965

llama-server(b5686)
šŸ“Š Average duration: 0.75s over 100 runs
šŸ“‰ Average tokens used: 5966

gitlawr avatar Jun 17 '25 10:06 gitlawr

1. Preparation

Adjust the script provided by @gitlawr with the following changes:

  • Add a warmup round to eliminate cold-start effects
  • Parameterize the base URL, model name, and word count
import openai
import random
import nltk
import time
import argparse
from nltk.corpus import words

nltk.download('words')

client = openai.OpenAI(
    api_key="fake",
)

# Get a list of n random English words from the NLTK corpus
def get_random_words_from_dictionary(n):
    word_list = words.words()
    return random.sample(word_list, n)

# Request embeddings from the model and measure time taken
def get_embeddings(word_list, model):
    input_text = " ".join(word_list)
    start_time = time.time()
    response = client.embeddings.create(model=model, input=input_text)
    end_time = time.time()
    duration = end_time - start_time
    embeddings = [item.embedding for item in response.data]
    return embeddings, duration, response.usage.total_tokens

# Run multiple embedding requests and calculate average duration and tokens
def run_benchmark(base_url, runs, model_name, words_size):
    print(f"\nšŸš€ Testing {model_name} model from {base_url}")
    client.base_url = base_url
    durations = []
    tokens_used = []

    # warm up
    for i in range(3):
        print(f"šŸ” Warmup {model_name} {i + 1}/{runs}")
        word_list = get_random_words_from_dictionary(words_size)
        get_embeddings(word_list, model=model_name)

    for i in range(runs):
        print(f"šŸ” Run {model_name} {i + 1}/{runs}")
        word_list = get_random_words_from_dictionary(words_size)
        _, duration, tokens = get_embeddings(word_list, model=model_name)
        print(f"ā±ļø Duration: {duration:.2f}s, Tokens used: {tokens}")
        durations.append(duration)
        tokens_used.append(tokens)

    avg_time = sum(durations) / len(durations)
    avg_tokens = sum(tokens_used) / len(tokens_used)
    print(f"\nšŸ“Š Average duration: {avg_time:.2f}s over {runs} runs")
    print(f"šŸ“‰ Average tokens used: {avg_tokens:.0f}")
    return avg_time, avg_tokens

if __name__ == "__main__":
    parser = argparse.ArgumentParser(exit_on_error=False, allow_abbrev=False)
    parser.set_defaults(base_url='http://127.0.0.1:8080/v1', model_name='bge-m3', words_size=2000)
    parser.add_argument('--base-url', type=str)
    parser.add_argument('--model-name', type=str)
    parser.add_argument('--words-size', type=int)

    args = parser.parse_args()

    run_benchmark(args.base_url, runs=100, model_name=args.model_name, words_size=args.words_size)

2. Launch Ollama / LLaMA Box

Ollama (pinned to one NVIDIA GPU, with context size 8192 and parallelism 1)
$ CUDA_VISIBLE_DEVICES=0 \
    OLLAMA_CONTEXT_LENGTH=8192 \
    OLLAMA_NUM_PARALLEL=1 \
    ollama serve

$ # wake up model
$ curl localhost:11434/v1/embeddings -d '{"model":"bge-m3","input":"why is the sky blue?","encoding_format":"float"}'
LLaMA Box (pinned to one NVIDIA GPU, with context size 8192 and parallelism 1)
$ CUDA_VISIBLE_DEVICES=1 \
    llama-box --embeddings -ngl 99 --host 0.0.0.0 --port 8080 \
    -np 1 -c 8192 \
    -m gpustack/bge-m3-GGUF/bge-m3-FP16.gguf

3. Run Testing Script

$ ./test.py --base-url http://127.0.0.1:11434/v1
# Ollama
šŸ“Š Average duration: 0.22s over 100 runs
šŸ“‰ Average tokens used: 5963

$ ./test.py
# LLaMA Box
šŸ“Š Average duration: 0.58s over 100 runs
šŸ“‰ Average tokens used: 5967

A list of 2000 random words produces roughly 5800-6100 tokens per request.

Ollama takes roughly 1/3 of the time LLaMA Box does.

4. Investigation

Both Ollama and LLaMA Box are based on llama.cpp, so what causes the more than 2x gap?

Let's look at the launch logs.

Ollama: [launch log screenshot]
LLaMA Box: [launch log screenshot]

llama.cpp processes token encoding/decoding in batches governed by two sizes: the logical batch size n_batch and the physical micro-batch size n_ubatch. The following pseudo-code shows how the two coordinate:

for i = 0; i < n_batch; i += n_ubatch:
    ubatch_size = min(n_ubatch, n_batch - i)
    <submit ubatch_size tokens to process>
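
Purely as an illustration (this is not llama.cpp's actual code), here is the same loop run with numbers from this thread: a ~6000-token request against n_ubatch = 8192 (LLaMA Box's n_ctx) and against a 512-token micro-batch, assuming the logical batch holds the whole request.

def ubatch_sizes(n_batch, n_ubatch):
    # Sizes of the physical micro-batches a logical batch of n_batch tokens
    # is split into, following the loop above.
    return [min(n_ubatch, n_batch - i) for i in range(0, n_batch, n_ubatch)]

# n_batch = n_ubatch = n_ctx = 8192 (LLaMA Box's setting for embeddings):
# the whole ~6000-token request goes through in a single submission.
print(ubatch_sizes(6000, 8192))   # [6000]

# With a 512-token micro-batch, the same request is chopped into 12 pieces.
print(ubatch_sizes(6000, 512))    # [512, 512, ..., 368]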

Ollama does only 1/4 of the work that LLaMA Box does: it takes just 512 tokens out of the 5800-6100 request tokens here, resulting in an incomplete embedding; see https://github.com/gpustack/gpustack/issues/950#issuecomment-2598210455.
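
One way to sanity-check this from the client side is the hedged sketch below (the base URL, model name, and word counts are placeholders): embed a long input and a sub-512-token prefix of it, then compare the two vectors; if they are nearly identical, the tail of the input is being dropped.

import math

import nltk
import openai
from nltk.corpus import words

nltk.download('words')

# Placeholder endpoint (Ollama here) and model name.
client = openai.OpenAI(base_url="http://127.0.0.1:11434/v1", api_key="fake")

def embed(text, model="bge-m3"):
    return client.embeddings.create(model=model, input=text).data[0].embedding

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

word_list = words.words()[:2000]        # roughly 6000 tokens when joined
full = " ".join(word_list)
prefix = " ".join(word_list[:150])      # roughly 450 tokens, under the 512-token batch

print(f"cosine(full, prefix) = {cosine(embed(full), embed(prefix)):.4f}")
# A value very close to 1.0 suggests tokens beyond the first batch are ignored.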

LLaMA Box forces n_batch = n_ubatch = n_ctx for embedding models; see https://github.com/ggml-org/llama.cpp/pull/13076.

5. The True Comparison

To compare fairly, keep each request under 512 tokens, which is roughly 150 words.
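
Where does the 150-word figure come from? A rough estimate from the measurements above (2000 random words produced about 5960 tokens, i.e. roughly 3 tokens per word):

# Back-of-the-envelope: how many random dictionary words fit in one
# 512-token batch, based on the ~5960 tokens measured for 2000 words.
tokens_per_word = 5960 / 2000              # about 2.98
print(int(512 / tokens_per_word))          # about 171; 150 leaves some headroom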

$ ./test.py --base-url http://127.0.0.1:11434/v1 --words-size 150
# Ollama
šŸ“Š Average duration: 0.06s over 100 runs
šŸ“‰ Average tokens used: 448

$ ./test.py --words-size 150
# LLaMA Box
šŸ“Š Average duration: 0.01s over 100 runs
šŸ“‰ Average tokens used: 448

Now you can see that LLaMA Box takes about 1/6 of the time Ollama does.

thxCode avatar Jul 03 '25 02:07 thxCode