
[Feature Request] Pull model through Ollama API instead of invoking ollama binary

Open. yeahdongcn opened this issue 1 year ago.

User story: I want to benchmark Ollama running inside a Docker container, while installing llm_benchmark in a venv or conda environment on a different host or container.

Ollama API doc: https://github.com/ollama/ollama/blob/main/docs/api.md#pull-a-model
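Based on the linked API doc, pulling a model is a `POST /api/pull` request that streams progress as one JSON object per line. A minimal sketch in Python, assuming the default port 11434; the helper names `pull_model` and `fmt_progress` are hypothetical, not part of llm_benchmark:

```python
import json
import urllib.request

OLLAMA_HOST = "http://localhost:11434"  # assumption: default Ollama port

def pull_model(name: str, host: str = OLLAMA_HOST):
    """POST /api/pull and yield the streamed progress objects."""
    req = urllib.request.Request(
        f"{host}/api/pull",
        data=json.dumps({"name": name}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # the endpoint streams one JSON object per line
            yield json.loads(line)

def fmt_progress(status: dict) -> str:
    """Render one streamed status object as a short progress line."""
    if "total" in status and "completed" in status:
        pct = 100 * status["completed"] / status["total"]
        return f"{status['status']}: {pct:.0f}%"
    return status.get("status", "")
```

Usage would be something like `for s in pull_model("phi3:3.8b"): print(fmt_progress(s))`, which works against any reachable Ollama server, containerized or not.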

yeahdongcn · Jul 04 '24 01:07

I sent a PR that queries device information through the Ollama API: https://github.com/ollama/ollama/pull/5479. It could replace GPUtil for checking the available VRAM.
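For the VRAM side, recent Ollama versions also expose `GET /api/ps`, which lists running models with their VRAM footprint (`size_vram`, in bytes). A sketch of reading it over HTTP; the `total_vram_used` helper is hypothetical and just illustrates the response shape:

```python
import json
import urllib.request

def list_running(host: str = "http://localhost:11434") -> dict:
    """GET /api/ps: running models, including their VRAM footprint."""
    with urllib.request.urlopen(f"{host}/api/ps") as resp:
        return json.load(resp)

def total_vram_used(ps: dict) -> int:
    """Sum the size_vram fields (bytes) across all loaded models."""
    return sum(m.get("size_vram", 0) for m in ps.get("models", []))
```

This reports VRAM used by loaded models, not free VRAM, so it is a complement to (not a drop-in replacement for) what GPUtil reports.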

yeahdongcn · Jul 04 '24 08:07

+1. I run Ollama in a Docker container and I'd like to test it via the HTTP API.

grigio · Nov 10 '24 19:11

I was also looking for a way to use Docker/Podman to run the benchmark.

However, I took a different approach and decided to build a custom container image with both Ollama and llm_benchmark; the repository is here. It contains the standard Ollama binary installation.

Of course there are a hundred ways to go about it and optimize the build process, but I'd say it's a good start :smile:.
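Such an image can be sketched roughly like this. This is a minimal sketch only, assuming Ollama's official install script and a pip-installable llm-benchmark package; the repository linked above may do it differently:

```dockerfile
# Sketch: custom image bundling Ollama and llm_benchmark.
# Assumes the official Ollama install script and pip package names.
FROM python:3.12-bookworm

# Install the standard Ollama binary
RUN curl -fsSL https://ollama.com/install.sh | sh

# Install the benchmark tool into the image's Python environment
RUN pip install --no-cache-dir llm-benchmark

# Start the Ollama server, wait for it to come up, then run the benchmark
CMD ["/bin/sh", "-c", "ollama serve & sleep 5 && llm_benchmark run --no-sendinfo"]
```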

EDIT 1: I tested the Python 3.12 Debian Bookworm base image on an Ubuntu 24.04 AMD64 system with an NVIDIA GTX 1060 6GB, using Podman version 4.9.3 and podman-compose version 1.2.0. The proof of concept works OK. Of course, if you have another instance of Ollama running, or are doing heavy work on the GPU during testing, that will (obviously) affect the results.

EDIT 2: Nothing stops you from using this image as your "main" Ollama server image, or from sharing the same bind-mount volume with your main Ollama server so you don't need to re-download the models every time. Just take care that if you have two instances of Ollama running, then a) they do NOT use the same port (for this benchmark I actually disable the network port mapping), and b) they do NOT have read/write access to the same bind mount at the same time (I'm not sure anything would go wrong as long as you do NOT update the models simultaneously, but better safe than sorry).

EDIT 3: Run results on my NVIDIA GTX 1060 6GB below. Not sure how good they are, but GPU memory usage went up quite a bit (to around 5-5.5 GB) and so did the temperature (around 70°C) during the tests, as reported by nvidia-smi on the host.

root@99ca81a705f7:/opt/app# llm_benchmark run --no-sendinfo
-------Linux----------
{'id': '0', 'name': 'NVIDIA GeForce GTX 1060 6GB', 'driver': '535.183.01', 'gpu_memory_total': '6144.0 MB', 'gpu_memory_free': '5647.0 MB', 'gpu_memory_used': '425.0 MB', 'gpu_load': '1.0%', 'gpu_temperature': '52.0°C'}
Only one GPU card
Total memory size : 31.30 GB
cpu_info: Intel(R) Xeon(R) CPU E3-1275 v3 @ 3.50GHz
gpu_info: NVIDIA GeForce GTX 1060 6GB
os_version: "Debian GNU/Linux 12 (bookworm)"
ollama_version: 0.5.4
----------
LLM models file path:/opt/llm-benchmark/lib/python3.12/site-packages/llm_benchmark/data/benchmark_models_16gb_ram.yml
Checking and pulling the following LLM models
phi3:3.8b
qwen2:7b
gemma2:9b
mistral:7b
llama3.1:8b
llava:7b
llava:13b
----------
model_name =    mistral:7b
prompt = Write a step-by-step guide on how to bake a chocolate cake from scratch.
eval rate:            26.14 tokens/s
prompt = Develop a python function that solves the following problem, sudoku game
eval rate:            26.26 tokens/s
prompt = Create a dialogue between two characters that discusses economic crisis
eval rate:            27.06 tokens/s
prompt = In a forest, there are brave lions living there. Please continue the story.
eval rate:            25.92 tokens/s
prompt = I'd like to book a flight for 4 to Seattle in U.S.
eval rate:            27.50 tokens/s
--------------------
Average of eval rate:  26.576  tokens/s
----------------------------------------
model_name =    llama3.1:8b
prompt = Write a step-by-step guide on how to bake a chocolate cake from scratch.
eval rate:            12.34 tokens/s
prompt = Develop a python function that solves the following problem, sudoku game
eval rate:            11.99 tokens/s
prompt = Create a dialogue between two characters that discusses economic crisis
eval rate:            12.46 tokens/s
prompt = In a forest, there are brave lions living there. Please continue the story.
eval rate:            12.30 tokens/s
prompt = I'd like to book a flight for 4 to Seattle in U.S.
eval rate:            12.80 tokens/s
--------------------
Average of eval rate:  12.378  tokens/s
----------------------------------------
model_name =    phi3:3.8b
prompt = Write a step-by-step guide on how to bake a chocolate cake from scratch.
eval rate:            40.43 tokens/s
prompt = Develop a python function that solves the following problem, sudoku game
eval rate:            34.81 tokens/s
prompt = Create a dialogue between two characters that discusses economic crisis
eval rate:            42.20 tokens/s
prompt = In a forest, there are brave lions living there. Please continue the story.
eval rate:            38.68 tokens/s
prompt = I'd like to book a flight for 4 to Seattle in U.S.
eval rate:            42.36 tokens/s
--------------------
Average of eval rate:  39.696  tokens/s
----------------------------------------
model_name =    qwen2:7b
prompt = Write a step-by-step guide on how to bake a chocolate cake from scratch.
eval rate:            24.51 tokens/s
prompt = Develop a python function that solves the following problem, sudoku game
eval rate:            23.28 tokens/s
prompt = Create a dialogue between two characters that discusses economic crisis
eval rate:            23.33 tokens/s
prompt = In a forest, there are brave lions living there. Please continue the story.
eval rate:            23.56 tokens/s
prompt = I'd like to book a flight for 4 to Seattle in U.S.
eval rate:            23.74 tokens/s
--------------------
Average of eval rate:  23.684  tokens/s
----------------------------------------
model_name =    gemma2:9b
prompt = Explain Artificial Intelligence and give its applications.
eval rate:            4.05 tokens/s
prompt = How are machine learning and AI related?
eval rate:            4.21 tokens/s
prompt = What is Deep Learning based on?
eval rate:            4.12 tokens/s
prompt = What is the full form of LSTM?
eval rate:            4.21 tokens/s
prompt = What are different components of GAN?
eval rate:            4.08 tokens/s
--------------------
Average of eval rate:  4.134  tokens/s
----------------------------------------
model_name =    llava:7b
prompt = Describe the image, /opt/llm-benchmark/lib/python3.12/site-packages/llm_benchmark/data/img/sample1.jpg
eval rate:            29.07 tokens/s
prompt = Describe the image, /opt/llm-benchmark/lib/python3.12/site-packages/llm_benchmark/data/img/sample2.jpg
eval rate:            28.78 tokens/s
prompt = Describe the image, /opt/llm-benchmark/lib/python3.12/site-packages/llm_benchmark/data/img/sample3.jpg
eval rate:            28.70 tokens/s
prompt = Describe the image, /opt/llm-benchmark/lib/python3.12/site-packages/llm_benchmark/data/img/sample4.jpg
eval rate:            29.22 tokens/s
prompt = Describe the image, /opt/llm-benchmark/lib/python3.12/site-packages/llm_benchmark/data/img/sample5.jpg
eval rate:            28.63 tokens/s
--------------------
Average of eval rate:  28.88  tokens/s
----------------------------------------
model_name =    llava:13b
prompt = Describe the image, /opt/llm-benchmark/lib/python3.12/site-packages/llm_benchmark/data/img/sample1.jpg
eval rate:            2.27 tokens/s
prompt = Describe the image, /opt/llm-benchmark/lib/python3.12/site-packages/llm_benchmark/data/img/sample2.jpg
eval rate:            2.25 tokens/s
prompt = Describe the image, /opt/llm-benchmark/lib/python3.12/site-packages/llm_benchmark/data/img/sample3.jpg
eval rate:            2.21 tokens/s
prompt = Describe the image, /opt/llm-benchmark/lib/python3.12/site-packages/llm_benchmark/data/img/sample4.jpg
eval rate:            2.29 tokens/s
prompt = Describe the image, /opt/llm-benchmark/lib/python3.12/site-packages/llm_benchmark/data/img/sample5.jpg
eval rate:            2.10 tokens/s
--------------------
Average of eval rate:  2.224  tokens/s
----------------------------------------

luckylinux · Jan 01 '25 13:01

It would be useful to note in the README that llm-benchmark doesn't support the Ollama server API yet.

abishekmuthian · Feb 05 '25 08:02

I used vast.ai to run the benchmark. It uses NVIDIA Docker containers. Here is a recorded video showing how I run the benchmark inside a launched Docker container: https://youtu.be/WQMlad-rxiQ

chuangtc · Feb 05 '25 16:02