llama-cpp-python
How to improve GPU utilization
I've noticed that GPU utilization is very low during model inference, reaching a maximum of only about 80%, but I want to increase it to 99%. How can I adjust the parameters?

| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...   Off | 00000000:8A:00.0 Off |                    0 |
| N/A   66C    P0   205W / 250W |  14807MiB / 40960MiB |     78%      Default |
import multiprocessing

from llama_cpp import Llama

N_THREADS = multiprocessing.cpu_count()

self.runner = Llama(
    model_path=self.model_name,
    n_gpu_layers=-1,              # offload every layer to the GPU
    chat_format=self.generating_args["chat_format"],
    tokenizer=self.llama_tokenizer,
    flash_attn=True,
    verbose=False,
    n_ctx=1024,
    n_threads=N_THREADS // 2,     # CPU threads for single-token generation
    n_threads_batch=N_THREADS,    # CPU threads for batch/prompt processing
)

x = self.runner.create_chat_completion(
    messages=messages,
    top_p=0.0,                    # with top_k=1 this is effectively greedy decoding
    top_k=1,
    temperature=1,
    max_tokens=512,
    seed=1337,
)
Originally posted by @xiangxinhello in https://github.com/abetlen/llama-cpp-python/issues/1669#issuecomment-2277577719
How do I run llama-cpp-python on an Intel GPU?
Hi, any update on this? We couldn't get above 33% GPU utilization.
Hi, did you find any update on this? I'm kind of stuck in the same boat: GPU memory usage is over 90%, but GPU utilization tops out around 35%. I tried increasing the batch size, and it's still the same.
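Since nvidia-smi's GPU-Util is a coarse signal (it reports the fraction of time any kernel is running, not how busy the SMs are), measuring throughput is usually a better way to check whether a parameter change actually helps. A minimal sketch, assuming a Llama instance like the one above; generation_throughput is a hypothetical helper, and it uses the usage field returned by create_chat_completion:

import time

def generation_throughput(runner, messages, max_tokens=512):
    """Return generated tokens per second for one chat completion call."""
    start = time.perf_counter()
    result = runner.create_chat_completion(
        messages=messages,
        max_tokens=max_tokens,
        temperature=0.0,
    )
    # Elapsed time includes prompt processing as well as generation.
    elapsed = time.perf_counter() - start
    completion_tokens = result["usage"]["completion_tokens"]
    return completion_tokens / elapsed

# Example: compare settings by tokens/s rather than by GPU-Util alone.
# tps = generation_throughput(runner, messages)
# print(f"{tps:.1f} tokens/s")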