llama-cpp-python
How to improve GPU utilization
I've noticed that GPU utilization is very low during model inference, reaching a maximum of only about 80%, but I want to increase it to 99%. How can I adjust the parameters?

| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...   Off | 00000000:8A:00.0 Off |                    0 |
| N/A   66C    P0   205W / 250W |  14807MiB / 40960MiB |     78%      Default |
import multiprocessing

from llama_cpp import Llama

N_THREADS = multiprocessing.cpu_count()

self.runner = Llama(
    model_path=self.model_name,
    n_gpu_layers=-1,              # offload every layer to the GPU
    chat_format=self.generating_args["chat_format"],
    tokenizer=self.llama_tokenizer,
    flash_attn=True,
    verbose=False,
    n_ctx=1024,
    n_threads=N_THREADS // 2,     # CPU threads for single-token generation
    n_threads_batch=N_THREADS,    # CPU threads for batch/prompt processing
)

x = self.runner.create_chat_completion(
    messages=messages,
    top_p=0.0,                    # with top_k=1 this is effectively greedy decoding
    top_k=1,
    temperature=1,
    max_tokens=512,
    seed=1337,
)
Originally posted by @xiangxinhello in https://github.com/abetlen/llama-cpp-python/issues/1669#issuecomment-2277577719
How do I run llama-cpp-python on an Intel GPU?
Hi, any update on this? We couldn't get above 33% GPU utilization.
Hi, did you find any update on this? I'm kind of stuck in the same boat: GPU memory usage is over 90%, but GPU utilization tops out around 35%. I tried increasing the batch size, and it's still the same.
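Since nvidia-smi's GPU-Util is a coarse signal (it reports the fraction of time any kernel is running, not how busy the SMs are), measuring throughput is usually a better way to check whether a parameter change actually helps. A minimal sketch, assuming a Llama instance like the one above; generation_throughput is a hypothetical helper, and it uses the usage field returned by create_chat_completion:

import time

def generation_throughput(runner, messages, max_tokens=512):
    """Return generated tokens per second for one chat completion call."""
    start = time.perf_counter()
    result = runner.create_chat_completion(
        messages=messages,
        max_tokens=max_tokens,
        temperature=0.0,
    )
    # Elapsed time includes prompt processing as well as generation.
    elapsed = time.perf_counter() - start
    completion_tokens = result["usage"]["completion_tokens"]
    return completion_tokens / elapsed

# Example: compare settings by tokens/s rather than by GPU-Util alone.
# tps = generation_throughput(runner, messages)
# print(f"{tps:.1f} tokens/s")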