
How to use GPU?

Open imwide opened this issue 1 year ago • 21 comments

I run llama-cpp-python on my new PC, which has an RTX 3060 with 12 GB of VRAM. This is my code:

from llama_cpp import Llama

# Load the model; no GPU offload is configured, so inference runs on the CPU
llm = Llama(model_path="./wizard-mega-13B.ggmlv3.q4_0.bin", n_ctx=2048)

def generate(params):
    print(params["prompt"])
    output = llm(params["prompt"], max_tokens=params["max_tokens"],
                 stop=params["stop"], echo=params["echo"])
    return output
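For context, a call shaped like the one below (the values are illustrative, not from the original post) would exercise generate and read the completion text out of the dict that llama-cpp-python returns:

result = generate({
    "prompt": "Q: What is the capital of France? A:",
    "max_tokens": 32,
    "stop": ["\n"],
    "echo": False,
})
# Completions come back OpenAI-style: a dict with a "choices" list
print(result["choices"][0]["text"])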

This code works and I get the results I want, but inference is terribly slow: even a few tokens can take up to 10 seconds. How do I minimize this time? I don't think my GPU is doing the heavy lifting here...

imwide avatar Aug 05 '23 20:08 imwide
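For anyone landing here: llama-cpp-python only touches the GPU if the wheel was built with GPU support and layers are explicitly offloaded via n_gpu_layers. A minimal sketch, assuming a CUDA (cuBLAS) build of the library, which was the relevant backend around the time of this issue; the layer count is a value to tune for your VRAM, not a prescription:

# Rebuild the wheel with GPU support first, e.g.:
#   CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install --force-reinstall --no-cache-dir llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./wizard-mega-13B.ggmlv3.q4_0.bin",
    n_ctx=2048,
    n_gpu_layers=40,  # layers offloaded to VRAM; raise or lower to fit 12 GB
)

If nvidia-smi shows near-zero VRAM use during generation, the installed wheel was built without GPU support and needs reinstalling with the flag above.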