llama-cpp-python
How to use GPU?
I run llama-cpp-python on my new PC, which has an RTX 3060 with 12 GB VRAM. This is my code:
from llama_cpp import Llama

# Load the 13B wizard-mega model with a 2048-token context window.
llm = Llama(model_path="./wizard-mega-13B.ggmlv3.q4_0.bin", n_ctx=2048)

def generate(params):
    print(params["prompt"])
    # Run a completion with the caller-supplied sampling settings.
    output = llm(params["prompt"], max_tokens=params["max_tokens"], stop=params["stop"], echo=params["echo"])
    return output
This code works and I get the results I want, but inference is terribly slow: generating just a few tokens takes up to 10 seconds. How do I minimize this time? I don't think my GPU is doing the heavy lifting here...
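From what I've read, offloading work to the GPU needs the package built with cuBLAS support and an n_gpu_layers value passed to Llama. This is just my understanding of what that would look like (the layer count is a guess), not something I've confirmed works:

# Rough sketch of GPU offload as I understand it -- assumes the wheel was
# (re)installed with cuBLAS enabled, e.g.:
#   CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install --force-reinstall llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./wizard-mega-13B.ggmlv3.q4_0.bin",
    n_ctx=2048,
    n_gpu_layers=40,  # how many layers to push to VRAM; guessing 12 GB fits most of a 13B q4_0
)

Is this the right approach, or is there something else I need to set?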