openchat
Inference using CPU
Hi all, is it possible to do inference (i.e., chat) using CPU?
I tried setting the following, but it did not work:
export CUDA_VISIBLE_DEVICES="" && python -m ochat.serving.openai_api_server --model openchat/openchat_3.5
@stevenwong You should probably check out llama.cpp for this. It supports the GGUF version of OpenChat that was uploaded by TheBloke.
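In case it helps, here is a minimal sketch of pulling one of TheBloke's GGUF quants and loading it on CPU with llama-cpp-python. The repo id and filename below are just the 0106 Q4_K_M quant used later in this thread; swap in whichever quant you prefer, and treat the n_ctx / n_threads values as examples.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Download the GGUF file from the Hub (cached locally after the first run)
model_path = hf_hub_download(
    repo_id="TheBloke/openchat-3.5-0106-GGUF",
    filename="openchat-3.5-0106.Q4_K_M.gguf",
)

# Load the model on CPU; n_threads controls how many cores llama.cpp uses
llm = Llama(model_path=model_path, n_ctx=1024, n_threads=8)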
Hi, I did exactly this and it works at a reasonable speed on CPU.
Here's my code using a downloaded GGUF model:
from llama_cpp import Llama

llm = Llama(
    model_path="/home/sujit/Downloads/text-generation-webui-main/models/TheBloke_openchat-3.5-0106-GGUF/openchat-3.5-0106.Q4_K_M.gguf",  # Download the model file first
    n_ctx=1024,
    n_threads=8,
)

while True:
    prompt = input("\nUser: ")
    output = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": "You are a personal assistant."},
            {"role": "user", "content": prompt},
        ],
        stream=True,
    )
    # With stream=True, each chunk carries an incremental delta; print tokens as they arrive
    for chunk in output:
        delta = chunk['choices'][0]['delta']
        if 'role' in delta:
            print(delta['role'], end=': ')
        elif 'content' in delta:
            if delta['content'] != "":
                print(delta['content'], end='')