openchat
Inference using CPU
Hi all, is it possible to do inference (i.e., chat) using CPU?
I tried setting the following, but it did not work:
export CUDA_VISIBLE_DEVICES="" && python -m ochat.serving.openai_api_server --model openchat/openchat_3.5
@stevenwong You should probably check out llama.cpp for this. It supports the GGUF version of OpenChat that was uploaded by TheBloke.
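In case it helps, here is a minimal sketch of pulling one of TheBloke's GGUF quants and loading it on CPU with llama-cpp-python. The repo id and filename below are just the 0106 Q4_K_M quant used later in this thread; swap in whichever quant you prefer, and treat the n_ctx / n_threads values as examples.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Download the GGUF file from the Hub (cached locally after the first run)
model_path = hf_hub_download(
    repo_id="TheBloke/openchat-3.5-0106-GGUF",
    filename="openchat-3.5-0106.Q4_K_M.gguf",
)

# Load the model on CPU; n_threads controls how many cores llama.cpp uses
llm = Llama(model_path=model_path, n_ctx=1024, n_threads=8)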
Hi, I did exactly this and it works at a reasonable speed on CPU.
Here's my code using a downloaded GGUF model:
from llama_cpp import Llama

llm = Llama(
    model_path="/home/sujit/Downloads/text-generation-webui-main/models/TheBloke_openchat-3.5-0106-GGUF/openchat-3.5-0106.Q4_K_M.gguf",  # Download the model file first
    n_ctx=1024,
    n_threads=8,
)

while True:
    prompt = input("\nUser: ")
    output = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": "You are a personal assistant."},
            {"role": "user", "content": prompt},
        ],
        stream=True,
    )
    # With stream=True, each chunk carries an incremental delta; print tokens as they arrive
    for chunk in output:
        delta = chunk['choices'][0]['delta']
        if 'role' in delta:
            print(delta['role'], end=': ')
        elif 'content' in delta:
            if delta['content'] != "":
                print(delta['content'], end='')