
Coqui engine takes breaks mid-sentence to load.

tomwarias opened this issue 1 year ago • 6 comments

The Coqui engine takes breaks mid-sentence to load. Sometimes it pauses between words or even in the middle of saying a word. I tried to adjust the settings but nothing works. I'm using an i7 10th gen with an RTX 3060.

tomwarias · Feb 25 '24

Your GPU should be fast enough for real-time. Is PyTorch installed with CUDA?

KoljaB · Feb 25 '24

Yes, I followed every step of the readme. I may have a problem with CUDA, because my GPU isn't used by the LLM model either, but I don't know how to solve it. I use Windows.

tomwarias · Feb 26 '24

I guess PyTorch has no CUDA support. Please check with:

import torch
print(torch.cuda.is_available())

If it returns False, please try installing the latest torch build with CUDA support:

pip install torch==2.2.0+cu118 torchaudio==2.2.0+cu118 --index-url https://download.pytorch.org/whl/cu118

(you may need to adjust 118 to your CUDA version; this is for CUDA 11.8)
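
If it helps, here is a quick sanity check after reinstalling (a minimal sketch; the exact version strings depend on your install):

import torch

print(torch.__version__)          # should end in +cu118 for the CUDA 11.8 build
print(torch.version.cuda)         # CUDA version PyTorch was built against, e.g. "11.8"
print(torch.cuda.is_available())  # should print True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce RTX 3060"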

To use the GPU with the LLM under Windows you need to compile llama-cpp-python with cuBLAS support:

  • Set environment variables:
    set CMAKE_ARGS=-DLLAMA_CUBLAS=on
    set FORCE_CMAKE=1
    
  • You may also need to copy the four MSBuildExtensions files for your CUDA version (11.8 or 12.3) from:
    C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\extras\visual_studio_integration\MSBuildExtensions   
    
    to
    C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Microsoft\VC\v170\BuildCustomizations
    

After that, install and compile llama-cpp-python with:

pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir --verbose

After that you can set n_gpu_layers in the creation parameters of llama.cpp to define how many layers of the LLM should be offloaded to the GPU.
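
For example, a minimal sketch (the model path is a placeholder and n_gpu_layers is just an illustrative value; adjust it to what fits in your VRAM):

from llama_cpp import Llama

# verbose=True prints the loader log, which should include a line about how
# many layers were offloaded to the GPU if the cuBLAS build is working.
llm = Llama(
    model_path="path/to/your-model.gguf",  # placeholder, point this at your GGUF model
    n_gpu_layers=20,                        # illustrative; raise it until VRAM is full, or -1 for all layers
    verbose=True,
)

print(llm("Hello, how are you?", max_tokens=32)["choices"][0]["text"])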

KoljaB · Feb 26 '24

I did that and it still happens, and I'm also unable to install llama_cpp with set CMAKE_ARGS=-DLLAMA_CUBLAS=on.

tomwarias · Feb 26 '24

What's the result of print(torch.cuda.is_available())? Both torch and llama.cpp have to run with CUDA (GPU support) to achieve real-time speed.

The installation steps above for llama.cpp work on my Windows 10 system; if they fail on yours I'm not sure how much further support I can offer. llama.cpp is not my library, and this can be a complex issue.

KoljaB · Feb 27 '24