phalexo
Not sure what you mean by storing a token. If you navigate to Hugging Face and try to access the Llama models, it will ask you to go through a quick process....
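For reference, a minimal sketch of that process once access has been granted, assuming the `huggingface_hub` CLI is installed (the repo id below is just an example, and the local token path varies by version):

```bash
pip install -U huggingface_hub
# Log in with an access token from https://huggingface.co/settings/tokens;
# the CLI stores it locally (typically under ~/.cache/huggingface/token)
huggingface-cli login
# After accepting the license on the model page, gated downloads work
huggingface-cli download meta-llama/Llama-2-7b-hf
```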
> hey there, does this happend during troubleshoot or debugging ?

Not sure what the difference is between "troubleshoot" and debugging. It certainly happens often when it "asks" me if...
You see how here it realizes that it created a bunch of unnecessary files, with duplicated code?

```bash
Dev step 206
After reviewing the task implementation, I found some issues...
```
Ollama has a history file in the ~/.ollama folder. Does ollama constantly parse that cache?
> The default `mixtral` Modelfile only offloads like 22 layers, as noted previously.

For people with 24GB of VRAM, I have found that the `q3_K_S` model can be completely offloaded...
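A sketch of how to force full offload with a custom Modelfile, assuming the `q3_K_S` tag below matches what you pulled; `num_gpu` is the Ollama parameter for how many layers to send to the GPU, and a value above the actual layer count just means "as many as possible":

```bash
cat > Modelfile <<'EOF'
FROM mixtral:8x7b-instruct-v0.1-q3_K_S
# Offload all layers to the GPU instead of the default split
PARAMETER num_gpu 99
EOF
ollama create mixtral-gpu -f Modelfile
ollama run mixtral-gpu
```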
Is there any indication that someone is looking into this? Also, what effect does the LLAMA_CUDA_FORCE_MMQ=on setting have on performance? If the optimized cuBLAS...
I don't think we should confuse two separate problems. Sometimes there is really not enough VRAM. Sometimes you run into the cuBLAS 15 error, which was introduced starting with v0.1.12. Which...
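One rough way to tell the two apart: watch VRAM while the model loads. A genuine out-of-memory failure shows memory filling up first, whereas the cuBLAS error (status 15 is CUBLAS_STATUS_NOT_SUPPORTED) shows up even with plenty of VRAM free.

```bash
# In a second terminal while the model loads
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```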
I am running on the Maxwell architecture.

```bash
git clone --recursive https://github.com/jmorganca/ollama.git
cd ollama/llm/llama.cpp
vi generate_linux.go
```

```go
//go:generate cmake -S ggml -B ggml/build/cuda -DLLAMA_CUBLAS=on -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on -DLLAMA_CUDA_FORCE_MMQ=on
//go:generate cmake...
```
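After that edit, a sketch of the rebuild, assuming the repo's usual Go toolchain workflow:

```bash
cd ../..            # back to the repository root
go generate ./...   # re-runs the edited //go:generate cmake directives
go build .          # produces a local ./ollama binary with MMQ forced on
./ollama serve
```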
Do you have tensor cores on your GPU? I doubt it.
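A quick way to check, assuming a reasonably recent driver (the `compute_cap` query field is missing from very old nvidia-smi versions); tensor cores first appeared with compute capability 7.0 (Volta):

```bash
nvidia-smi --query-gpu=name,compute_cap --format=csv
# e.g. "GeForce GTX 980, 5.2" -> Maxwell, no tensor cores
```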
You have to rebuild with LLAMA_CUDA_FORCE_MMQ=on; performance may be a bit worse, but it would work.