phalexo

137 comments of phalexo

Not sure what you mean by storing the token. If you navigate to Hugging Face and try to access the Llama models, it will ask you to go through a quick process....
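
For reference, a minimal sketch of the usual authentication flow once access has been granted (this assumes the `huggingface_hub` CLI is installed; the token value is a placeholder):

```bash
# Install the Hugging Face hub CLI (assumes Python/pip are already set up)
pip install -U huggingface_hub

# Log in interactively; this stores the access token locally for later use
huggingface-cli login

# Alternatively, export the token for non-interactive scripts
export HF_TOKEN=hf_xxx   # placeholder; substitute your own token
```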

> hey there, does this happend during troubleshoot or debugging ?

I'm not sure what the difference is between "troubleshoot" and debugging. It certainly happens often when it "asks" me if...

You can see here how it realizes that it created a bunch of unnecessary files with duplicated code:

```bash
Dev step 206
After reviewing the task implementation, I found some issues...
```

Ollama has a history file in the ~/.ollama folder. Does Ollama constantly parse that cache?

> The default `mixtral` Modelfile only offloads like 22 layers, as noted previously. For people with 24GB of VRAM, I have found that the `q3_K_S` model can be completely offloaded...
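
As a concrete illustration, here is a minimal sketch of an Ollama Modelfile that requests full offload. The model tag and the `num_gpu` value are assumptions; pick the layer count that matches your model and VRAM:

```bash
# Hypothetical Modelfile forcing all layers onto the GPU
cat > Modelfile <<'EOF'
FROM mixtral:8x7b-instruct-v0.1-q3_K_S
PARAMETER num_gpu 33
EOF

ollama create mixtral-full-offload -f Modelfile
ollama run mixtral-full-offload
```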

Is there any indication that someone is looking into this? Also, I am wondering what effect the LLAMA_CUDA_FORCE_MMQ=on setting has on performance. If the optimized cuBLAS...
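
For anyone who wants to measure the effect themselves, a sketch of building upstream llama.cpp with MMQ forced on (these flag names are from the llama.cpp CMake build of that era):

```bash
# Build llama.cpp with the MMQ kernels forced on instead of the cuBLAS matmuls
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DLLAMA_CUBLAS=on -DLLAMA_CUDA_FORCE_MMQ=on
cmake --build build --config Release
```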

I don't think we should confuse two separate problems. Sometimes there really is not enough VRAM. Sometimes you run into the cuBLAS error 15, which was introduced starting with v0.1.12. Which...
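
A quick way to rule out a genuine VRAM shortfall before blaming the cuBLAS error (assumes an NVIDIA driver with nvidia-smi available):

```bash
# Watch per-GPU memory while the model loads; a real OOM shows up here,
# whereas the cuBLAS error 15 can occur without memory being exhausted
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv -l 1
```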

I am running on the Maxwell architecture.

```bash
git clone --recursive https://github.com/jmorganca/ollama.git
cd ollama/llm/llama.cpp
vi generate_linux.go
```

```go
//go:generate cmake -S ggml -B ggml/build/cuda -DLLAMA_CUBLAS=on -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on -DLLAMA_CUDA_FORCE_MMQ=on
//go:generate cmake...
```
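
After editing generate_linux.go, the rebuild follows the standard Go workflow for the repo. A sketch, run from the checkout above:

```bash
cd ../..            # back to the ollama repo root
go generate ./...   # re-runs the cmake //go:generate directives
go build .          # produces the patched ollama binary
```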

You have to rebuild with LLAMA_CUDA_FORCE_MMQ=on. Performance may be a bit worse, but it would work.