Robert Sinclair
Hello, usually when quantizing I first convert a Hugging Face model to an F16 GGUF, then quantize that into my target quantizations. I have noticed that convert does not produce a "pure"...
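For context, the two-step workflow described above typically looks roughly like this with llama.cpp (script and binary names assume a current llama.cpp checkout and a downloaded model; adjust paths to your setup — this is a sketch, not runnable without the model files):

```sh
# Step 1: convert the Hugging Face checkpoint to an F16 GGUF
python convert_hf_to_gguf.py /path/to/hf-model --outtype f16 --outfile model-f16.gguf

# Step 2: quantize the F16 GGUF to the desired quantization type
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
```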
`llama-cli -c 1024 -t 6 -m codegeex4-all-9b.q4_k.gguf -p "You are my assistant." -e -cnv --chat-template chatml`

```
== Running in interactive mode. ==
- Press Ctrl+C to interject at any...
```
## 🚀 Feature
Allow f32q5_k and f16q5_k quantizations.
## Motivation
From my tests, F16 (for the output and embedding tensors) combined with Q5_K_M for the other tensors is the best quantization.
## Alternatives...
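If I understand the request correctly, recent `llama-quantize` builds already expose per-tensor overrides that approximate this mix; a sketch (flag names assume a recent llama.cpp build, and the file names are illustrative):

```sh
# Keep the output and token-embedding tensors at F16, quantize the rest to Q5_K_M
./llama-quantize --output-tensor-type f16 --token-embedding-type f16 \
    model-f16.gguf model-q5_k_m-f16emb.gguf Q5_K_M
```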
Connecting to 127.0.0.1:8376 gives a 404 error.
```
running build_ext
No `pyyaml_build_config` setting found.
error: [WinError 2] The system cannot find the file specified
[end of output]
```
note: This error originates from a subprocess, and is likely...
If I try to send a file to Gemini, I get "The current model Google (gemini-1.5-flash-latest) does not support sending files". But Gemini does in fact support sending files such as audio and video...
```
Starting LOLLMS Web UI...
[ASCII-art "LOLLMS" startup banner]
```
I did everything in the README (I am using CPU only). When running `python app.py` I get:

```
AssertionError: Torch not compiled with CUDA enabled
```
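This assertion usually means the app hard-codes `.cuda()` or `device='cuda'` while the installed torch build is CPU-only. A common workaround is to select the device at runtime; a minimal sketch (`pick_device` is a hypothetical helper, not part of the app's code):

```python
def pick_device() -> str:
    """Return "cuda" if a CUDA-enabled torch build and a GPU are present, else "cpu"."""
    try:
        import torch  # may be a CPU-only build, or missing entirely
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    return "cpu"

# In the app, move models/tensors with .to(device) instead of .cuda(),
# and load checkpoints with torch.load(..., map_location=device).
device = pick_device()
```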
Let's say I wish to merge LLAMA-3-8B with Mistral 7B to create a MoE. How should I proceed? Or, for example, two small models (3B/4B) of different architectures.
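For same-architecture experts (e.g. two Mistral-7B fine-tunes), mergekit's `mergekit-moe` takes a config roughly like the sketch below; mixing different architectures (Llama-3 with Mistral, or a 3B with a 4B) is generally not supported, since the experts' FFN shapes and tokenizers must match. Model names and prompts here are purely illustrative:

```yaml
base_model: mistralai/Mistral-7B-Instruct-v0.3
gate_mode: hidden        # route tokens using hidden-state similarity to the positive prompts
dtype: bfloat16
experts:
  - source_model: mistralai/Mistral-7B-Instruct-v0.3
    positive_prompts: ["chat", "general assistance"]
  - source_model: HuggingFaceH4/zephyr-7b-beta   # illustrative second expert
    positive_prompts: ["reasoning", "code"]
```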
Considering I have metered internet and not-so-great resources, I followed your guide and the notebook. I used this YAML:

```
slices:
  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.3
        layer_range: [0, ...
```