Georgi Gerganov

Results: 420 comments by Georgi Gerganov

Currently, the only way is to manually replace these strings yourself (for example, using regex). Btw, `-ac 768` is better than `-ac 750` - you want the number to be...
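
A minimal sketch of that manual cleanup, assuming the strings in question are bracketed non-speech markers (e.g. `[MUSIC]`) in a whisper.cpp transcript; the file names and the regex pattern are assumptions, not from the comment:

```bash
# run whisper.cpp with the suggested audio context size and write a .txt transcript
./main -m models/ggml-base.en.bin -f samples/jfk.wav -ac 768 -otxt -of transcript

# strip the bracketed markers with a regex (pattern is an assumption)
sed -E 's/\[[A-Za-z_ ]+\]//g' transcript.txt > transcript.clean.txt
```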

Before merging this: the current `Q4_3` format / implementation is not very efficient with ARM NEON. Time per token on M1 Pro:

- `Q4_0` : `48ms`
- `Q4_1` : `55ms` ...

Cool stuff! Here is a sample run on M2 Ultra:

```bash
$ ./sd -m ../models/sd-v1-4-ggml-model-f16.bin -p "a lovely cat" -t 12
[INFO] stable-diffusion.cpp:2191 - loading model from '../models/sd-v1-4-ggml-model-f16.bin'
[INFO] ...
```

Try this patch: https://github.com/ggerganov/llama.cpp/commit/6460f758dbd472653296044d36bed8c4554988f5

On `master` with `Accelerate` I get:

```bash
make clean && LLAMA_NO_METAL=1 make -j && ./llama-bench -m models/mistral-7b-v0.2/ggml-model-fp16.gguf -m models/mistral-7b-v0.2/ggml-model-q8_0.gguf -m models/mistral-7b-v0.2/ggml-model-q4_0.gguf -ngl 0 -n 0
```

| model | size ...

Yes, this assert has to be avoided. The Command-R model has a very large output tensor and its number of elements exceeds the range of `int`. That's why, in order to support it, ...
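
A back-of-the-envelope check of that overflow, assuming an output tensor of roughly 256000 x 12288 elements (Command-R Plus-sized; the exact dimensions here are my assumption):

```bash
# element count of a ~256000 x 12288 output tensor vs. the 32-bit int limit
echo $(( 256000 * 12288 ))   # 3145728000 > INT32_MAX (2147483647)
```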

Here are instructions to trigger this assert:

- clone https://huggingface.co/CohereForAI/c4ai-command-r-plus

```bash
# convert to GGUF
python3 convert-hf-to-gguf.py ~/Data/huggingface/c4ai-command-r-plus/ --outfile models/command-r-plus/ggml-model-f16.gguf --outtype f16

# quantize to Q8_0 + F16 token embeddings
...
```
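
The quantize command is cut off above; a hedged guess at what that step could look like, using the `--token-embedding-type` option of the `quantize` tool (the exact invocation and paths are assumptions):

```bash
# hypothetical: quantize the weights to Q8_0 while keeping token embeddings in F16
./quantize --token-embedding-type f16 \
    models/command-r-plus/ggml-model-f16.gguf \
    models/command-r-plus/ggml-model-q8_0.gguf q8_0
```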

Use the new links from the README, which were updated like an hour or two ago.

Edit: nvm - I see you are using them. I guess these are the...

I think this script should help, but I'm not sure: https://github.com/ggerganov/llama.cpp/issues/324#issuecomment-1476227818

Great task for a `llama.cpp` example! Btw, this is along the lines of the constrained Whisper sampling idea for chess moves: https://twitter.com/ggerganov/status/1640441536403116032

I think this will be another very cool...
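
A rough sketch of what such constrained sampling could look like with llama.cpp's grammar feature; the grammar below is a loose approximation of chess-move notation, not a full SAN definition, and the model path is a placeholder:

```bash
# constrain generation to (approximate) chess-move strings via a GBNF grammar
cat > chess.gbnf << 'EOF'
root    ::= move
move    ::= piece? square capture? square promo?
piece   ::= [KQRBN]
square  ::= [a-h] [1-8]
capture ::= "x"
promo   ::= "=" [QRBN]
EOF

./main -m models/7B/ggml-model-q4_0.gguf -p "Best move:" -n 8 --grammar-file chess.gbnf
```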