Mark Schmidt

95 comments by Mark Schmidt

Does reducing top_p to something like 0.3 or even 0.1 provide better output for these larger models?
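For anyone unfamiliar with the knob, here is a minimal NumPy sketch of nucleus (top-p) sampling (not llama.cpp's actual sampler; the temperature and top_p defaults are just illustrative). Lowering top_p shrinks the candidate pool the sampler draws from:

```python
import numpy as np

def top_p_sample(logits, top_p=0.3, temperature=0.8, rng=np.random.default_rng()):
    """Nucleus (top-p) sampling: keep the smallest set of tokens whose
    cumulative probability reaches top_p, then sample from that set."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                  # tokens by probability, descending
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]                            # lower top_p -> fewer candidates kept
    return rng.choice(keep, p=probs[keep] / probs[keep].sum())

# With top_p=0.1 only the most likely token usually survives, so output gets
# more deterministic; whether that is "better" depends on the task.
```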

Without https://github.com/WebAssembly/memory64 implemented in WebAssembly, you are going to run into show-stopping memory issues with the current 4GB limit due to 32-bit addressing. Do you have a plan...
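Back-of-the-envelope numbers for why 32-bit addressing is the blocker (parameter count rounded to 7B):

```python
# wasm32 linear memory is indexed with 32-bit offsets, so one module tops out at:
wasm32_limit = 2 ** 32                      # 4 GiB
# LLaMA-7B weights alone at f16 are far past that ceiling:
f16_weights = 7e9 * 2                       # 7B parameters * 2 bytes each
print(f"wasm32 ceiling: {wasm32_limit / 2**30:.0f} GiB, "
      f"7B f16 weights: ~{f16_weights / 2**30:.1f} GiB")   # ~13 GiB > 4 GiB
```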

I think that's a reasonable proposal, @Dicklesworthstone. A purely 3-bit implementation of llama.cpp using GPTQ could retain acceptable performance and solve the same memory issues. There's an open issue for...

![llama.cpp on Samsung S22 Ultra at 1.2 tokens per second](https://user-images.githubusercontent.com/5949853/224798872-d3a1e9d8-d0ce-4261-b1a8-247c2a154a9f.png) 1.2 tokens/s on a Samsung S22 Ultra running 4 threads. The S22 obviously has a more powerful processor. But I...

Ah, yes. A 3-bit implementation of 7B would fit fully in 4GB of RAM and lead to much greater speeds. This is the same issue as in https://github.com/ggerganov/llama.cpp/issues/97. 3-bit support...
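Rough arithmetic behind that claim, ignoring the per-block quantization scales and any layers kept in higher precision, so treat these as lower bounds:

```python
params = 7e9                                # LLaMA-7B, rounded
for bits in (4, 3):
    weights_gib = params * bits / 8 / 2**30
    print(f"{bits}-bit 7B weights: ~{weights_gib:.2f} GiB")
# 4-bit: ~3.26 GiB, 3-bit: ~2.44 GiB. The 3-bit variant leaves noticeably more
# headroom under a 4 GiB budget for the KV cache and scratch buffers.
```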

@octoshrimpy I believe Mestrace is saying you should convert and quantize the model on a desktop computer with a lot of RAM first, then move the ~4GB 4-bit quantized model...

Python Bindings for llama.cpp: https://pypi.org/project/llamacpp/0.1.3/ (not mine, just found them)

Since people in this thread are interested in Instruct models, I recommend checking out ChatGLM-6B. I believe it is more capable than Flan-UL2 with just 6B parameters. I have a...

> @MarkSchmidty useful reference. Thanks

From what I have observed, GLM is mostly ignored due to it being weaker with English prompts. But it may turn out to be better...

That is the GPU memory required to run inference, not the model size. ![](https://user-images.githubusercontent.com/5949853/226741304-4fe963d8-3c42-4404-b761-f6fb3316a0fe.png) The official int4 model is 4.06GB on HuggingFace before any pruning.

> It would help if there...
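A rough way to see why the inference figure is larger than the 4.06GB file: the weights are only part of the footprint. The layer count and hidden size below are ChatGLM-6B's published values; the context length and overhead figure are my own assumptions:

```python
# Crude inference-memory estimate: weights + KV cache + runtime overhead.
weights_gib = 4.06                          # int4 checkpoint size on disk
n_layers, d_model = 28, 4096                # ChatGLM-6B architecture
context, bytes_per_value = 2048, 2          # fp16 K and V entries (assumed context length)
kv_cache_gib = 2 * n_layers * context * d_model * bytes_per_value / 2**30
overhead_gib = 0.5                          # scratch buffers, CUDA context, etc. (guess)
print(f"KV cache: ~{kv_cache_gib:.2f} GiB, "
      f"total: ~{weights_gib + kv_cache_gib + overhead_gib:.2f} GiB")
```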