llama.cpp
LLM inference in C/C++
On machines with smaller memory and slower processors, it can be useful to reduce the overall number of threads running. For instance, on my MacBook Pro (Intel i5, 16 GB),...
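If it helps, the thread count can be capped with the `-t` flag of `./main`; a minimal sketch, where the model path and prompt are only illustrations:

```
# run with 4 threads instead of the default
./main -m ./models/7B/ggml-model-q4_0.bin -t 4 -n 128 -p "Hello, my name is"
```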
Per [this twitter thread](https://twitter.com/theshawwn/status/1632569215348531201). See commit [here](https://github.com/shawwn/llama/commit/40d99d329a5e38d85904d3a6519c54e6dd6ee9e1).
Hey! Thank you for your amazing work! I'm curious: is it possible to use RLHF feedback after a response to make small incremental adjustments in a tuning process? For example,...
The initial `make` fails with `CLOCK_MONOTONIC undeclared`:

```
I llama.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  unknown
I UNAME_M:  x86_64
I CFLAGS:   -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -mavx...
```
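For what it's worth, this usually comes down to feature-test macros: with `-std=c11`, glibc only exposes `CLOCK_MONOTONIC` if `_POSIX_C_SOURCE` (or `_GNU_SOURCE`) is defined before `<time.h>` is included. A minimal standalone sketch of the kind of workaround that tends to help (the other option is simply building with `-std=gnu11` instead of `-std=c11`):

```
/* Define the POSIX feature-test macro before any system header so that
   glibc exposes CLOCK_MONOTONIC and clock_gettime() under -std=c11. */
#define _POSIX_C_SOURCE 199309L

#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);  /* compiles once the macro is defined */
    printf("%lld.%09ld\n", (long long) ts.tv_sec, ts.tv_nsec);
    return 0;
}
```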
I can achieve around 1 token per second on a Ryzen 7 3700X on Linux with the 65B model and 4-bit quantization. If we use 8-bit instead, would it run...
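As a rough back-of-envelope (weights only, ignoring activations and overhead): 65B parameters × 4 bits ≈ 32.5 GB, while 65B parameters × 8 bits ≈ 65 GB. An 8-bit 65B model would therefore roughly double the bytes read per token, and since this workload is largely memory-bandwidth bound, tokens per second would likely drop by about half.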
benchmarks?
Where are the benchmarks for various hardware, e.g. Apple Silicon?
First of all, tremendous work, Georgi! I managed to run your project with some small adjustments on:
- Intel(R) Core(TM) i7-10700T CPU @ 2.00GHz / 16 GB as an x64 app,...
This would be the initial PR to be able to compile on Windows. In particular, MSVC is very picky about which features you can and cannot use. With...
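One concrete example of that pickiness (an illustration of the general problem, not necessarily a change from this PR): MSVC does not support C99 variable-length arrays, so runtime-sized stack buffers have to become heap allocations:

```
#include <stdlib.h>

void process(int n) {
    /* GCC/Clang accept a VLA here, MSVC does not:
       float buf[n]; */
    float * buf = (float *) malloc(n * sizeof(float));  /* portable replacement */
    if (buf == NULL) {
        return;
    }
    /* ... use buf ... */
    free(buf);
}
```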
This prompt with the 65B model on an M1 Max 64 GB results in a segmentation fault. It works with the 30B model. Are there problems with longer prompts? Related to #12 ```...
The `./main` program currently outputs text and then quits. How hard would it be to add a mode where it could stay running and be ready to accept more text...
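A minimal sketch of what such a stay-running mode could look like: read a line from stdin, generate, repeat. The `generate_reply()` helper below is hypothetical and merely stands in for a call into the existing inference code:

```
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a call into the model; the real program would
   run inference on `prompt` here instead of echoing it back. */
static void generate_reply(const char * prompt) {
    printf("(model output for: %s)\n", prompt);
}

int main(void) {
    char line[4096];
    /* Stay running: keep reading prompts from stdin until EOF or "exit". */
    for (;;) {
        printf("> ");
        fflush(stdout);
        if (fgets(line, sizeof(line), stdin) == NULL) break;  /* EOF */
        line[strcspn(line, "\n")] = '\0';                     /* strip newline */
        if (strcmp(line, "exit") == 0) break;
        generate_reply(line);
    }
    return 0;
}
```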