klosax
This is the first step towards a unified LLM API and interface that would handle any supported architecture. https://github.com/ggerganov/llama.cpp/issues/1602#issuecomment-1568215353 https://github.com/ggerganov/ggml/issues/185 https://github.com/ggerganov/ggml/pull/145#issuecomment-1544733902
> > `general.architecture: String`: describes what architecture this model implements. Values can include llama, mpt, gpt-neox, gpt-j, gpt-2, bloom, etc.
> >
> > It might make more sense to make something...
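For illustration, a minimal sketch of how a unified loader could dispatch on the `general.architecture` value. The names below are made up for the example; this is not the actual llama.cpp loader:

```
#include <functional>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical: model metadata parsed from the file as string key/value pairs.
using Metadata = std::map<std::string, std::string>;

struct Model { /* weights, hyperparameters, ... */ };

// One loader per supported architecture, registered under its
// general.architecture value.
static const std::map<std::string, std::function<Model(const Metadata&)>> loaders = {
    { "llama",    [](const Metadata&) { return Model{}; /* build llama graph */ } },
    { "gpt-neox", [](const Metadata&) { return Model{}; /* build gpt-neox graph */ } },
    { "mpt",      [](const Metadata&) { return Model{}; /* build mpt graph */ } },
};

Model load_model(const Metadata& md) {
    const std::string arch = md.at("general.architecture");
    const auto it = loaders.find(arch);
    if (it == loaders.end()) {
        throw std::runtime_error("unsupported architecture: " + arch);
    }
    return it->second(md);
}
```

Adding a new architecture then only means registering another entry; the rest of the API stays the same for every model file.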
> That being said, that reminds me - it might be a good idea to include suggested prompt formats as one of the standardised config parameters. Feel free to +1...
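As a sketch of what a standardised prompt-format parameter could buy: an executor could substitute the user prompt into a template stored in the file. The key name and placeholder syntax here are assumptions, nothing in the spec discussion fixes them yet:

```
#include <iostream>
#include <string>

// Hypothetical stored template, e.g.
// general.prompt_format = "### Instruction:\n{prompt}\n\n### Response:\n"
std::string apply_prompt_format(std::string tmpl, const std::string& prompt) {
    const std::string placeholder = "{prompt}";
    const std::string::size_type pos = tmpl.find(placeholder);
    if (pos != std::string::npos) {
        tmpl.replace(pos, placeholder.size(), prompt);
    }
    return tmpl;
}

int main() {
    const std::string tmpl = "### Instruction:\n{prompt}\n\n### Response:\n";
    std::cout << apply_prompt_format(tmpl, "Hiking is") << std::endl;
}
```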
> `vocabulary.huggingface_tokenizer_json: String`: the entirety of the HF tokenizer.json for a given model. Optional, but highly recommended for best tokenization quality with supported executors.

Why would json give a...
> I wasn't aware of the existence of other ways to store the tokenization data, and I'd have to look into it. Do you have any further information about it...
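The proposal is just to embed the HF tokenizer.json verbatim as one string-valued metadata entry, so nothing is lost in conversion. A conversion-side sketch (the metadata store here is a stand-in, not a real writer API):

```
#include <fstream>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

// Read tokenizer.json verbatim; the whole file becomes a single
// string value, so no tokenizer detail is lost in conversion.
std::string read_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    std::ostringstream ss;
    ss << in.rdbuf();
    return ss.str();
}

int main() {
    std::map<std::string, std::string> metadata;  // hypothetical writer-side store
    metadata["vocabulary.huggingface_tokenizer_json"] = read_file("tokenizer.json");
}
```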
With a 7B model the freeze lasts about 5 seconds; with a 30B model, about 20 seconds. I tried using --no-mmap with the 30B model and the system froze for 5 minutes(!) right...
The prompt eval time is 2.5 times slower also. Release 305eb5a output:
```
./main -m ../llama-33b-supercot-ggml-q5_1.bin -c 2048 -p "Hiking is" -n 16 -t 6
main: seed = 1682775656
llama.cpp:...
```
Thanks. So it seems to be related to Ubuntu and/or AMD CPUs. I'm running Ubuntu 20.04 with an AMD Ryzen 5 CPU.
I found out what the problem is: the model did not fit into RAM. With the b1ee8f5 release it works even if the model doesn't fit in RAM, but...
Maybe implement a parameter to disable pinned memory, since the previous version worked fine with swapped memory.
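A sketch of what such an opt-out could look like. The flag semantics and the CUDA page-locked allocation are assumptions on my part, the thread doesn't pin down how pinning is done; the point is only the fallback to pageable memory:

```
#include <cstddef>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical opt-out: fall back to ordinary pageable memory when the
// user disables pinning (e.g. via a --no-pinned flag), so the OS can
// swap the buffer out instead of locking it in RAM.
void * alloc_host_buffer(std::size_t size, bool use_pinned) {
    if (use_pinned) {
        void * ptr = nullptr;
        if (cudaMallocHost(&ptr, size) == cudaSuccess) {
            return ptr;  // page-locked: fast DMA transfers, never swappable
        }
        // pinned allocation failed (e.g. not enough lockable RAM): fall through
    }
    return std::malloc(size);  // pageable: slower transfers, but swappable
}
```

The caller would also have to remember which path was taken, since pinned buffers must be released with cudaFreeHost while pageable ones use free.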