llama.cpp
LLM inference in C/C++
I have been experimenting with q4_1 quantisation (since [some preliminary results](https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-int-3-and) suggest it should perform better), and noticed that something about the pipeline for the 13B parameter model is broken...
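For context, here is a minimal sketch of the q4_1 idea: each block stores a scale and a minimum, and the weights are packed as 4-bit indices. The block size of 32 and the struct layout are illustrative only, not ggml's exact on-disk format.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Illustrative q4_1-style block: a per-block scale d, a minimum m, and 32
// weights packed as 4-bit indices (two per byte).
struct block_q4_1 {
    float   d;       // scale: (max - min) / 15
    float   m;       // block minimum
    uint8_t qs[16];  // 32 weights, two 4-bit indices per byte
};

static block_q4_1 quantize_block_q4_1(const float *x) {
    float vmin = x[0], vmax = x[0];
    for (int i = 1; i < 32; ++i) {
        vmin = std::min(vmin, x[i]);
        vmax = std::max(vmax, x[i]);
    }
    block_q4_1 b{};
    b.m = vmin;
    b.d = (vmax - vmin) / 15.0f;
    const float id = b.d != 0.0f ? 1.0f / b.d : 0.0f;
    for (int i = 0; i < 16; ++i) {
        long q0 = std::lround((x[2*i + 0] - vmin) * id);
        long q1 = std::lround((x[2*i + 1] - vmin) * id);
        q0 = std::min(q0, 15L);
        q1 = std::min(q1, 15L);
        b.qs[i] = (uint8_t)((q0 & 0x0F) | ((q1 & 0x0F) << 4));
    }
    return b;
}

// Dequantization recovers x ≈ q * d + m for each 4-bit index q.
static void dequantize_block_q4_1(const block_q4_1 &b, float *out) {
    for (int i = 0; i < 16; ++i) {
        out[2*i + 0] = (b.qs[i] & 0x0F) * b.d + b.m;
        out[2*i + 1] = (b.qs[i] >> 4)   * b.d + b.m;
    }
}
```

Unlike q4_0, which keeps only a scale, q4_1 also stores the block minimum, which is why it is expected to lose less accuracy per block.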
See this issue: https://github.com/facebookresearch/llama/pull/73
Following up on the "Store preprocessed prompts" idea, it would be good to be able to take in a text file with a generic prompt & flags to start a chatbot...
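As a rough illustration of the idea, assuming a hypothetical flag (say, -f) pointing at a prompt file; the helper name below is made up for this sketch:

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: read an entire prompt template from a text file so a
// chatbot-style session could be started from, e.g., "-f prompt.txt".
static std::string load_prompt_file(const std::string &path) {
    std::ifstream in(path);
    if (!in) {
        return "";  // caller decides how to handle a missing file
    }
    std::ostringstream ss;
    ss << in.rdbuf();
    return ss.str();
}
```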
Hey, I know someone already posted a similar issue that has since been closed, but I ran into the same thing on Windows 10, with a clone from just yesterday.
This is for issue #91. Treat this as a first draft. There are definitely some things that need to be changed, and they will be changed shortly. I have not benchmarked....
Fixes scanf unused result compile warning.
Adds a context size parameter (-c for short) that allows taking the context size from the user's input. It defaults to the same hardcoded 512.
Fixes the color codes messing up the terminal when the program exits, by printing an ANSI_COLOR_RESET. The reset is included in the SIGINT handler too. A rough sketch of these changes follows.
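A minimal sketch of the above, assuming an illustrative Params struct and flag handling; the field and function names are placeholders, not the exact ones in main.cpp:

```cpp
#include <csignal>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <string>

#define ANSI_COLOR_RESET "\x1b[0m"

// Illustrative parameter struct; field names are placeholders.
struct Params {
    int n_ctx = 512;          // context size, previously hardcoded
    std::string prompt;
};

// Reset terminal colors on Ctrl-C so the shell is not left in a colored state.
static void sigint_handler(int /*signo*/) {
    printf(ANSI_COLOR_RESET "\n");
    exit(130);
}

static bool parse_args(int argc, char **argv, Params &params) {
    for (int i = 1; i < argc; ++i) {
        if (strcmp(argv[i], "-c") == 0 && i + 1 < argc) {
            params.n_ctx = atoi(argv[++i]);   // take context size from user input
        } else if (strcmp(argv[i], "-p") == 0 && i + 1 < argc) {
            params.prompt = argv[++i];
        } else {
            return false;
        }
    }
    return true;
}

int main(int argc, char **argv) {
    Params params;
    if (!parse_args(argc, argv, params)) {
        fprintf(stderr, "usage: %s [-c n_ctx] [-p prompt]\n", argv[0]);
        return 1;
    }

    signal(SIGINT, sigint_handler);

    // Checking the return value of scanf silences the unused-result warning
    // and catches read failures.
    char buf[256];
    if (scanf("%255s", buf) != 1) {
        fprintf(stderr, "failed to read input\n");
    }

    // ... run inference with params.n_ctx ...

    // Also reset colors on normal exit.
    printf(ANSI_COLOR_RESET "\n");
    return 0;
}
```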
I'm not fully familiar with this codebase, so pardon me if I'm wrong. My first attempt to modify the code was to expand the hardcoded context window from 512 to 4096, but...
When converting the model + tokenizer, use the vocabulary size returned by the tokenizer rather than assuming 32000. There are ways that special tokens or other new tokens could be...
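The converter in the repo is a Python script, but the same idea can be sketched with sentencepiece's C++ API: ask the loaded tokenizer for its piece count instead of hardcoding 32000.

```cpp
#include <cstdio>
#include <sentencepiece_processor.h>

// Query the tokenizer for its actual vocabulary size instead of assuming
// 32000, so added/special tokens are accounted for.
int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s tokenizer.model\n", argv[0]);
        return 1;
    }
    sentencepiece::SentencePieceProcessor sp;
    if (!sp.Load(argv[1]).ok()) {
        fprintf(stderr, "failed to load tokenizer model\n");
        return 1;
    }
    const int n_vocab = sp.GetPieceSize();  // use this, not a hardcoded 32000
    printf("vocab size: %d\n", n_vocab);
    return 0;
}
```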