Documentation for other llamafile binaries
Compiling llamafile produces the following binaries, which need more documentation than what --help provides:
llamafile, llamafile-perplexity, llamafile-quantize, llamafile-server, llava-quantize, zipalign
For example, what does llamafile-perplexity actually do and how is it different from llamafile or llamafile-server? I would be happy to help with docs.
I was thinking about a doc for developers approaching from an application perspective in https://github.com/Mozilla-Ocho/llamafile/issues/168. Is yours aimed at model packagers/maintainers? If so, what would you propose the file be? Something like AIPUBLISHER.md? The README should of course have the quickstart, but maybe that doc can then go into further detail.
The llamafile binaries need user documentation with examples. The 4.1 branch man/doc page formatting is an improvement over 4.0, but it needs to be fleshed out for developers and users. I looked through the llama.cpp examples and they're bare bones as well; e.g., llamafile-quantize --help is thin:
sophiaparafina@Sophias-Mac-mini llamafile % llamafile-quantize --help
usage: /usr/local/bin/llamafile-quantize [--help] [--allow-requantize] [--leave-output-tensor] [--pure] model-f32.gguf [model-quant.gguf] type [nthreads]
--allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
--leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
--pure: Disable k-quant mixtures and quantize all tensors to the same type
Allowed quantization types:
2 or Q4_0 : 3.56G, +0.2166 ppl @ LLaMA-v1-7B
3 or Q4_1 : 3.90G, +0.1585 ppl @ LLaMA-v1-7B
8 or Q5_0 : 4.33G, +0.0683 ppl @ LLaMA-v1-7B
9 or Q5_1 : 4.70G, +0.0349 ppl @ LLaMA-v1-7B
10 or Q2_K : 2.63G, +0.6717 ppl @ LLaMA-v1-7B
12 or Q3_K : alias for Q3_K_M
11 or Q3_K_S : 2.75G, +0.5551 ppl @ LLaMA-v1-7B
12 or Q3_K_M : 3.07G, +0.2496 ppl @ LLaMA-v1-7B
13 or Q3_K_L : 3.35G, +0.1764 ppl @ LLaMA-v1-7B
15 or Q4_K : alias for Q4_K_M
14 or Q4_K_S : 3.59G, +0.0992 ppl @ LLaMA-v1-7B
15 or Q4_K_M : 3.80G, +0.0532 ppl @ LLaMA-v1-7B
17 or Q5_K : alias for Q5_K_M
16 or Q5_K_S : 4.33G, +0.0400 ppl @ LLaMA-v1-7B
17 or Q5_K_M : 4.45G, +0.0122 ppl @ LLaMA-v1-7B
18 or Q6_K : 5.15G, -0.0008 ppl @ LLaMA-v1-7B
7 or Q8_0 : 6.70G, +0.0004 ppl @ LLaMA-v1-7B
1 or F16 : 13.00G @ 7B
0 or F32 : 26.00G @ 7B
COPY : only copy tensors, no quantizing
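This is the kind of thing I'd like the docs to spell out. If I'm reading the usage line above correctly, a typical invocation would look something like this (the file names and thread count here are hypothetical, and Q4_K_M is just one of the types from the table):
llamafile-quantize llama-2-7b.F16.gguf llama-2-7b.Q4_K_M.gguf Q4_K_M 8
That would read the full-precision GGUF, write a Q4_K_M-quantized copy, and use 8 threads; per the table above, Q4_K_M trades roughly +0.05 perplexity for a ~3.8G file on LLaMA-v1-7B.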
What I would like to see is documentation along the lines of the MDN Web Docs Project and How-To Guides. I can make a stab at it and submit a PR, but it would need technical and editorial review.
The zipalign command has a pretty good man page because I wrote it myself.
zipalign(1) General Commands Manual zipalign(1)
NAME
zipalign – PKZIP for LLMs
SYNOPSIS
zipalign [FLAG...] ZIP FILE...
DESCRIPTION
zipalign adds aligned uncompressed files to a PKZIP archive.
This tool is designed to concatenate gigabytes of LLM weights to an
executable. This command goes 10x faster than `zip -j0`. Unlike zip you
are not required to use the .com file extension for it to work. But most
importantly, this tool has a flag that lets you insert zip files that are
aligned on a specific boundary. The result is things like GPUs that have
specific memory alignment requirements will now be able to perform math
directly on the zip file's mmap()'d weights.
This tool always operates in an append-only manner. Unlike the InfoZIP
zip(1) command, zipalign does not reflow existing assets to shave away
space. For example, if zipalign is used on an existing PKZIP archive to
replace an existing asset, then the bytes for the old revision of the
asset will still be there, along with any alignment gaps that currently
exist in the file between assets.
The same concept also applies to the central directory listing that's
stored at the end of the file. When changes are made, the old central
directory is left behind as junk data. Therefore it's important, when
adding multiple files to an archive at once, that the files all be passed
in arguments at once, rather than calling this command multiple times.
OPTIONS
The following options are available:
-h Show help.
-v Operate in verbose mode.
-N Run in nondeterministic mode. This will cause the date/time of
inserted assets to reflect the file modified time.
-a INT Byte alignment for inserted zip assets. This must be a power of two.
It defaults to 65536 since that ensures your asset will be page-
aligned on all conceivable platforms, both now and in the future.
-j Strip directory components. The filename of each input filepath
will be used as the zip asset name. This is otherwise known as
the basename. An error will be raised if the same zip asset name
ends up being specified multiple times.
-0 Store zip assets without compression. This is the default. This
option must be chosen when adding weights to a llamafile,
otherwise it won't be possible to map them into memory. Using -0
goes orders of magnitude faster than using -6 compression.
-6 Store zip assets with sweet spot compression. Any value between
-0 and -9 is accepted as choices for compression level. Using -6
will oftentimes go 10x faster than -9 and only has a marginal
increase of size. Note uncompression speeds are unaffected.
-9 Store zip assets with the maximum compression. This takes a very
long time to compress. Uncompression will go just as fast. This
might be a good idea when publishing archives that'll be widely
consumed via the Internet for a long time.
SEE ALSO
unzip(1), llamafile(1)
AUTHORS
Justine Alexandra Roberts Tunney ⟨[email protected]⟩
Linux 5.15.131-0-lts December 5, 2023 Linux 5.15.131-0-lts
I'm going to push a change in a moment so that when you say zipalign --help it'll show that man page.
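To make the DESCRIPTION above concrete, the typical packaging step looks roughly like this (the file names are placeholders; -j strips directory components and -0, the default, keeps the weights uncompressed so they can be mmap()'d):
cp llamafile mymodel.llamafile
zipalign -j0 mymodel.llamafile mymodel.Q4_K_M.gguf
The -a flag only matters if the default 65536-byte alignment isn't what you want.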
I've just made the --help flag much more helpful for each program. Every llamafile binary will now display its rendered man page when that flag is passed in the terminal or console.
> For example, what does llamafile-perplexity actually do and how is it different from llamafile or llamafile-server? I would be happy to help with docs.
Good question. I don't think I fully understand it, but I've made my best pass at improving the manual. I know the llama.cpp authors will break determinism if it means perplexity scores go down. One thing I've seen them measure is how well the software and model together can reproduce content from Wikipedia. The closer the output is to verbatim regurgitation, the better the perplexity score. That helps sort out subjective differences of opinion w.r.t. breaking changes. It also gives you a yardstick that tells you by how much a quantization format is compromising information retrieval.
LLAMAFILE-PERPLEXITY(1) General Commands Manual LLAMAFILE-PERPLEXITY(1)
NAME
llamafile-perplexity - LLM benchmarking tool
SYNOPSIS
llamafile-perplexity [flags...]
DESCRIPTION
Perplexity is one of the most common metrics for evaluating language
models. The llamafile-perplexity program can be used to gauge the quality
of an LLM implementation. It is defined as the exponentiated average
negative log-likelihood of a sequence, calculated with exponent base e.
OPTIONS
The following options are available:
-h, --help
Show help message and exit.
-m FNAME, --model FNAME
Model path (default: models/7B/ggml-model-f16.gguf)
-f FNAME, --file FNAME
Raw data input file.
-t N, --threads N
Number of threads to use during generation (default: nproc/2)
-s SEED, --seed SEED
Random Number Generator (RNG) seed (default: -1, use random seed
for < 0)
EXAMPLE
One dataset commonly used in the llama.cpp community for measuring
perplexity is wikitext-2-raw. To use it when testing how well both your
model and llamafile are performing you could run the following:
wget https://cosmo.zip/pub/datasets/wikitext-2-raw/wiki.test.raw
llamafile-perplexity -m model.gguf -f wiki.test.raw -s 31337
This can sometimes lead to surprising conclusions, like how Q5 weights
might be better for a particular model than Q6.
SEE ALSO
llamafile(1)
Llamafile Manual December 5, 2023 Llamafile Manual
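For anyone who wants the "exponentiated average negative log-likelihood" definition written out, this is the standard formula the description refers to: for a tokenized sequence x_1, ..., x_N,

$$\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\ln p(x_i \mid x_{<i})\right)$$

so a lower score means the model assigned higher probability to the held-out text.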
You're encouraged to contribute to the documentation too! If you're willing to learn Troff, the authoritative documentation files end with a .1 extension; that's what gets turned into the PDF / PostScript / man documentation in our releases. The most important file is here: https://github.com/Mozilla-Ocho/llamafile/blob/main/llamafile/llamafile.1 and the Troff syntax is explained here: https://manpages.ubuntu.com/manpages/trusty/pt/man7/mdoc.samples.7.html If you're unfamiliar with Troff, then just send us a pull request editing the README markdown and I'll convert it to Troff for you!
Oh, last thing worth mentioning. Here's your dev cycle if you edit the .1 Troff manuals:
nano llamafile/llamafile.1
man llamafile/llamafile.1
The way I'm converting them to PDF is:
groff -Tps -man llamafile/llamafile.1 >llamafile/llamafile.ps
ps2pdf llamafile/llamafile.ps llamafile/llamafile.pdf
Thank you for fixing the man pages. However, llamafile would benefit from user docs with a level of detail similar to llama.cpp's, e.g., https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md. I'm working my way through the options and adding annotations. I'll submit a PR for review when I'm done.
Oh, good point. I missed that README. Yes, we can add that content.
If we copy and paste all that content, then we should add llama.cpp's MIT notice to the top of the llamafile.1 file.
@spara How long do you want me to hold off on adding that README content to the Troff file? Could I get an ETA on your PR?