Documentation for other llamafile binaries
Compiling llamafile produces the following binaries, which need more documentation than what --help provides:
llamafile, llamafile-perplexity, llamafile-quantize, llamafile-server, llava-quantize, zipalign
For example, what does llamafile-perplexity actually do and how is it different from llamafile or llamafile-server? I would be happy to help with docs.
I was thinking about a doc for developers approaching from an application perspective in https://github.com/Mozilla-Ocho/llamafile/issues/168. Is yours aimed at model packagers/maintainers? If so, what would you propose the file be? Something like AIPUBLISHER.md? The README should of course have the quickstart, but maybe that doc can then go into further detail.
The llamafile binaries need user documentation with examples. The 4.1 branch man/doc page formatting is an improvement over 4.0, but it needs to be fleshed out for developers and users. I looked through the llama.cpp examples and they're bare bones as well; e.g., llamafile-quantize --help is thin:
sophiaparafina@Sophias-Mac-mini llamafile % llamafile-quantize --help
usage: /usr/local/bin/llamafile-quantize [--help] [--allow-requantize] [--leave-output-tensor] [--pure] model-f32.gguf [model-quant.gguf] type [nthreads]
--allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
--leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
--pure: Disable k-quant mixtures and quantize all tensors to the same type
Allowed quantization types:
2 or Q4_0 : 3.56G, +0.2166 ppl @ LLaMA-v1-7B
3 or Q4_1 : 3.90G, +0.1585 ppl @ LLaMA-v1-7B
8 or Q5_0 : 4.33G, +0.0683 ppl @ LLaMA-v1-7B
9 or Q5_1 : 4.70G, +0.0349 ppl @ LLaMA-v1-7B
10 or Q2_K : 2.63G, +0.6717 ppl @ LLaMA-v1-7B
12 or Q3_K : alias for Q3_K_M
11 or Q3_K_S : 2.75G, +0.5551 ppl @ LLaMA-v1-7B
12 or Q3_K_M : 3.07G, +0.2496 ppl @ LLaMA-v1-7B
13 or Q3_K_L : 3.35G, +0.1764 ppl @ LLaMA-v1-7B
15 or Q4_K : alias for Q4_K_M
14 or Q4_K_S : 3.59G, +0.0992 ppl @ LLaMA-v1-7B
15 or Q4_K_M : 3.80G, +0.0532 ppl @ LLaMA-v1-7B
17 or Q5_K : alias for Q5_K_M
16 or Q5_K_S : 4.33G, +0.0400 ppl @ LLaMA-v1-7B
17 or Q5_K_M : 4.45G, +0.0122 ppl @ LLaMA-v1-7B
18 or Q6_K : 5.15G, -0.0008 ppl @ LLaMA-v1-7B
7 or Q8_0 : 6.70G, +0.0004 ppl @ LLaMA-v1-7B
1 or F16 : 13.00G @ 7B
0 or F32 : 26.00G @ 7B
COPY : only copy tensors, no quantizing
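This is the kind of thing I'd like the docs to spell out. If I'm reading the usage line above correctly, a typical invocation would look something like this (the file names and thread count here are hypothetical, and Q4_K_M is just one of the types from the table):
llamafile-quantize llama-2-7b.F16.gguf llama-2-7b.Q4_K_M.gguf Q4_K_M 8
That would read the full-precision GGUF, write a Q4_K_M-quantized copy, and use 8 threads; per the table above, Q4_K_M trades roughly +0.05 perplexity for a ~3.8G file on LLaMA-v1-7B.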
What I would like to see is documentation along the lines of the MDN Web Docs Project and How-To Guides. I can make a stab at it and submit a PR, but it would need technical and editorial review.
The zipalign command has a pretty good man page because I wrote it myself.
zipalign(1) General Commands Manual zipalign(1)
NAME
zipalign – PKZIP for LLMs
SYNOPSIS
zipalign [FLAG...] ZIP FILE...
DESCRIPTION
zipalign adds aligned uncompressed files to a PKZIP archive.
This tool is designed to concatenate gigabytes of LLM weights to an
executable. This command goes 10x faster than `zip -j0`. Unlike zip you
are not required to use the .com file extension for it to work. But most
importantly, this tool has a flag that lets you insert zip files that are
aligned on a specific boundary. The result is things like GPUs that have
specific memory alignment requirements will now be able to perform math
directly on the zip file's mmap()'d weights.
This tool always operates in an append-only manner. Unlike the InfoZIP
zip(1) command, zipalign does not reflow existing assets to shave away
space. For example, if zipalign is used on an existing PKZIP archive to
replace an existing asset, then the bytes for the old revision of the
asset will still be there, along with any alignment gaps that currently
exist in the file between assets.
The same concept also applies to the central directory listing that's
stored at the end of the file. When changes are made, the old central
directory is left behind as junk data. Therefore it's important, when
adding multiple files to an archive at once, that the files all be passed
in arguments at once, rather than calling this command multiple times.
OPTIONS
The following options are available:
-h Show help.
-v Operate in verbose mode.
-N Run in nondeterministic mode. This will cause the date/time of
inserted assets to reflect the file modified time.
-a INT Byte alignment for inserted zip assets. This must be a power of two.
It defaults to 65536 since that ensures your asset will be page-
aligned on all conceivable platforms, both now and in the future.
-j Strip directory components. The filename of each input filepath
will be used as the zip asset name. This is otherwise known as
the basename. An error will be raised if the same zip asset name
ends up being specified multiple times.
-0 Store zip assets without compression. This is the default. This
option must be chosen when adding weights to a llamafile,
otherwise it won't be possible to map them into memory. Using -0
goes orders of magnitude faster than using -6 compression.
-6 Store zip assets with sweet spot compression. Any value between
-0 and -9 is accepted as choices for compression level. Using -6
will oftentimes go 10x faster than -9 and only has a marginal
increase of size. Note uncompression speeds are unaffected.
-9 Store zip assets with the maximum compression. This takes a very
long time to compress. Uncompression will go just as fast. This
might be a good idea when publishing archives that'll be widely
consumed via the Internet for a long time.
SEE ALSO
unzip(1), llamafile(1)
AUTHORS
Justine Alexandra Roberts Tunney ⟨[email protected]⟩
Linux 5.15.131-0-lts December 5, 2023 Linux 5.15.131-0-lts
I'm going to push a change in a moment so that when you say zipalign --help it'll show that man page.
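To make the DESCRIPTION above concrete, the typical packaging step looks roughly like this (the file names are placeholders; -j strips directory components and -0, the default, keeps the weights uncompressed so they can be mmap()'d):
cp llamafile mymodel.llamafile
zipalign -j0 mymodel.llamafile mymodel.Q4_K_M.gguf
The -a flag only matters if the default 65536-byte alignment isn't what you want.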
I've just made the --help flag much more helpful for each program. Every llamafile binary will now display its rendered man page when that flag is passed in the terminal or console.
> For example, what does llamafile-perplexity actually do and how is it different from llamafile or llamafile-server? I would be happy to help with docs.
Good question. I don't think I fully understand it, but I've made my best pass at improving the manual. I know the llama.cpp authors will break determinism if it means perplexity scores go down. One thing I've seen them measure is how well the software and model together can reproduce content from Wikipedia. The closer the output is to verbatim regurgitation, the better the perplexity score. That helps sort out subjective differences of opinion w.r.t. breaking changes. It also gives you a yardstick that tells you by how much a quantization format is compromising information retrieval.
LLAMAFILE-PERPLEXITY(1) General Commands Manual LLAMAFILE-PERPLEXITY(1)
NAME
llamafile-perplexity - LLM benchmarking tool
SYNOPSIS
llamafile-perplexity [flags...]
DESCRIPTION
Perplexity is one of the most common metrics for evaluating language
models. The llamafile-perplexity program can be used to gauge the quality
of an LLM implementation. It is defined as the exponentiated average
negative log-likelihood of a sequence, calculated with exponent base e.
OPTIONS
The following options are available:
-h, --help
Show help message and exit.
-m FNAME, --model FNAME
Model path (default: models/7B/ggml-model-f16.gguf)
-f FNAME, --file FNAME
Raw data input file.
-t N, --threads N
Number of threads to use during generation (default: nproc/2)
-s SEED, --seed SEED
Random Number Generator (RNG) seed (default: -1, use random seed
for < 0)
EXAMPLE
One dataset commonly used in the llama.cpp community for measuring
perplexity is wikitext-2-raw. To use it when testing how well both your
model and llamafile are performing you could run the following:
wget https://cosmo.zip/pub/datasets/wikitext-2-raw/wiki.test.raw
llamafile-perplexity -m model.gguf -f wiki.test.raw -s 31337
This can sometimes lead to surprising conclusions, like how Q5 weights
might be better for a particular model than Q6.
SEE ALSO
llamafile(1)
Llamafile Manual December 5, 2023 Llamafile Manual
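For anyone who wants the "exponentiated average negative log-likelihood" definition written out, this is the standard formula the description refers to: for a tokenized sequence x_1, ..., x_N,

$$\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\ln p(x_i \mid x_{<i})\right)$$

so a lower score means the model assigned higher probability to the held-out text.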
You're encouraged to contribute to the documentation too! If you're willing to learn Troff, the authoritative documentation files end with a .1 extension; that's what gets turned into the PDF / PostScript / man documentation in our releases. The most important file is here: https://github.com/Mozilla-Ocho/llamafile/blob/main/llamafile/llamafile.1 and the Troff syntax is explained here: https://manpages.ubuntu.com/manpages/trusty/pt/man7/mdoc.samples.7.html If you're unfamiliar with Troff, then just send us a pull request editing the README markdown and I'll convert it to Troff for you!
Oh, last thing worth mentioning. Here's your dev cycle if you edit the .1 Troff manuals:
nano llamafile/llamafile.1
man llamafile/llamafile.1
The way I'm converting them to PDF is:
groff -Tps -man llamafile/llamafile.1 >llamafile/llamafile.ps
ps2pdf llamafile/llamafile.ps llamafile/llamafile.pdf
Thank you for fixing the man pages. However, llamafile would benefit from user docs with a level of detail similar to llama.cpp's, e.g., https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md. I'm working my way through the options and adding annotations. I'll submit a PR for review when I'm done.
Oh, good point. I missed that README. Yes, we can add that content.
If we copy and paste all that content, then we should add llama.cpp's MIT notice to the top of the llamafile.1 file.
@spara How long do you want me to hold off on adding that README content to the Troff file? Could I get an ETA on your PR?