
🚀 Dockerize llamacpp

Open bernatvadell opened this issue 1 year ago • 18 comments

First of all, thank you to the entire community for your effort. The work you do is impressive.

I'm going to try to do my bit by dockerizing this client and making it more accessible.

If you have time, I would recommend creating a pipeline to publish the image to Docker Hub, which would make it easier to use, e.g. docker pull ggerganov/llamacpp or similar.

To make it work, just execute these commands:

  • Build the image (it does not exist on Docker Hub yet): docker build -t llamacpp .
  • Run the program:
docker run -v $(pwd)/models:/models llamacpp -m /models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -t 8 -n 512

If you want to run in interactive mode, don't forget to tell Docker that too.

docker run -v $(pwd)/models:/models llamacpp -m /models/7B/ggml-model-q4_0.bin -t 8 -n 256 --repeat_penalty 1.0 --color -i -r "User:" \
                                           -p \
"Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.

User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User:"

bernatvadell avatar Mar 14 '23 13:03 bernatvadell

Something weird here. What am I doing wrong?

$ cat /data/llama/7B/params.json;echo
{"dim": 4096, "multiple_of": 256, "n_heads": 32, "n_layers": 32, "norm_eps": 1e-06, "vocab_size": -1}
$ docker run -v models:/models llamacpp-converter "/data/llama/7B" 1
Traceback (most recent call last):
  File "/app/convert-pth-to-ggml.py", line 67, in <module>
    with open(fname_hparams, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/data/llama/7B/params.json'

gjmulder avatar Mar 14 '23 15:03 gjmulder

I believe you’d need to run docker run -v /data/llama:/models llamacpp-converter "/models/7B" 1

j-f1 avatar Mar 14 '23 15:03 j-f1

I think you don't have the volume mounted correctly.

Keep in mind that when you run the container, it runs in isolation; it does not have access to the files on your host. To give it access, you need to expose those files through a volume.

Here it is in detail:

# Mount a volume to expose the "models" subfolder of the current working directory (pwd)
# at the container path /models-only-exists-in-your-container, specify which image to run
# (llamacpp-main), then pass llamacpp's normal arguments:
docker run \
  -v $(pwd)/models:/models-only-exists-in-your-container \
  llamacpp-main \
  -m /models-only-exists-in-your-container/7B/ggml-model-q4_0.bin \
  -p "Building a website can be done in 10 simple steps:" \
  -t 8 \
  -n 512
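
If in doubt about what actually got mounted, listing the mounted path from inside the container is a quick sanity check. A sketch, assuming the image has ls available (a very minimal image may not):

docker run --rm -v $(pwd)/models:/models-only-exists-in-your-container --entrypoint ls llamacpp-main -l /models-only-exists-in-your-container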

bernatvadell avatar Mar 14 '23 15:03 bernatvadell

I believe you’d need to run docker run -v /data/llama:/models llamacpp-converter "/models/7B" 1

That works. Thx.

gjmulder avatar Mar 14 '23 15:03 gjmulder

Where's the quantization step occurring?

Logically this should occur in the tools Dockerfile, which implies running make there too and having a wrapper script that first calls convert-pth-to-ggml.py and then quantize.

However, there is discussion about adding 8-bit quantization, so it might be better to call the wrapper script with a parameter, say --convert, to do the conversion first, and then call it again with --quantize <q4_0|q8_0> to quantize, for maximum flexibility, e.g.:

docker run -v models:/models llamacpp-converter --convert "/models/7B/" 1
docker run -v models:/models llamacpp-converter --quantize q4_0 "/models/7B/"

EDIT: Issue #106 indicates that passing additional params to ./quantize.sh will become necessary as well.
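
For illustration, a minimal sketch of what such a wrapper entrypoint could look like; the script layout, python3 invocation and binary paths below are assumptions, and the actual script may differ:

#!/bin/sh
# Illustrative wrapper entrypoint; paths and binary locations are assumed, not taken from the repo.
set -e

cmd="$1"
shift

case "$cmd" in
  --convert)
    # e.g. --convert "/models/7B/" 1
    python3 ./convert-pth-to-ggml.py "$@"
    ;;
  --quantize)
    # e.g. --quantize "/models/7B/ggml-model-f16.bin" "/models/7B/ggml-model-q4_0.bin"
    ./quantize "$@"
    ;;
  *)
    echo "Unknown command: $cmd" >&2
    exit 1
    ;;
esac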

gjmulder avatar Mar 14 '23 16:03 gjmulder

Where's the quantization step occurring?

Logically this should occur in the tools Dockerfile, which implies running make there too and having a wrapper script that first calls convert-pth-to-ggml.py and then quantize.

However, there is discussion about adding 8-bit quantization, so it might be better to call the wrapper script with a parameter, say --convert, to do the conversion first, and then call it again with --quantize <4bit|8bit> to quantize, for maximum flexibility, e.g.:

docker run -v models:/models llamacpp-converter --convert "/models/7B/" 1
docker run -v models:/models llamacpp-converter --quantize 4bit "/models/7B/"

EDIT: Issue #106 indicates that passing additional params to ./quantize.sh will become necessary as well.

Done.

docker run -v $(pwd)/models:/models llamacpp-tools --quantize "/models/7B/ggml-model-f16.bin" "/models/7B/ggml-model-q4_0.bin"

docker run -v $(pwd)/models:/models llamacpp-tools --convert "/models/7B/" 1

bernatvadell avatar Mar 14 '23 17:03 bernatvadell

Great job. Just a suggestion: What about adding the @gjmulder build instructions to the README?

borgstad avatar Mar 14 '23 17:03 borgstad

Great job. Just a suggestion: What about adding the @gjmulder build instructions to the README?

We can add instructions for compiling the image locally. However, the simplest thing would be to publish the Docker image on Docker Hub; then it would not be necessary to clone the repository or anything similar, just to have Docker Engine or Docker Desktop installed.

docker run -v $(pwd)/models:/models ggerganov/llamacpp-tools --convert "/models/7B/" 1

or

docker run -v $(pwd)/models:/models ggerganov/llamacpp -m /models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -t 8 -n 512

bernatvadell avatar Mar 14 '23 17:03 bernatvadell

I can do that for you; I'll commit it here before the PR is merged.

bernatvadell avatar Mar 14 '23 19:03 bernatvadell

Hi @ggerganov

I've created a new GitHub Action, which is only triggered when a push to master happens.

If you look at the file, you'll see I have published the image under my own account in order to test it.

I would recommend that you create a Docker Hub account (if you don't already have one) and create a new repository with whatever name you want (e.g. llamacpp).

If you register as the user ggerganov, then we can publish the image as ggerganov/llamacpp.

Once you're registered, you can generate a token for the GitHub Action to have access. You should put both the username and the token in the GitHub secrets:

  • DOCKERHUB_USERNAME
  • DOCKERHUB_TOKEN
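
For example, with the GitHub CLI the secrets could be set roughly like this (a sketch; the values are placeholders, and adding them through the repository's Settings page works just as well):

gh secret set DOCKERHUB_USERNAME --body "ggerganov"
gh secret set DOCKERHUB_TOKEN --body "<dockerhub-access-token>"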

Before merging the PR, we must change the image name that we have in the pipeline YAML.

For those who want to try it, you can use the images that I have published under my account:

e.g. the light version (only main, 28.32 MB):

docker run -v $(pwd)/models:/models bernatvadell/llamacpp:latest -m /models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -t 8 -n 512

full version (3.61GB):

docker run -v $(pwd)/models:/models bernatvadell/llamacpp:full --run -m /models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -t 8 -n 512

docker run -v $(pwd)/models:/models bernatvadell/llamacpp:full --convert "/models/7B/" 1

docker run -v $(pwd)/models:/models bernatvadell/llamacpp:full --quantize "/models/7B/ggml-model-f16.bin" "/models/7B/ggml-model-q4_0.bin" 2

bernatvadell avatar Mar 15 '23 16:03 bernatvadell

Another option could be to use the GitHub registry which wouldn’t need any additional setup beyond pointing the builder to the right image name.

j-f1 avatar Mar 15 '23 17:03 j-f1

Yep. In any case, I will adapt the YAML to the registry configuration.

bernatvadell avatar Mar 15 '23 17:03 bernatvadell

Another option could be to use the GitHub registry which wouldn’t need any additional setup beyond pointing the builder to the right image name.

Does this mean I don't have to create a Docker Hub account?

ggerganov avatar Mar 15 '23 19:03 ggerganov

A short note here: since I am running Docker Desktop on Windows, I needed to change $(pwd) to %cd%:

docker run -v %cd%/models:/models bernatvadell/llamacpp:full --run -m /models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -t 8 -n 512

docker run -v %cd%/models:/models bernatvadell/llamacpp:full --convert "/models/7B/" 1

docker run -v %cd%/models:/models bernatvadell/llamacpp:full --quantize "/models/7B/ggml-model-f16.bin" "/models/7B/ggml-model-q4_0.bin" 2

Thanks for the great work!!

Matthias-Johannes-Mack avatar Mar 15 '23 21:03 Matthias-Johannes-Mack

In light of the recent Docker policy changes, I would recommend pushing to ghcr.io instead. See how to log in to ghcr.io here: https://github.com/docker/login-action#github-container-registry
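
For reference, logging in to ghcr.io for a manual push looks roughly like this (a sketch, assuming a personal access token with the write:packages scope in the CR_PAT environment variable):

echo $CR_PAT | docker login ghcr.io -u <your-github-username> --password-stdin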

In addition, adding a README section on running with Docker would be useful.

Niek avatar Mar 16 '23 05:03 Niek

Yes, that makes sense; I can do it tonight.

bernatvadell avatar Mar 16 '23 08:03 bernatvadell

Ok, now the pipeline seems to work correctly.

The flow that I have defined is the following:

  • Whenever a PR is opened, the image build is launched, but no push is performed. This way we can validate that it still compiles correctly.
  • When a push to master happens, it builds and pushes the images to the GitHub registry:

Light version (only includes the main)

ghcr.io/ggerganov/llama.cpp:light

Full version (includes python, main and quantize scripts)

ghcr.io/ggerganov/llama.cpp:full

Versioned images

On the other hand, each image is also pushed with a tag versioned by the commit hash:

ghcr.io/ggerganov/llama.cpp:light-<commit_hash>

ghcr.io/ggerganov/llama.cpp:full-<commit_hash>
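
For anyone who just wants to try the published images once they are up, pulling them should be as simple as (assuming the packages are public):

docker pull ghcr.io/ggerganov/llama.cpp:light
docker pull ghcr.io/ggerganov/llama.cpp:full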

If you have any suggestions, they are welcome!

Thank you

bernatvadell avatar Mar 16 '23 11:03 bernatvadell

@ggerganov can I do a squash?

bernatvadell avatar Mar 16 '23 13:03 bernatvadell

Good morning!

I've included a couple of new commands in the bash tools:

  1. A new command to download the indicated model: --download (-d): downloads the original LLaMA model from the CDN at https://agi.gpt4.org/llama/
  2. A command to perform an "all-in-one": --all-in-one (-a): executes --download, --convert & --quantize

On the other hand, I have updated the README.md file explaining how to start using the Docker image.

Docker

Prerequisites

  • Docker must be installed and running on your system.
  • Create a folder to store the big models & intermediate files (e.g. I'm using /llama/models)

Images

We have two Docker images available for this project:

  1. ghcr.io/ggerganov/llama.cpp:full: This image includes both the main executable and the tools to convert LLaMA models into ggml and quantize them to 4 bits.
  2. ghcr.io/ggerganov/llama.cpp:light: This image only includes the main executable file.

Usage

The easiest way to download the models, convert them to ggml and optimize them is with the --all-in-one command, which is included in the full Docker image.

docker run -v /llama/models:/models ghcr.io/ggerganov/llama.cpp:full --all-in-one "/models/" 7B

Once complete, you are ready to play!

docker run -v /llama/models:/models ghcr.io/ggerganov/llama.cpp:full --run -m /models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -t 8 -n 512

or with the light image:

docker run -v /llama/models:/models ghcr.io/ggerganov/llama.cpp:light -m /models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -t 8 -n 512

bernatvadell avatar Mar 17 '23 09:03 bernatvadell

@ggerganov can I do a squash?

Yes, almost always squash.

Sorry for slow responses - very busy week ..

ggerganov avatar Mar 17 '23 11:03 ggerganov