
mpi : attempt inference of 65B LLaMA on a cluster of Raspberry Pis

Open ggerganov opened this issue 2 years ago • 54 comments

Now that distributed inference is supported thanks to the work of @evanmiller in #2099 it would be fun to try to utilize it for something cool. One such idea is to connect a bunch of Raspberry Pis in a local network and run the inference using MPI:

# sample cluster of 8 devices (replace with actual IP addresses of the devices)
$ cat ./hostfile
192.168.0.1:1
192.168.0.2:1
192.168.0.3:1
192.168.0.4:1
192.168.0.5:1
192.168.0.6:1
192.168.0.7:1
192.168.0.8:1

# build with MPI support
$ make CC=mpicc CXX=mpicxx LLAMA_MPI=1 -j

# run distributed inference over 8 nodes
$ mpirun -hostfile ./hostfile -n 8 ./main -m /mnt/models/65B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 64

Here we assume that the 65B model data is located on a network share in /mnt and that mmap works over a network share. Not sure if that is the case - if not, then it would be more difficult to perform this experiment.
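
For reference, a minimal sketch of how such a share could be mounted on each device, assuming a hypothetical NFS server at 192.168.0.100 exporting /srv/models (adjust the server address, export path and mount options to your setup):

# on each device (hypothetical NFS server and export path)
$ sudo apt install -y nfs-common
$ sudo mkdir -p /mnt/models
$ sudo mount -t nfs 192.168.0.100:/srv/models /mnt/models

# sanity check: every node should see the same model file
$ ls -lh /mnt/models/65B/ggml-model-q4_0.bin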

Looking for people with access to the necessary hardware to perform this experiment

ggerganov avatar Jul 10 '23 16:07 ggerganov

I wonder if this capability could be integrated with https://github.com/pry0cc/axiom. Axiom lets you spin up multiple cloud instances in minutes and run distributed scripts/loads on them. It also lets you terminate the instances really quickly, so you can use them for just a couple of minutes if needed; this also keeps the bill low.

EdwardDali avatar Jul 10 '23 17:07 EdwardDali

I could try simulating it: 8 VMs with 8GB of RAM each.

USBhost avatar Jul 10 '23 22:07 USBhost

Cool idea. I have some more powerful embedded devices (RISC-V CPUs) that could be integrated into a cluster. I'm looking forward to this experiment and am willing to deploy it on a RISC-V cluster.

hchenphd avatar Jul 11 '23 02:07 hchenphd

Ordered the parts today; they should be here by this time tomorrow (6 x Raspberry Pi 4, 8GB variants). In the meantime, I'm setting up a local MPI cluster on VMs to test the inference. Pick up your Pi(s) now, a shortage is coming!
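
For anyone following along, the per-node prep is roughly the following (a sketch assuming Raspberry Pi OS / Debian with OpenMPI; the username and IP below are placeholders):

# on every node: OpenMPI runtime, headers and build tools
$ sudo apt install -y build-essential openmpi-bin libopenmpi-dev

# on the head node: passwordless SSH to each worker, which mpirun uses to launch processes
$ ssh-keygen -t ed25519
$ ssh-copy-id pi@192.168.0.61   # repeat for every worker in the hostfile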

theycallmeloki avatar Jul 11 '23 08:07 theycallmeloki

and that mmap works over a network share.

I had issues with this in the past (a couple of weeks ago). It works at first, I can read the full file no problem, but it suddenly stops and kills the program.

Green-Sky avatar Jul 11 '23 08:07 Green-Sky

and that mmap works over a network share.

I had issues with this in the past (a couple of weeks ago). It works at first, I can read the full file no problem, but it suddenly stops and kills the program.

NFS or CIFS?

USBhost avatar Jul 11 '23 14:07 USBhost

NFS or CIFS?

CIFS, between 2 Ubuntu machines

Green-Sky avatar Jul 11 '23 15:07 Green-Sky

@theycallmeloki Hope I didn't set the expectations too high - even if this runs, the performance is expected to be really terrible. Likely a few (tens of) seconds per token for 65B. It's mostly a fun experiment - I don't think it would have any practical use.

ggerganov avatar Jul 11 '23 16:07 ggerganov

@theycallmeloki Hope I didn't set the expectations too high - even if this runs, the performance is expected to be really terrible. Likely a few (tens of) seconds per token for 65B. It's mostly a fun experiment - I don't think it would have any practical use.

The moment you said Raspberry Pi I knew we were on the meme train.

USBhost avatar Jul 11 '23 19:07 USBhost

Well, the same experiment can be done with a bunch of gaming GPUs (e.g. 8x 6GB or 6x 8GB) and it would make more sense in terms of performance. But running a 65B model on an RPi cluster sounds like a more historic achievement 😄

ggerganov avatar Jul 11 '23 19:07 ggerganov

@ggerganov Nope, not at all. I was going through the discussions and realized there is some room to add value around the inference pipelines. I can also imagine that varying the size of the virtual nodes in the Pi cluster and tweaking the partitioning of the model could lead to better tokens/second, and this setup costs roughly an order of magnitude less than any other off-the-shelf self-hostable setup for running a 65B model (I'm looking at the M2 Ultra Studios), so even if it's slow I think it's likely to be worth it for novelty reasons alone 😄

theycallmeloki avatar Jul 11 '23 20:07 theycallmeloki

@theycallmeloki Hope I didn't set the expectations too high - even if this runs, the performance is expected to be really terrible. Likely a few (tens of) seconds per token for 65B. It's mostly a fun experiment - I don't think it would have any practical use.

The moment you said Raspberry Pi I knew we were on the meme train.

FWIW a certain popular but very early report of ~10s/token was a bit exaggerated - the actual number right now is closer to 1.6s/token on a Pi 4B 8GB for a q6_k quantized 7B model, which just barely fits alongside the OS and GUI. Even the Pi is memory-bandwidth-bound in this use case, and -t 4 is actually a bit slower than -t 3. The board has somewhere around 4GB/s of memory bandwidth. Running headless might also speed things up a bit given the architecture.

@ggerganov Nope, not at all. I was going through the discussions and realized there is some room to add value around the inference pipelines. I can also imagine that varying the size of the virtual nodes in the Pi cluster and tweaking the partitioning of the model could lead to better tokens/second, and this setup costs roughly an order of magnitude less than any other off-the-shelf self-hostable setup for running a 65B model (I'm looking at the M2 Ultra Studios), so even if it's slow I think it's likely to be worth it for novelty reasons alone 😄

The actual cheapest right now might be a (used) Ryzen 5800X/5700G, the corresponding motherboard, peripherals, and 64GB of the fastest DDR4 RAM you can find in matched 32GB modules. The latter has become quite cheap after the DDR5 rollout and can be had for the price of some three to four Pi 4 8GB boards.

But no, that is not nearly as interesting!

JWNoctis avatar Jul 12 '23 02:07 JWNoctis

I believe I might have gotten the local environment up and running on the Pis (I ran a hello-world example first to confirm that MPI itself was running smoothly).

Moved the 65B model to each Pi using scp (256GB microSD cards, so I'm hoping I do not need to mmap over a network share), compiled the binary using the MPI instructions, and copied ./main to all the Pis. Running with:

mpirun -hostfile ./hostfile -n 6 ./main -m /home/laneone/ggml-LLaMa-65B-q4_0.bin -p "I believe the meaning of life is" -n 128

I'm unable to determine how to split this model into smaller chunks so that it can fit on the individual Pis. Right now I reckon it's trying to load the entire model on each Pi, which is probably why it is failing. Logs below:

main: build = 826 (975221e)
main: seed  = 1689230379
main: build = 826 (975221e)
main: seed  = 1689230379
main: build = 826 (975221e)
main: seed  = 1689230379
main: build = 826 (975221e)
main: seed  = 1689230379
main: build = 826 (975221e)
main: seed  = 1689230379
main: build = 826 (975221e)
main: seed  = 1689230379
llama.cpp: loading model from /home/laneone/ggml-LLaMa-65B-q4_0.bin
llama.cpp: loading model from /home/laneone/ggml-LLaMa-65B-q4_0.bin
llama.cpp: loading model from /home/laneone/ggml-LLaMa-65B-q4_0.bin
llama.cpp: loading model from /home/laneone/ggml-LLaMa-65B-q4_0.bin
llama.cpp: loading model from /home/laneone/ggml-LLaMa-65B-q4_0.bin
llama.cpp: loading model from /home/laneone/ggml-LLaMa-65B-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0.19 MB
llama_model_load_internal: ggml ctx size =    0.19 MB
llama_model_load_internal: mem required  = 38610.47 MB (+ 5120.00 MB per state)
llama_model_load_internal: mem required  = 38610.47 MB (+ 5120.00 MB per state)
error loading model: mmap failed: Cannot allocate memory
error loading model: mmap failed: Cannot allocate memory
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/home/laneone/ggml-LLaMa-65B-q4_0.bin'
main: error: unable to load model
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/home/laneone/ggml-LLaMa-65B-q4_0.bin'
main: error: unable to load model
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0.19 MB
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0.19 MB
llama_model_load_internal: ggml ctx size =    0.19 MB
llama_model_load_internal: mem required  = 38610.47 MB (+ 5120.00 MB per state)
llama_model_load_internal: mem required  = 38610.47 MB (+ 5120.00 MB per state)
llama_model_load_internal: mem required  = 38610.47 MB (+ 5120.00 MB per state)
error loading model: mmap failed: Cannot allocate memory
error loading model: mmap failed: Cannot allocate memory
error loading model: mmap failed: Cannot allocate memory
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0.19 MB
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/home/laneone/ggml-LLaMa-65B-q4_0.bin'
main: error: unable to load model
llama_model_load_internal: mem required  = 38610.47 MB (+ 5120.00 MB per state)
error loading model: mmap failed: Cannot allocate memory
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/home/laneone/ggml-LLaMa-65B-q4_0.bin'
main: error: unable to load model
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/home/laneone/ggml-LLaMa-65B-q4_0.bin'
main: error: unable to load model
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/home/laneone/ggml-LLaMa-65B-q4_0.bin'
main: error: unable to load model
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[7058,1],4]
  Exit code:    1
--------------------------------------------------------------------------

theycallmeloki avatar Jul 13 '23 06:07 theycallmeloki

Hm, my expectation is that mmap wouldn't require loading the entire model. Does it make a difference if you try this patch:

diff --git a/llama-util.h b/llama-util.h
index 43b6f05..1c0502f 100644
--- a/llama-util.h
+++ b/llama-util.h
@@ -177,10 +177,7 @@ struct llama_mmap {
         int fd = fileno(file->fp);
         int flags = MAP_PRIVATE;
         // prefetch/readahead impairs performance on NUMA systems
-        if (numa) { prefetch = 0; }
-#ifdef __linux__
-        if (prefetch) { flags |= MAP_POPULATE; }
-#endif
+        prefetch = 0;
         addr = mmap(NULL, file->size, PROT_READ | PROT_WRITE, flags, fd, 0);
         if (addr == MAP_FAILED) {
             throw std::runtime_error(format("mmap failed: %s", strerror(errno)));

I'm just poking around here - probably someone who understands better how this works can chime in. The backup plan is to split the model into parts and have each node load only its own part, but this would require some extra work to achieve.

ggerganov avatar Jul 13 '23 08:07 ggerganov

Yes, the patch was definitely helpful. Prior to this I was also trying out a 7B model in parallel, just to ensure the whole process was working; it used to fail with a similar error message to the one above, but after this patch the 7B model seems to be running on the Pis:

mpirun -hostfile ./hostfile -n 6 ./main -m /home/laneone/vicuna-7b-1.1.ggmlv3.q6_K.bin -p "I believe the meaning of life is" -n 128
main: build = 826 (975221e)
main: seed  = 1689239657
main: build = 826 (975221e)
main: seed  = 1689239657
main: build = 826 (975221e)
main: seed  = 1689239657
main: build = 826 (975221e)
main: seed  = 1689239657
main: build = 826 (975221e)
main: seed  = 1689239657
main: build = 826 (975221e)
main: seed  = 1689239657
llama.cpp: loading model from /home/laneone/vicuna-7b-1.1.ggmlv3.q6_K.bin
llama.cpp: loading model from /home/laneone/vicuna-7b-1.1.ggmlv3.q6_K.bin
llama.cpp: loading model from /home/laneone/vicuna-7b-1.1.ggmlv3.q6_K.bin
llama.cpp: loading model from /home/laneone/vicuna-7b-1.1.ggmlv3.q6_K.bin
llama.cpp: loading model from /home/laneone/vicuna-7b-1.1.ggmlv3.q6_K.bin
llama.cpp: loading model from /home/laneone/vicuna-7b-1.1.ggmlv3.q6_K.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 18 (mostly Q6_K)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 18 (mostly Q6_K)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: mem required  = 7064.42 MB (+ 1026.00 MB per state)
llama_model_load_internal: mem required  = 7064.42 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size  =  256.00 MB
llama_new_context_with_model: kv self size  =  256.00 MB
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 18 (mostly Q6_K)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 18 (mostly Q6_K)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: mem required  = 7064.42 MB (+ 1026.00 MB per state)
llama_model_load_internal: format     = ggjt v3 (latest)
llama_new_context_with_model: kv self size  =  256.00 MB
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 18 (mostly Q6_K)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: mem required  = 7064.42 MB (+ 1026.00 MB per state)
llama_model_load_internal: mem required  = 7064.42 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size  =  256.00 MB
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 18 (mostly Q6_K)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_new_context_with_model: kv self size  =  256.00 MB
llama_model_load_internal: mem required  = 7064.42 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size  =  256.00 MB

system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0


 I believe the meaning of life is to find out who you are and what you want to do in this world. And then use your talents, strengths

However, when I load the 65B model, a similar error is thrown to the one from before the patch was in place (I'm not sure about this, tbh):

mpirun -hostfile ./hostfile -n 6 ./main -m /home/laneone/ggml-LLaMa-65B-q4_0.bin -p "I believe the meaning of life is milady because" -n 128
main: build = 826 (975221e)
main: seed  = 1689240041
main: build = 826 (975221e)
main: seed  = 1689240041
main: build = 826 (975221e)
main: seed  = 1689240041
main: build = 826 (975221e)
main: seed  = 1689240041
main: build = 826 (975221e)
main: seed  = 1689240041
main: build = 826 (975221e)
main: seed  = 1689240041
llama.cpp: loading model from /home/laneone/ggml-LLaMa-65B-q4_0.bin
llama.cpp: loading model from /home/laneone/ggml-LLaMa-65B-q4_0.bin
llama.cpp: loading model from /home/laneone/ggml-LLaMa-65B-q4_0.bin
llama.cpp: loading model from /home/laneone/ggml-LLaMa-65B-q4_0.bin
llama.cpp: loading model from /home/laneone/ggml-LLaMa-65B-q4_0.bin
llama.cpp: loading model from /home/laneone/ggml-LLaMa-65B-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0.19 MB
llama_model_load_internal: ggml ctx size =    0.19 MB
llama_model_load_internal: mem required  = 38610.47 MB (+ 5120.00 MB per state)
llama_model_load_internal: mem required  = 38610.47 MB (+ 5120.00 MB per state)
error loading model: mmap failed: Cannot allocate memory
error loading model: mmap failed: Cannot allocate memory
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0.19 MB
llama_model_load_internal: ggml ctx size =    0.19 MB
llama_model_load_internal: mem required  = 38610.47 MB (+ 5120.00 MB per state)
llama_model_load_internal: mem required  = 38610.47 MB (+ 5120.00 MB per state)
error loading model: mmap failed: Cannot allocate memory
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0.19 MB
error loading model: mmap failed: Cannot allocate memory
llama_model_load_internal: mem required  = 38610.47 MB (+ 5120.00 MB per state)
error loading model: mmap failed: Cannot allocate memory
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/home/laneone/ggml-LLaMa-65B-q4_0.bin'
main: error: unable to load model
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/home/laneone/ggml-LLaMa-65B-q4_0.bin'
main: error: unable to load model
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0.19 MB
llama_model_load_internal: mem required  = 38610.47 MB (+ 5120.00 MB per state)
error loading model: mmap failed: Cannot allocate memory
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/home/laneone/ggml-LLaMa-65B-q4_0.bin'
main: error: unable to load model
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/home/laneone/ggml-LLaMa-65B-q4_0.bin'
main: error: unable to load model
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[6572,1],4]
Exit code:    1
--------------------------------------------------------------------------

I would like to work on splitting the model so that each node loads just the weights specific to its part. Please give me some rough pointers on how I can approach this. I understand there are ckpt-to-diffusers and diffusers-to-ckpt converters; I could probably patch one side to split the model prior to writing it to disk, so that running both processes would end up with split files that can be loaded by each Pi individually.

On a parallel track, do you think adding swap storage equivalent to the model size would help? I imagine letting it map the entire model onto the microSD card and page in only the specific portions it needs for its inference logic might help in this regard.
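
What I have in mind is something like the following on each Pi (a sketch of the generic Linux swap-file approach; 50G is just the size I'm considering, adjust to your card):

$ sudo fallocate -l 50G /swapfile
$ sudo chmod 600 /swapfile
$ sudo mkswap /swapfile
$ sudo swapon /swapfile

# verify swap is active
$ free -h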

theycallmeloki avatar Jul 13 '23 09:07 theycallmeloki

Update: the swap seems to have done the trick. I added a 50GB swap file to each Pi and was able to run the 65B model using:

mpirun -hostfile ./hostfile -n 6 ./main -m /home/laneone/ggml-LLaMa-65B-q4_0.bin -p "I believe the meaning of life is milady because" -n 128

system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0

llama_model_load_internal: mem required  = 38610.47 MB (+ 5120.00 MB per state)
llama_new_context_with_model: kv self size  = 1280.00 MB
 I believe the meaning of life is milady because she is

I think swap was likely not the correct approach: I fear the benefit of using MPI to load only into RAM was that inference speed could be higher. Now that the entire model can be brought up on one Pi, technically I could run 6 conversations in parallel, one on each Pi, and they'd be just as slow as right now (about one token every 10-12 minutes). So I'm leaning towards the split-model approach being better and will try along those lines. (Also, I can't run k8s on this if I enable swap, since etcd stashes into swap and messy things happen, and I would like to be able to do that down the line.)

theycallmeloki avatar Jul 13 '23 17:07 theycallmeloki

@theycallmeloki try setting vm.overcommit_memory to 1 https://www.kernel.org/doc/html/v5.1/vm/overcommit-accounting.html
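
For example (the first command applies it immediately; the second persists it across reboots):

$ sudo sysctl -w vm.overcommit_memory=1

# persist across reboots
$ echo 'vm.overcommit_memory = 1' | sudo tee -a /etc/sysctl.conf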

USBhost avatar Jul 13 '23 18:07 USBhost

@theycallmeloki

If everything works as expected, it won't swap. Something is still not ok and I think it is swapping due to the large KV cache. Let's try to limit it by reducing the context length by x8:

  • Apply the following patch:
diff --git a/llama.cpp b/llama.cpp
index 2d09d6c..3ebdca4 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -129,11 +129,11 @@ static const std::map<e_model, size_t> & MEM_REQ_SCRATCH1()
 static const std::map<e_model, size_t> & MEM_REQ_KV_SELF()
 {
     static std::map<e_model, size_t> k_sizes = {
-        { MODEL_3B,    682ull * MB },
-        { MODEL_7B,   1026ull * MB },
-        { MODEL_13B,  1608ull * MB },
-        { MODEL_30B,  3124ull * MB },
-        { MODEL_65B,  5120ull * MB },
+        { MODEL_3B,    682ull * MB / 8 },
+        { MODEL_7B,   1026ull * MB / 8 },
+        { MODEL_13B,  1608ull * MB / 8 },
+        { MODEL_30B,  3124ull * MB / 8 },
+        { MODEL_65B,  5120ull * MB / 8 },
     };
     return k_sizes;
 }
  • Run the same MPI command as before, adding the argument -c 256 to limit the context length to 256 tokens (see the example below).
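
For example, reusing the model path and hostfile from your earlier run:

$ mpirun -hostfile ./hostfile -n 6 ./main -m /home/laneone/ggml-LLaMa-65B-q4_0.bin -p "I believe the meaning of life is milady because" -n 128 -c 256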

Btw, how much RAM does each RPi have?

ggerganov avatar Jul 14 '23 18:07 ggerganov

Looks like it's time to set up a few virtual machines to test this out.

USBhost avatar Jul 15 '23 03:07 USBhost

@ggerganov @USBhost

I have now disabled the 50GB swap file on the Pis. Each Pi has 8GB RAM (about 300MB of which goes to the headless bootup). I applied the patch and also set vm.overcommit_memory = 1. The model now seems to be able to boot up even without the swap storage. I believe we are now able to generate tokens on a cluster of RPis! 65B model :grinning:

mpirun -hostfile ./hostfile -n 6 ./main -m /home/laneone/ggml-LLaMa-65B-q4_0.bin -p "I believe the meaning of life is milady because" -n 128 -c 256
main: build = 826 (975221e)
main: seed  = 1689406303
main: build = 826 (975221e)
main: seed  = 1689406303
main: build = 826 (975221e)
main: seed  = 1689406303
main: build = 826 (975221e)
main: seed  = 1689406303
main: build = 826 (975221e)
main: seed  = 1689406303
main: build = 826 (975221e)
main: seed  = 1689406303
llama.cpp: loading model from /home/laneone/ggml-LLaMa-65B-q4_0.bin
llama.cpp: loading model from /home/laneone/ggml-LLaMa-65B-q4_0.bin
llama.cpp: loading model from /home/laneone/ggml-LLaMa-65B-q4_0.bin
llama.cpp: loading model from /home/laneone/ggml-LLaMa-65B-q4_0.bin
llama.cpp: loading model from /home/laneone/ggml-LLaMa-65B-q4_0.bin
llama.cpp: loading model from /home/laneone/ggml-LLaMa-65B-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 256
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0.19 MB
llama_model_load_internal: mem required  = 38610.47 MB (+  640.00 MB per state)
llama_new_context_with_model: kv self size  =  640.00 MB
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 256
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0.19 MB
llama_model_load_internal: mem required  = 38610.47 MB (+  640.00 MB per state)
llama_new_context_with_model: kv self size  =  640.00 MB
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 256
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0.19 MB
llama_model_load_internal: mem required  = 38610.47 MB (+  640.00 MB per state)
llama_new_context_with_model: kv self size  =  640.00 MB
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 256
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 256
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: n_head     = 64
llama_model_load_internal: ggml ctx size =    0.19 MB
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0.19 MB
llama_model_load_internal: mem required  = 38610.47 MB (+  640.00 MB per state)
llama_model_load_internal: mem required  = 38610.47 MB (+  640.00 MB per state)
llama_new_context_with_model: kv self size  =  640.00 MB

llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 256
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0.19 MB
llama_new_context_with_model: kv self size  =  640.00 MB
llama_model_load_internal: mem required  = 38610.47 MB (+  640.00 MB per state)
llama_new_context_with_model: kv self size  =  640.00 MB
system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 256, n_batch = 512, n_predict = 128, n_keep = 0

I believe the meaning of life is milady because you have to

One concern is that not all the Pis seem to be involved in the mpirun, as only 2 of the 6 show spikes in compute, as shown here

Screenshot from 2023-07-15 13-12-00 (per-node CPU usage; only 2 of the 6 Pis show compute spikes)

theycallmeloki avatar Jul 15 '23 07:07 theycallmeloki

@theycallmeloki

I believe we are now able to generate tokens on a cluster of RPis! 65B model 😀

Let's goooo !! 😄

One concern is that not all the Pis seem to be involved in the mpirun, as only 2 of the 6 show spikes in compute, as shown here

Yes, this is the expected behaviour. The MPI implementation currently just supports pipeline parallelisation, so each node processes part of the pipeline (i.e. a few layers of the graph) and passes the results to the next node. This allows each node to "see" only a part of the model, and thus lets us distribute the model across the cluster.

This is in contrast to tensor parallelisation where all nodes can work in parallel on each part of the graph. However, this would require all nodes to see the entire model, which in our experiment is not viable since we cannot fit the entire model in the RPi RAM.

In any case, I consider the experiment already a success! Let us know what inference speed you observe on this setup.

Maybe check if vm.overcommit_memory = 1 is necessary or we can do the same without it.

Also make sure the generation makes sense -- I believe the meaning of life is milady because you have to does not sound very coherent, so there might still be some issue 😄

One more thing: maybe adding --mlock could have some positive effects too (not sure).

And another thing: update to latest master since there was a regression recently (32c54116318929c90fd7ae814cf9b5232cd44c36)

ggerganov avatar Jul 15 '23 09:07 ggerganov

Reading your comment again, I might have misunderstood. The nodes should continuously take turns to process the next part of the pipeline. If only the same 2 nodes always have spikes then there is definitely something wrong

ggerganov avatar Jul 15 '23 09:07 ggerganov

Can I freely parallelize the model further to ensure I am able to densely pack the inference? Is there a cutoff point at which it becomes more about constructing the DAG than actually computing the tokens? Asking as I am not too sure how MPI does things. For example:

When I run a hello world example with 6 processes, instead of all of the hosts responding, only 2 of them reply:

mpirun -v -hostfile hostfile -np 6 ./hello_world
Hello world from processor spartan1, rank 1 out of 6 processors
Hello world from processor spartan1, rank 2 out of 6 processors
Hello world from processor spartan1, rank 3 out of 6 processors
Hello world from processor spartan1, rank 0 out of 6 processors
Hello world from processor spartan2, rank 4 out of 6 processors
Hello world from processor spartan2, rank 5 out of 6 processors

Just to verify it wasn't an access issue: when I increase to 24 slots, all of the hosts respond:

mpirun -v -hostfile hostfile -np 24 ./hello_world
Hello world from processor spartan1, rank 1 out of 24 processors
Hello world from processor spartan1, rank 0 out of 24 processors
Hello world from processor spartan1, rank 2 out of 24 processors
Hello world from processor spartan1, rank 3 out of 24 processors
Hello world from processor spartan3, rank 8 out of 24 processors
Hello world from processor spartan3, rank 9 out of 24 processors
Hello world from processor spartan3, rank 10 out of 24 processors
Hello world from processor spartan4, rank 13 out of 24 processors
Hello world from processor spartan4, rank 14 out of 24 processors
Hello world from processor spartan5, rank 16 out of 24 processors
Hello world from processor spartan3, rank 11 out of 24 processors
Hello world from processor spartan2, rank 6 out of 24 processors
Hello world from processor spartan6, rank 20 out of 24 processors
Hello world from processor spartan2, rank 7 out of 24 processors
Hello world from processor spartan4, rank 15 out of 24 processors
Hello world from processor spartan5, rank 17 out of 24 processors
Hello world from processor spartan6, rank 21 out of 24 processors
Hello world from processor spartan2, rank 4 out of 24 processors
Hello world from processor spartan4, rank 12 out of 24 processors
Hello world from processor spartan5, rank 18 out of 24 processors
Hello world from processor spartan6, rank 23 out of 24 processors
Hello world from processor spartan2, rank 5 out of 24 processors
Hello world from processor spartan5, rank 19 out of 24 processors
Hello world from processor spartan6, rank 22 out of 24 processors

theycallmeloki avatar Jul 15 '23 09:07 theycallmeloki

Show the contents of hostfile

cc @evanmiller for advice

ggerganov avatar Jul 15 '23 09:07 ggerganov

This is what I am using for hostfile

192.168.0.60:1
192.168.0.61:1
192.168.0.62:1
192.168.0.63:1
192.168.0.64:1
192.168.0.65:1

theycallmeloki avatar Jul 15 '23 09:07 theycallmeloki

Try using the following:

192.168.0.60 slots=1
192.168.0.61 slots=1
192.168.0.62 slots=1
192.168.0.63 slots=1
192.168.0.64 slots=1
192.168.0.65 slots=1

ggerganov avatar Jul 15 '23 09:07 ggerganov

Pretty sure that helped, because now there are uniform spikes on at least 1 core of each Pi. I'll keep testing and try out the different parameters like memory overcommit as well.

theycallmeloki avatar Jul 15 '23 09:07 theycallmeloki

The mpirun -v -hostfile hostfile -np 6 ./hello_world command should return each node exactly once. If that is not the case, there is no point in trying llama.cpp.

ggerganov avatar Jul 15 '23 09:07 ggerganov

Yes, I think so too; I will try to debug what that might be related to. Just to confirm, this is what I had used for the hello world - unless there's already a binary I can use to test the MPI side of things, I am not sure how reproducible the setup is:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Total number of processes in the job
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Rank (ID) of this process within the job
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Name of the host this process is running on
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    MPI_Finalize();
    return 0;
}

theycallmeloki avatar Jul 15 '23 09:07 theycallmeloki

Looks OK, though I would add a 1 or 2 second sleep somewhere in main since I am not sure what MPI would do if the process ends very quickly. I guess it might launch the next one on the same node -- not sure

ggerganov avatar Jul 15 '23 09:07 ggerganov