
Question regarding distributed computing...

snapo opened this issue 2 years ago · 7 comments

I currently have access to 20 old computers, each with 32 GB RAM, 4 cores, a 256 GB SSD, and a 1 Gbit network connection to a 48-port switch. (I could get a lot more computers, but I don't have enough electricity at the moment.) Would it somehow be possible to distribute the LLaMA model with llama.cpp across the 20 computers, so as to run the 65B model at a moderate speed? What would I have to do to distribute the model across many computers and run it on CPU? I am only interested in inference, not training... for training I can rent cloud GPUs.

Thanks for any input, recommendations, or warnings about problems.

What I see as a problem is how to split the model / models (in case I use other models) efficiently, so that network bandwidth isn't the limiting factor.

snapo avatar Apr 13 '23 14:04 snapo

Consider the discussion in this PR. They're discussing limiting even high-core-count CPUs to only 8 (or 4) threads, as more cores does not seem to correlate with better performance. I might be misunderstanding, but I think you need faster threads, not more of them.

Loufe avatar Apr 14 '23 02:04 Loufe

more cores does not seem to positively correlate with better performance

This is somewhat misleading. The issue in https://github.com/ggerganov/llama.cpp/pull/934 was about interference from hyperthreaded logical "cores" and efficiency cores (E-cores) on M1 and recent Intel chips (Alder Lake and newer).

What would i have to do to distribute the model on many computers to run it on cpu?

I think it's a better idea to stick to a single node. Distributed inference has a lot of overhead and is generally a bad idea unless you have an HPC setup. I would suggest sticking to a model that fits on a single node with 32 GB RAM (e.g. 30B, 4-bit quantized), and then load-balancing your requests across those nodes.
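To illustrate the idea (not any existing llama.cpp feature): each machine runs its own copy of llama.cpp on a model that fits in its RAM, and a small dispatcher simply decides which node handles the next prompt. The node addresses and the dispatch loop below are hypothetical, just a minimal round-robin sketch:

```cpp
// Hypothetical round-robin dispatcher: every node runs its own llama.cpp
// instance on a 30B 4-bit model; the dispatcher only picks the node that
// should handle the next incoming prompt.
#include <cstdio>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> nodes = {          // hypothetical node addresses
        "10.0.0.1", "10.0.0.2", "10.0.0.3",     // ... up to 20 machines
    };
    std::vector<std::string> prompts = {"prompt A", "prompt B", "prompt C"};

    size_t next = 0;
    for (const auto & p : prompts) {
        const std::string & node = nodes[next];
        next = (next + 1) % nodes.size();
        // a real setup would forward the prompt to the chosen node (e.g. over
        // TCP) and collect the generated tokens; here we only print the choice
        printf("dispatching \"%s\" to %s\n", p.c_str(), node.c_str());
    }
    return 0;
}
```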

jon-chuang avatar Apr 14 '23 04:04 jon-chuang

I understand the single-node inference... but wouldn't it be possible to distribute it across the 20 computers? I mean putting each layer (or group of layers) on a single computer that runs on 4 threads (because there are 4 cores). The connections between layers would only carry the transformer block outputs (even if it means upgrading the disks on all 20 computers so they each hold the full network).

[Transformer architecture diagram from Wikipedia]

What I mean is, for example: PC 1 computes the input embedding, the last PC computes the softmax output and decoding, and every PC in between runs one or more transformer blocks.

Network-wise (at least from my noobie understanding), only layer-to-layer transfers would happen this way, and those are very small (just the input and output of a transformer block).
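A rough back-of-the-envelope for that claim, assuming LLaMA-65B's embedding size of 8192 and fp16 activations, per token and per hop (ignoring the KV cache and batching):

```cpp
// Rough estimate of per-token, per-hop traffic in a layer-wise pipeline.
#include <cstdio>

int main() {
    const double hidden_size   = 8192;        // LLaMA-65B embedding dimension
    const double bytes_per_val = 2;           // fp16 activations
    const double link_bytes_s  = 1e9 / 8;     // 1 Gbit/s link ~ 125 MB/s

    const double bytes_per_hop = hidden_size * bytes_per_val;        // ~16 KB
    const double transfer_ms   = bytes_per_hop / link_bytes_s * 1e3;

    printf("per-token activation per hop: %.1f KB\n", bytes_per_hop / 1024);
    printf("transfer time at 1 Gbit/s:   ~%.3f ms\n", transfer_ms);
    return 0;
}
```

So each hop moves on the order of 16 KB per token (~0.13 ms on a 1 Gbit link), which means the compute on 4 cores, not the network, would dominate single-token latency.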

I understand there is no speedup for a single request, but if that works I could run thousands of requests in parallel (which speeds up total compute).

For the 65B model, for example, there should be roughly 10 trillion calculations required per token, so a single output token can only be produced as fast as those operations and the read speed of the disk allow.

But what the multi-computer system allows is building an API where we can let multiple "auto-gpt" instances run, or even distribute it like a SETI@home-style system where a huge number of requests can be processed in parallel.

Even assuming 1 token takes 5 seconds, if the 20 computers can process 5000 requests in parallel, that is an aggregate throughput of 1000 tokens/s, which is pretty fast. But each individual request then takes roughly 10 minutes to complete (e.g. ~120 tokens at 5 s/token).

Just my 2 cents on why I think this would be nice to have.

snapo avatar Apr 14 '23 05:04 snapo

There already exist many ways to distribute across tensors and operators. See e.g. https://alpa.ai/index.html

I believe this is out of scope for llama.cpp

jon-chuang avatar Apr 14 '23 05:04 jon-chuang

Thank you very much, I will check out alpa.ai and see if it fits my needs :-)

snapo avatar Apr 14 '23 06:04 snapo

From ggml's point of view, such distributed computing is completely possible. You simply have to partition the transformer the way you like and load the respective tensors on the respective nodes. You then create the partial compute graphs and should be ready to compute.
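A very rough sketch of what one pipeline stage could look like. The API names are from the ggml of that time and have changed since, and the single matmul stands in for the real per-layer ops (norm, attention, FFN, ...), so treat this as illustrative only:

```cpp
// Sketch of one pipeline stage: build a partial graph for only the layers
// this node owns, compute it, and hand the activations to the next node.
#include "ggml.h"

int main() {
    struct ggml_init_params params = {};
    params.mem_size = 256*1024*1024;     // scratch for this node's tensors

    struct ggml_context * ctx = ggml_init(params);

    const int n_embd = 8192;             // LLaMA-65B embedding size

    // activations received from the previous node, and this node's weights
    struct ggml_tensor * inp = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_embd);
    struct ggml_tensor * w   = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, n_embd);

    // partial compute graph: only this node's slice of the transformer
    struct ggml_tensor * out = ggml_mul_mat(ctx, w, inp);

    struct ggml_cgraph gf = ggml_build_forward(out);
    gf.n_threads = 4;                    // one thread per physical core

    // recv(inp) from the previous node over the network, then:
    ggml_graph_compute(ctx, &gf);
    // send(out) to the next node

    ggml_free(ctx);
    return 0;
}
```

Each node only ever holds the tensors for its own layers; everything else stays the same as in the single-node path.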

The main thing to solve is making the nodes communicate with each other - for example over the network. This is something that will likely never be part of ggml or even llama.cpp, since it would bring in third-party dependencies. So a distributed computing example would likely have to be demonstrated in a separate repository / fork.

Unless you find a very elegant way to pass and queue messages between the nodes that fits in a few hundred lines of C/C++ code. In that case, this could become a llama.cpp example and I think it would be of great interest, even if it only works on Linux, for example.

ggerganov avatar Apr 14 '23 06:04 ggerganov

Unless you find a very elegant way to pass and queue messages between the nodes that fits in a few hundred lines of C/C++ code. In that case, this could become a llama.cpp example and I think it would be of great interest, even if it only works on Linux, for example.

If you accept MPI as a dependency, this is actually very possible.

The test should be written using multiple processes to simulate multiple nodes.
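For illustration, a minimal sketch of that kind of message passing with MPI - each rank receives the activation vector from the previous rank, runs its layers, and forwards the result. Nothing here is llama.cpp-specific:

```cpp
// Minimal MPI pipeline sketch: rank i receives activations from rank i-1,
// "computes" its layers, and forwards them to rank i+1.
// Build/run: mpicxx pipeline.cpp && mpirun -np 4 ./a.out
// (4 processes on one machine simulate 4 nodes, as suggested above)
#include <mpi.h>
#include <vector>

int main(int argc, char ** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_embd = 8192;                 // activation size per token
    std::vector<float> act(n_embd, 0.0f);

    if (rank > 0) {
        // receive activations from the previous stage
        MPI_Recv(act.data(), n_embd, MPI_FLOAT, rank - 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    // ... run this node's transformer layers on `act` here ...

    if (rank < size - 1) {
        // forward activations to the next stage
        MPI_Send(act.data(), n_embd, MPI_FLOAT, rank + 1, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```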

jon-chuang avatar Apr 14 '23 06:04 jon-chuang

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Apr 11 '24 01:04 github-actions[bot]