distributed-llama
Tensor parallelism is all you need. Run LLMs on an AI cluster at home using any device. Distribute the workload, divide RAM usage, and increase inference speed.
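To make the tagline concrete, here is a minimal sketch of column-wise tensor parallelism — illustrative Python/NumPy only, not the repo's actual C++ implementation, and all dimensions are made up. Each worker stores just one vertical slice of a weight matrix (which divides RAM usage), and the slices are multiplied in parallel (which is where the speedup comes from):

```python
# Illustrative sketch, NOT distributed-llama's code: column-wise tensor
# parallelism for a single linear layer.
import numpy as np

n_workers = 4
d_in, d_out = 4096, 4096                    # hypothetical layer dimensions
x = np.random.randn(d_in).astype(np.float32)
W = np.random.randn(d_in, d_out).astype(np.float32)

# Split W into column slices; worker i would store only W_slices[i].
W_slices = np.split(W, n_workers, axis=1)

# Each worker computes its partial output independently...
partial = [x @ w for w in W_slices]

# ...and the root node concatenates the slices into the full activation.
y = np.concatenate(partial)
assert np.allclose(y, x @ W, atol=1e-3)     # matches the unsplit layer
```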
Hi @b4rtaz, I was tinkering a bit over the weekend and figured it might be possible to create a version of worker/main that accelerates inference by offloading some work...
@b4rtaz Hey, thank you for your wonderful work. Could you please offer some details about how to add a supported model? For example, how to split the network according to its structure...
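For readers puzzling over the same question, here is a rough sketch of what "splitting by structure" can mean: multi-head attention partitions naturally across nodes because the heads are independent. This is an assumption-laden illustration (hypothetical dimensions, NumPy rather than the repo's C++), not distributed-llama's actual slicing code:

```python
# Illustrative sketch only: with n_heads divisible by n_workers, worker i
# can own a contiguous block of heads, i.e. the matching columns of the
# query projection (and analogously of Wk/Wv, plus rows of Wo).
import numpy as np

n_heads, head_dim, n_workers = 32, 128, 4
dim = n_heads * head_dim
heads_per_worker = n_heads // n_workers     # must divide evenly

Wq = np.random.randn(dim, dim).astype(np.float32)

def worker_slice(W, i):
    # Columns belonging to worker i's heads.
    cols = slice(i * heads_per_worker * head_dim,
                 (i + 1) * heads_per_worker * head_dim)
    return W[:, cols]

slices = [worker_slice(Wq, i) for i in range(n_workers)]
assert sum(s.shape[1] for s in slices) == dim   # slices cover all heads
```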
I haven't updated the other model conversion scripts yet, but this allows you to convert any Llama model that uses safetensors.
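For context, a safetensors checkpoint can be enumerated tensor-by-tensor, which is the first step any converter needs. A minimal sketch using the `safetensors` Python package — the file name is a placeholder, and this is not the repo's convert script:

```python
# Requires: pip install safetensors numpy
from safetensors import safe_open

# Placeholder file name; shard names vary per checkpoint.
with safe_open("model-00001-of-00004.safetensors", framework="numpy") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)     # loads a single tensor on demand
        print(name, tensor.shape, tensor.dtype)
```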
Dear Author, your contribution is critical for the open-source community. The distributed-llama repo has implemented tensor parallelism from scratch, and the results are remarkable. However, there are still improvements...
This pull request introduces API functionality to the distributed-llama project. The main addition is the implementation of the chat completion endpoint, following the specification outlined by OpenAI for chat...
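Since the endpoint follows the OpenAI chat completion spec, any OpenAI-style client should be able to talk to it. A hedged usage sketch with the Python standard library — the host, port, path, and model id below are assumptions, so check the server's actual flags:

```python
import json
import urllib.request

payload = {
    "model": "llama3-8b",               # hypothetical model id
    "messages": [{"role": "user", "content": "Hello!"}],
}
req = urllib.request.Request(
    "http://127.0.0.1:9990/v1/chat/completions",   # assumed address/port
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    # Standard OpenAI-style response shape: choices[0].message.content
    print(json.load(resp)["choices"][0]["message"]["content"])
```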
The nodes connect, but crash after roughly 3 seconds. Server:
```
sudo main simple-server --weights-float-type q40 --buffer-float-type q40 --nthreads 4 --model ~/dllama_meta-llama-3-8b_q40.bin --tokenizer ~/dllama-llama3-tokenizer.t --workers 192.168.2.212:9998 192.168.2.213:9998 192.168.2.214:9998 192.168.2.215:9998 192.168.2.216:9998...
```
```
ubuntu@ubuntu:~/llama3/Meta-Llama-3-8B-Instruct$ python3 ../../distributed-llama/converter/convert-llama.py ./ q40
Model name:
Target float type: q40
Target file: dllama__q40.bin
Traceback (most recent call last):
  File "/home/ubuntu/llama3/Meta-Llama-3-8B-Instruct/../../distributed-llama/converter/convert-llama.py", line 119, in <module>
    convert(modelPath, outputFileName, targetFloatType)
  File "/home/ubuntu/llama3/Meta-Llama-3-8B-Instruct/../../distributed-llama/converter/convert-llama.py",...
```
Hi there! Amazing project, by the way; it has given me hope of being able to run really big models. Specifically, I'm very excited about the upcoming 400B Llama model...
```sh
# 1 worker + inference
make docker-1-worker-inference
# 3 workers + inference like this:
make docker-3-worker-inference WORKERS="172.18.0.2:9997 172.18.0.3:9997 172.18.0.4:9997"
```
My local test on Docker containers (use default checkpoint:...
Hi there, I'm busy converting Llama 3 70B to the distributed format, but I get the following output:
```
Target float type: q40
Target file: D:\Meta-Llama-3-70B-Instruct-Distributed\dllama_original_q40.bin
💿 Chunking model 1/16...
Unknown...
```