mistral.rs
Distributed inference and tensor parallelism plans
With the recent advent of very large models (take Llama 3.1 405B, for example!), distributed inference support is a must! We currently support naive device mapping, which splits the model across a combination of CPU and multiple GPUs and copies activations to whichever device holds the next layers. Even when mapping only across GPUs, this is slower than ideal because every device boundary incurs (see the sketch after this list):
- a GPU <> CPU synchronization for the data transfer
- a GPU <> GPU copy
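To make the copy cost concrete, here is a rough candle-style sketch of what happens at a device boundary under naive device mapping. The two-layer "model", shapes, and device ordinals are made up for illustration; this is not the actual mistral.rs code path.

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    // Two "layers", each pinned to a different GPU (ordinals are illustrative).
    let gpu0 = Device::new_cuda(0)?;
    let gpu1 = Device::new_cuda(1)?;
    let w0 = Tensor::randn(0f32, 1f32, (4096, 4096), &gpu0)?;
    let w1 = Tensor::randn(0f32, 1f32, (4096, 4096), &gpu1)?;

    // The activation starts on GPU 0 and runs through the first layer there.
    let x = Tensor::randn(0f32, 1f32, (1, 4096), &gpu0)?;
    let h = x.matmul(&w0)?;

    // Naive device mapping: before the next layer can run, the activation must
    // be copied to that layer's device. This is where the synchronization and
    // the device-to-device (or device-to-host-to-device) transfer happen.
    let h = h.to_device(&gpu1)?;
    let y = h.matmul(&w1)?;

    println!("output shape: {:?}", y.shape());
    Ok(())
}
```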
A method to alleviate this is tensor parallelism (see 1, 2), where we split the weights of each layer across different processors so they can compute their portions in parallel.
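As a toy illustration of the idea (not the planned NCCL implementation), here is a CPU-only candle sketch of a column-parallel linear layer: the weight matrix is split along its output dimension, each shard computes a partial result, and gathering the partials reproduces the full output. On real hardware each shard would live on its own GPU and the gather would be an NCCL all-gather; the shapes here are arbitrary.

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu; // in practice, each shard lives on its own GPU
    let (d_in, d_out) = (8, 6);

    let x = Tensor::randn(0f32, 1f32, (1, d_in), &dev)?;
    let w = Tensor::randn(0f32, 1f32, (d_in, d_out), &dev)?;

    // Column parallelism: split the weight along the output dimension.
    let w_shard0 = w.narrow(1, 0, d_out / 2)?;
    let w_shard1 = w.narrow(1, d_out / 2, d_out / 2)?;

    // Each "processor" computes a partial output from its own shard.
    let y0 = x.matmul(&w_shard0)?;
    let y1 = x.matmul(&w_shard1)?;

    // Concatenating the partials (an all-gather, in the NCCL case) matches the
    // unsharded result.
    let y_parallel = Tensor::cat(&[&y0, &y1], 1)?;
    let y_full = x.matmul(&w)?;

    let diff = (y_parallel - y_full)?.abs()?.sum_all()?.to_scalar::<f32>()?;
    println!("total abs difference vs. full matmul: {diff}");
    Ok(())
}
```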
Tagging relevant people, please feel free to give comments! @guoqingbao @oldgithubman @ilookee @ghchris2021
Steps for distributed inference and tensor parallelism
- (quick & easy) Convert the device mapping system to a model topology system (a rough sketch of the idea follows this list)
- NCCL tensor parallelism (#617)
- Network distributed inference (#542), which will be similar to the current naive device mapping
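For the first item, here is a hypothetical sketch of what a per-layer topology description could look like. None of these types exist in mistral.rs; it only illustrates replacing a per-device layer count with an explicit plan of which device hosts which layer range.

```rust
// Hypothetical types for illustration only; not part of mistral.rs.
#[derive(Debug, Clone)]
enum Placement {
    Cpu,
    Cuda(usize), // GPU ordinal
}

#[derive(Debug, Clone)]
struct LayerRange {
    start: usize, // inclusive
    end: usize,   // exclusive
    placement: Placement,
}

#[derive(Debug)]
struct ModelTopology {
    ranges: Vec<LayerRange>,
}

impl ModelTopology {
    /// Look up which device should host a given layer index.
    fn placement_for(&self, layer: usize) -> Option<&Placement> {
        self.ranges
            .iter()
            .find(|r| r.start <= layer && layer < r.end)
            .map(|r| &r.placement)
    }
}

fn main() {
    // Example: layers 0..16 on GPU 0, 16..32 on GPU 1, the remainder on CPU.
    let topo = ModelTopology {
        ranges: vec![
            LayerRange { start: 0, end: 16, placement: Placement::Cuda(0) },
            LayerRange { start: 16, end: 32, placement: Placement::Cuda(1) },
            LayerRange { start: 32, end: 40, placement: Placement::Cpu },
        ],
    };
    println!("layer 20 -> {:?}", topo.placement_for(20));
}
```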