
Distributed inference and tensor parallelism plans

Open · EricLBuehler opened this issue 6 months ago · 3 comments

With the recent advent of very large models (take Llama 3.1 405B, for example!), distributed inference support is a must! We currently support naive device mapping, which allows combining the CPU and multiple GPUs by copying activations to each layer's device. Even when the mapping is restricted to GPUs only, this is slower than ideal because it incurs the following (a minimal sketch of the per-layer copies follows the list below):

  1. GPU <> CPU synchronization for the data transfer
  2. GPU <> GPU copy
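
To make those costs concrete, here is a minimal sketch of what per-layer device mapping looks like, assuming candle's `Tensor::to_device` API (mistral.rs is built on candle); the two-device, 32-layer split is purely illustrative and is not the actual mapping code:

```rust
use candle_core::{DType, Device, Result, Tensor};

fn main() -> Result<()> {
    // Illustrative split: layers 0..16 on device 0, layers 16..32 on device 1
    // (falls back to CPU when CUDA is unavailable).
    let dev0 = Device::cuda_if_available(0)?;
    let dev1 = Device::cuda_if_available(1)?;

    let mut x = Tensor::zeros((1, 4096), DType::F32, &dev0)?;
    for layer_idx in 0..32 {
        let target = if layer_idx < 16 { &dev0 } else { &dev1 };
        // Copying activations to the layer's device is the source of the
        // GPU <> CPU synchronization and GPU <> GPU copies listed above.
        x = x.to_device(target)?;
        // ... run the layer's forward pass on `target` ...
    }
    println!("final activation shape: {:?}", x.dims());
    Ok(())
}
```

Every `to_device` call here is a copy of the full activation tensor across the device boundary, which is exactly the overhead items 1 and 2 describe.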

A method to alleviate this is tensor parallelism (see 1, 2), in which the weights are split across different processors.
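
As a single-process illustration of the idea, here is a minimal sketch of a column-parallel linear layer using candle: the weight matrix is sharded along its output dimension, each shard computes a partial result, and the shards are concatenated, which is the gather step a real multi-GPU implementation would perform collectively. The shapes and the two-way split are illustrative assumptions, not mistral.rs code:

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu;

    // Reference: y = x W^T with W: [out=8, in=4], x: [batch=1, in=4].
    let w = Tensor::randn(0f32, 1f32, (8, 4), &dev)?;
    let x = Tensor::randn(0f32, 1f32, (1, 4), &dev)?;
    let y_ref = x.matmul(&w.t()?)?;

    // Column-parallel: each of the 2 "ranks" owns half of W's output rows.
    let w_rank0 = w.narrow(0, 0, 4)?;
    let w_rank1 = w.narrow(0, 4, 4)?;
    let y_rank0 = x.matmul(&w_rank0.t()?)?;
    let y_rank1 = x.matmul(&w_rank1.t()?)?;

    // Gathering along the output dimension reconstructs the full result.
    let y_tp = Tensor::cat(&[&y_rank0, &y_rank1], 1)?;

    let err = (&y_ref - &y_tp)?.abs()?.sum_all()?.to_scalar::<f32>()?;
    println!("total abs difference: {err}");
    Ok(())
}
```

In a real multi-rank setup each shard would live on its own GPU, and the concatenation (or, for row-parallel layers, a summation) would be a collective communication step; that is where NCCL comes in (see the steps below).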

Tagging relevant people; please feel free to give comments! @guoqingbao @oldgithubman @ilookee @ghchris2021

Steps for distributed inference and tensor parallelism

  1. (quick & easy) Convert device mapping system to model topology system
  2. NCCL tensor parallelism (#617); see the sketch after this list
  3. Network distributed inference (#542), which will be similar to the current naive device mapping
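
To illustrate the communication pattern behind step 2 without depending on NCCL bindings, here is a minimal, std-only sketch that simulates two ranks with threads: each rank holds a shard of a row-parallel linear layer's input dimension and computes a partial output, and the elementwise sum at the end plays the role of the all-reduce that NCCL would perform across GPUs. All names and shapes are illustrative:

```rust
use std::thread;

fn main() {
    // Full problem: y = W x with W: [out=2, in=4], x: [in=4].
    let w = [[1.0f32, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]];
    let x = [1.0f32, 1.0, 2.0, 2.0];

    let world_size = 2;
    let shard = x.len() / world_size; // each "rank" owns `shard` input columns

    let mut handles = Vec::new();
    for rank in 0..world_size {
        let cols = rank * shard..(rank + 1) * shard;
        let w_shard: Vec<Vec<f32>> = w.iter().map(|row| row[cols.clone()].to_vec()).collect();
        let x_shard: Vec<f32> = x[cols].to_vec();
        handles.push(thread::spawn(move || {
            // Partial matvec over this rank's shard of the input dimension.
            w_shard
                .iter()
                .map(|row| row.iter().zip(&x_shard).map(|(a, b)| a * b).sum::<f32>())
                .collect::<Vec<f32>>()
        }));
    }

    // "All-reduce": sum the per-rank partial outputs elementwise.
    let mut y = vec![0.0f32; w.len()];
    for h in handles {
        for (acc, part) in y.iter_mut().zip(h.join().unwrap()) {
            *acc += part;
        }
    }
    println!("y = {:?}", y); // expect [17.0, 41.0]
}
```

With row-parallel shards, each layer needs only one all-reduce of a partial output instead of copying full activation tensors between devices, which is why NCCL-based tensor parallelism should scale better than naive device mapping on GPUs with a fast interconnect.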

More references

EricLBuehler · Aug 11 '24 00:08