mistral.rs
Distributed inference and tensor parallelism plans
With the recent advent of very large models (take Llama 3.1 405B, for example!), distributed inference support is a must! We currently support naive device mapping, which splits the model across a combination of CPU and multiple GPUs and copies activations to whichever device holds the next layers. Even when mapping only across GPUs, this is slower than ideal because every device boundary incurs (see the sketch after this list):
- a GPU <> CPU synchronization for the data transfer
- a GPU <> GPU copy
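To make the copy cost concrete, here is a rough candle-style sketch of what happens at a device boundary under naive device mapping. The two-layer "model", shapes, and device ordinals are made up for illustration; this is not the actual mistral.rs code path.

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    // Two "layers", each pinned to a different GPU (ordinals are illustrative).
    let gpu0 = Device::new_cuda(0)?;
    let gpu1 = Device::new_cuda(1)?;
    let w0 = Tensor::randn(0f32, 1f32, (4096, 4096), &gpu0)?;
    let w1 = Tensor::randn(0f32, 1f32, (4096, 4096), &gpu1)?;

    // The activation starts on GPU 0 and runs through the first layer there.
    let x = Tensor::randn(0f32, 1f32, (1, 4096), &gpu0)?;
    let h = x.matmul(&w0)?;

    // Naive device mapping: before the next layer can run, the activation must
    // be copied to that layer's device. This is where the synchronization and
    // the device-to-device (or device-to-host-to-device) transfer happen.
    let h = h.to_device(&gpu1)?;
    let y = h.matmul(&w1)?;

    println!("output shape: {:?}", y.shape());
    Ok(())
}
```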
A method to alleviate this is tensor parallelism (see 1, 2), where we split the weights of each layer across different processors so they can compute their portions in parallel.
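As a toy illustration of the idea (not the planned NCCL implementation), here is a CPU-only candle sketch of a column-parallel linear layer: the weight matrix is split along its output dimension, each shard computes a partial result, and gathering the partials reproduces the full output. On real hardware each shard would live on its own GPU and the gather would be an NCCL all-gather; the shapes here are arbitrary.

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu; // in practice, each shard lives on its own GPU
    let (d_in, d_out) = (8, 6);

    let x = Tensor::randn(0f32, 1f32, (1, d_in), &dev)?;
    let w = Tensor::randn(0f32, 1f32, (d_in, d_out), &dev)?;

    // Column parallelism: split the weight along the output dimension.
    let w_shard0 = w.narrow(1, 0, d_out / 2)?;
    let w_shard1 = w.narrow(1, d_out / 2, d_out / 2)?;

    // Each "processor" computes a partial output from its own shard.
    let y0 = x.matmul(&w_shard0)?;
    let y1 = x.matmul(&w_shard1)?;

    // Concatenating the partials (an all-gather, in the NCCL case) matches the
    // unsharded result.
    let y_parallel = Tensor::cat(&[&y0, &y1], 1)?;
    let y_full = x.matmul(&w)?;

    let diff = (y_parallel - y_full)?.abs()?.sum_all()?.to_scalar::<f32>()?;
    println!("total abs difference vs. full matmul: {diff}");
    Ok(())
}
```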
Tagging relevant people, please feel free to give comments! @guoqingbao @oldgithubman @ilookee @ghchris2021
Steps for distributed inference and tensor parallelism
- (quick & easy) Convert the device mapping system to a model topology system (a rough sketch of the idea follows this list)
- NCCL tensor parallelism (#617)
- Network distributed inference (#542), which will be similar to the current naive device mapping
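For the first item, here is a hypothetical sketch of what a per-layer topology description could look like. None of these types exist in mistral.rs; it only illustrates replacing a per-device layer count with an explicit plan of which device hosts which layer range.

```rust
// Hypothetical types for illustration only; not part of mistral.rs.
#[derive(Debug, Clone)]
enum Placement {
    Cpu,
    Cuda(usize), // GPU ordinal
}

#[derive(Debug, Clone)]
struct LayerRange {
    start: usize, // inclusive
    end: usize,   // exclusive
    placement: Placement,
}

#[derive(Debug)]
struct ModelTopology {
    ranges: Vec<LayerRange>,
}

impl ModelTopology {
    /// Look up which device should host a given layer index.
    fn placement_for(&self, layer: usize) -> Option<&Placement> {
        self.ranges
            .iter()
            .find(|r| r.start <= layer && layer < r.end)
            .map(|r| &r.placement)
    }
}

fn main() {
    // Example: layers 0..16 on GPU 0, 16..32 on GPU 1, the remainder on CPU.
    let topo = ModelTopology {
        ranges: vec![
            LayerRange { start: 0, end: 16, placement: Placement::Cuda(0) },
            LayerRange { start: 16, end: 32, placement: Placement::Cuda(1) },
            LayerRange { start: 32, end: 40, placement: Placement::Cpu },
        ],
    };
    println!("layer 20 -> {:?}", topo.placement_for(20));
}
```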