[DistEnv] Strategy for `ray` on multiple nodes, with sharding as option
@kartik4949 to add information, discussion points, diagrams, links.
There are multiple ways to achieve model parallelism for general torch models.
- DeepSpeed
- FSDP
These are the two most popular libraries that enable model parallelism. They are essentially model-parallelism algorithms with inter-GPU communication support:
DeepSpeed, for example, takes your model, partitions it into parts, and then manages the communication between those partitions across multiple GPUs. Source: https://www.deepspeed.ai/inference/
So if a user has a model, they can use DeepSpeed to shard it across multiple GPUs.
Let's take 2 scenarios:
- Scenario 1: one machine with 4 GPUs
- Scenario 2: two machines with 2 GPUs each (4 total)
From now on, we will be referring to the above scenario list!
If the user falls under scenario 1, using DeepSpeed directly will suffice to achieve model parallelism, though I'm not sure whether we will have a dashboard to view the process. Here DeepSpeed will create one process per GPU and partition the model for inference.
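A minimal sketch of what scenario 1 could look like, assuming a Hugging Face causal LM (the model name is purely illustrative, and the keyword for the tensor-parallel degree differs across DeepSpeed versions: `mp_size` in older releases, `tensor_parallel={"tp_size": ...}` in newer ones):

```python
# infer.py -- run as `deepspeed --num_gpus 4 infer.py`,
# so the launcher starts one process per GPU.
import os

import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # illustrative; any HF causal LM

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# DeepSpeed partitions the model across the 4 GPUs of the single machine
# and manages the inter-GPU communication for us.
engine = deepspeed.init_inference(
    model,
    mp_size=int(os.getenv("WORLD_SIZE", "4")),
    dtype=torch.half,
    replace_with_kernel_inject=True,
)

device = torch.device(f"cuda:{int(os.getenv('LOCAL_RANK', '0'))}")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
inputs = tokenizer("DeepSpeed inference test", return_tensors="pt").to(device)

with torch.no_grad():
    out = engine.module.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```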
Now comes scenario 2. If the user has 2 machines, sharding the model across multiple machines might not be the best option, since inter-node communication can become a bottleneck!
Moreover, inter-node model parallelism with DeepSpeed requires some manual setup, such as hostfile creation.
But Ray can handle the inter-node/machine communication very elegantly, so the idea is: what if we create a local, intra-machine GPU worker group with DeepSpeed, and create one such group on each node/machine?
Let's take a look at the diagram above.
The blue box is Ray, which takes a batch of input data and distributes it across the two machines. Each partition of the batch is handed to DeepSpeed, which holds a copy of the model sharded/distributed across the 2 GPUs of that machine. For example:
If the batch size is 16,
a sub-batch of 8 (partition 1) goes to machine 1, where the model is distributed/partitioned across the 2 GPUs with DeepSpeed.
The same happens on machine 2.
The results are computed, gathered back by Ray, and returned on the client node.
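To make the data flow concrete, here is a minimal sketch of the Ray side of this design. The per-machine DeepSpeed sharding is hidden behind a hypothetical `load_deepspeed_sharded_model` helper (not a real API; in practice DeepSpeed wants one process per GPU, so that helper would itself have to set up a local 2-process group). Everything else is the standard Ray actor API:

```python
import ray

ray.init(address="auto")  # connect to the Ray cluster spanning both machines


def load_deepspeed_sharded_model():
    """Hypothetical helper: load one copy of the model and let DeepSpeed
    shard it across the 2 GPUs of the machine this actor lands on."""
    raise NotImplementedError


@ray.remote(num_gpus=2)  # reserve both GPUs of one machine for this worker
class MachineWorker:
    def __init__(self):
        self.model = load_deepspeed_sharded_model()

    def predict(self, sub_batch):
        # run the locally sharded model on this machine's partition of the batch
        return self.model(sub_batch)


# one worker group per machine -> two workers for scenario 2
workers = [MachineWorker.remote() for _ in range(2)]

batch = [f"sample {i}" for i in range(16)]   # full batch of 16
partitions = [batch[:8], batch[8:]]          # Ray scatters 8 items to each machine
futures = [w.predict.remote(p) for w, p in zip(workers, partitions)]
results = ray.get(futures)                   # gathered back on the client node
```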
@kartik4949 great explanation!
Questions:
- How much support do we have for this scenario using vLLM?
- Is there any difference between FSDP and DeepSpeed?
vLLM has its own method to support tensor model parallelism, so it won't use what's described here.
But our support for basic transformers or torch models can use this approach.
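For reference, from the user's side vLLM exposes its tensor parallelism through a single constructor argument; a minimal sketch (the model name is illustrative):

```python
from vllm import LLM, SamplingParams

# vLLM shards the model across 4 local GPUs with its own tensor-parallel
# workers (it can also run these workers on top of Ray).
llm = LLM(model="facebook/opt-13b", tensor_parallel_size=4)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```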
Great! If we implement this, we'll be able to completely offload the model computation layer to a Ray cluster. Also, I suggest completing this in conjunction with https://github.com/SuperDuperDB/superduperdb/discussions/1604.
Otherwise, the model will still load onto the local machine, which would make this feature somewhat underwhelming.