
[DistEnv] Strategy for `ray` on multiple nodes, with sharding as option

Open blythed opened this issue 1 year ago • 4 comments

@kartik4949 to add information, discussion points, diagrams, links.

blythed avatar Dec 30 '23 09:12 blythed

There are multiple ways to achieve model parallelism for general torch models.

  1. DeepSpeed
  2. FSDP

The above are two of the most popular libraries that enable model parallelism. These libraries essentially implement model-parallelism algorithms with inter-GPU communication support.

i.e. DeepSpeed will take your model, partition it into parts, and then manage the communication between the partitions across multiple GPUs. Source: https://www.deepspeed.ai/inference/

So if a user has a model, they can use DeepSpeed to shard it across multiple GPUs.
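As a rough illustration, here is a minimal sketch of what this looks like, assuming a toy torch model in place of the user's real one; the `deepspeed.init_inference` call and its `mp_size` argument are from DeepSpeed's inference API linked above.

```python
import torch
import torch.nn as nn
import deepspeed

# Toy stand-in for the user's torch model.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

# Partition the model across the local GPUs; DeepSpeed manages the
# inter-GPU communication between the partitions.
engine = deepspeed.init_inference(
    model,
    mp_size=torch.cuda.device_count(),  # e.g. 4 on a single machine
    dtype=torch.float16,
)

# Run inference through the sharded engine.
batch = torch.randn(16, 1024, dtype=torch.float16, device="cuda")
out = engine(batch)

# DeepSpeed's launcher starts one process per GPU:
#   deepspeed --num_gpus 4 infer.py
```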

Let's take two scenarios:

  1. 1 machine with 4 GPUs
  2. 2 machines with 2 GPUs each (4 total)

From now on, we will be referring to the above scenario list!

If the user falls under scenario 1, directly using DeepSpeed will suffice and they can achieve model parallelism, though I'm not sure whether we will have a dashboard to view the process. Here DeepSpeed will create one process per GPU and partition the model for inference.

Now comes scenario 2: if the user has 2 machines, sharding the model across multiple machines might not be the best option, as the inter-node communication can become a bottleneck!

Moreover, inter-node model parallelism with DeepSpeed requires some manual tasks like hostfile creation, etc.

But Ray can handle the inter-node/machine communication very elegantly, so the idea is: what if we create a local intra-machine GPU worker group with DeepSpeed, and create this group on each node/machine?

[Diagram: Ray distributes batch partitions across two machines; on each machine, DeepSpeed shards a copy of the model across the 2 local GPUs]

Let's take a look at the above diagram.

The blue box is Ray, which takes a batch of input data and distributes it across the two machines. Each batch partition on a machine is given to DeepSpeed, which holds a copy of the model sharded/distributed across the 2 GPUs in that machine. For example:

If the batch size is 16,

a partition of 8 samples (partition 1) will be given to machine 1, where the model is distributed/partitioned across the 2 GPUs in that machine with DeepSpeed.

The same happens on machine 2.

The results are calculated, gathered back by Ray, and returned on the client node. A sketch of this flow is below.
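Here is a minimal sketch of the idea, assuming a Ray cluster spanning the two machines. `DeepSpeedShardGroup` and `load_sharded_model` are hypothetical placeholders: the latter stands in for building the model and calling `deepspeed.init_inference` as shown earlier.

```python
import ray
import torch

ray.init(address="auto")  # connect to the running Ray cluster

@ray.remote(num_gpus=2)  # one actor per machine, reserving its 2 local GPUs
class DeepSpeedShardGroup:
    def __init__(self):
        # Hypothetical helper: builds the model and shards the local copy
        # across this machine's 2 GPUs with DeepSpeed (mp_size=2).
        self.engine = load_sharded_model(mp_size=2)

    def infer(self, batch):
        return self.engine(batch)

# One worker group per machine (2 machines in scenario 2).
workers = [DeepSpeedShardGroup.remote() for _ in range(2)]

# Split a batch of 16 into two partitions of 8, one per machine...
batch = torch.randn(16, 1024)
parts = torch.chunk(batch, len(workers))

# ...run them in parallel, then gather the results back on the client node.
results = ray.get([w.infer.remote(p) for w, p in zip(workers, parts)])
output = torch.cat(results)
```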

kartik4949 avatar Dec 30 '23 12:12 kartik4949

@kartik4949 great explanation!

Questions:

  1. how much support do we have for this scenario using vLLM?
  2. is there any difference between FSDP and deepspeed?

blythed avatar Dec 30 '23 18:12 blythed

> @kartik4949 great explanation!
>
> Questions:
>
>   1. how much support do we have for this scenario using vLLM?
>   2. is there any difference between FSDP and deepspeed?

vLLM uses a method like this to support tensor model parallelism.

It won't use what's described here.

But our support for basic transformers or torch models can use this. A sketch of the vLLM side is below.
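For reference, a minimal sketch of vLLM's built-in tensor parallelism (the model name is illustrative); vLLM handles the GPU sharding itself via `tensor_parallel_size`, so it would not go through the DeepSpeed worker groups above.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-13b",  # illustrative model
    tensor_parallel_size=4,    # shard across 4 GPUs, e.g. 2 nodes x 2 GPUs
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
```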

jieguangzhou avatar Jan 02 '24 09:01 jieguangzhou

Great. If we implement this, we'll be able to completely offload the model computation layer to a Ray cluster. Also, I suggest completing this in conjunction with https://github.com/SuperDuperDB/superduperdb/discussions/1604.

Otherwise, the model will still load onto the local machine, which would make this feature somewhat underwhelming.

jieguangzhou avatar Jan 02 '24 11:01 jieguangzhou