
How to set ParallelConfig and SchedulerConfig?

Open wjy3326 opened this issue 2 years ago • 1 comment

Is ParallelConfig.pipeline_parallel_size used for multi-GPU setups? Can it be set to the number of GPU cards? Does it relate to processing multiple prompts and generating multiple results in parallel? For example, with 2 GPU cards and 7 requests, will the 7 requests be distributed across the 2 cards simultaneously, and how is the allocation done? Also, what do the parameters "max_num_batched_tokens" and "max_num_seqs" in SchedulerConfig represent? How should they be set to preserve longer context?

wjy3326 • Jul 04 '23 11:07

vLLM currently does not support pipeline parallelism; the ParallelConfig.pipeline_parallel_size attribute is reserved for future use. When multiple GPUs are used, vLLM relies on tensor parallelism: the model weights are sharded evenly across all GPU workers. Therefore, in your example, all 7 requests are processed jointly by the 2 GPUs, with each GPU holding a shard of the model and the GPUs exchanging intermediate tensors via NCCL all-reduce.
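
For reference, here is a minimal sketch of how these settings map onto vLLM's Python entrypoint. This is not from the thread: the model name and the numeric values are illustrative, and the max_model_len argument follows later vLLM releases.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-13b",      # illustrative model; any supported model works
    tensor_parallel_size=2,        # shard the model weights across 2 GPUs (tensor parallelism)
    max_num_seqs=64,               # SchedulerConfig: max sequences batched per iteration
    max_num_batched_tokens=8192,   # SchedulerConfig: max total tokens batched per iteration
    max_model_len=8192,            # max context length (prompt + generated tokens)
)

# All 7 requests run on both GPUs at once; the scheduler batches them
# subject to the max_num_seqs / max_num_batched_tokens limits above.
prompts = [f"Request {i}" for i in range(7)]
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
```

Raising max_model_len is what preserves longer context, at the cost of more KV-cache memory per sequence; max_num_batched_tokens generally needs to be at least as large as max_model_len so a full-length prompt can be prefilled in one step.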

WoosukKwon • Jul 04 '23 17:07