[Feature] How can I achieve multi-GPU data parallelism with vLLM evaluation, like with HF models?
Describe the feature
When I run evaluation, the model type is VLLM with the following parameters:
However, GPU utilization shows that only one card is used for the evaluation task.
I would like the task to be split into several parts and evaluated on 8 GPUs in parallel. Could this feature be added, or is it already possible? I'd appreciate an explanation. Many thanks!
For comparison, if I set the model type to HF, this happens automatically.
Will you implement this feature yourself?
- [ ] I would like to implement this feature and contribute code to OpenCompass!
Based on the config above, you can set
model_kwargs=dict(tensor_parallel_size=8),
for your case.
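For reference, a minimal sketch of what such a model entry might look like; the abbr, path, and other values below are placeholders, not taken from this thread:

from opencompass.models import VLLM

models = [
    dict(
        type=VLLM,
        abbr='my-model-vllm',                        # placeholder abbreviation
        path='/path/to/your/model',                  # placeholder model path
        model_kwargs=dict(tensor_parallel_size=8),   # shard the model across 8 GPUs
        max_out_len=100,
        max_seq_len=2048,
        batch_size=32,
        run_cfg=dict(num_gpus=8, num_procs=1),       # ask the runner for 8 GPUs for this single task
    )
]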
@liushz Thank you for your response; I appreciate your clarification. However, the parameter in your reply pertains to setting tensor parallelism in vLLM. My intention is to load the entire model onto each of the eight GPUs, thereby distributing tasks in parallel across these GPUs. This approach should theoretically yield an eightfold acceleration in evaluation speed.
Hi @liushz, I also want to know how to achieve data parallelism with vLLM during evaluation.
Please try NumWorkerPartitioner
https://github.com/open-compass/opencompass/blob/main/opencompass/partitioners/num_worker.py#L17
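A quick sketch of how that partitioner is typically wired into the infer section; the num_worker value and runner settings below are illustrative assumptions, not taken from this thread:

from opencompass.partitioners import NumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask

# Split the datasets into num_worker chunks, then let LocalRunner launch the
# resulting sub-tasks in parallel, one per GPU.
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
    runner=dict(
        type=LocalRunner,
        max_num_workers=8,
        task=dict(type=OpenICLInferTask),
    ),
)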
@tonysy Could you possibly offer a quick example? I'm quite unsure how to use it. Many thanks for your assistance.
I think this is covered in the vLLM docs: https://docs.vllm.ai/en/latest/serving/distributed_serving.html. Setting tensor_parallel_size equal to the number of GPUs works for me.
@IcyFeather233 Thanks 😂. I understand that tensor_parallel_size can be set to the number of GPUs (2, 4, 8) to shard the model across cards. What I mean here is tensor_parallel_size=1, with every GPU loading a full copy of the model, and then data parallelism: evaluating different portions of the same task's data at the same time. I recently got this working with NumWorkerPartitioner. The key parameter configuration is below for anyone who needs it. @darrenglow. Thanks also to @tonysy. It would be great if this could make it into the docs soon.
@noforit This is how I configured it, but still only one GPU is running. Could you help me figure out why?
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=2),
    runner=dict(
        type=LocalRunner,
        max_num_workers=16,
        task=dict(type=OpenICLInferTask))
)

models = [
    dict(
        type=VLLM,
        abbr='qwen-7b-chat-vllm',
        path="/home/zbl/data/llm/qwen/Qwen-7B-Chat",
        model_kwargs=dict(tensor_parallel_size=1),
        meta_template=_meta_template,
        max_out_len=100,
        max_seq_len=2048,
        batch_size=100,
        generation_kwargs=dict(temperature=0),
        end_str='<|im_end|>',
    )
]
@IcyFeather233 I understand what you mean: the tensor_parallel_size parameter enables multi-GPU inference, but in my tests multi-GPU inference was not faster than a single GPU.
So what I want is to run multiple tasks in parallel: for example, with n tasks and m model instances, each model instance runs inference for one task.
@Zbaoli Comparing your config with mine, you are missing one parameter.
Try adding it?
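Judging from the reply below, the parameter in question appears to be run_cfg; a hedged sketch of where it would sit in the model dict (other fields as in the config above):

models = [
    dict(
        type=VLLM,
        path="/home/zbl/data/llm/qwen/Qwen-7B-Chat",
        model_kwargs=dict(tensor_parallel_size=1),  # one full copy of the model per GPU
        run_cfg=dict(num_gpus=1, num_procs=1),      # each partitioned sub-task claims its own GPU
        # ... remaining fields unchanged ...
    )
]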
@noforit Thanks for your reply, but after adding run_cfg=dict(num_gpus=1, num_proces=1) to the models config, there is still only one GPU running.
@Zbaoli Strange 😂. What about setting CUDA_VISIBLE_DEVICES before launching the program?
Or try debugging in /opencompass/opencompass/runners/local.py? It automatically detects the number of available GPUs and so on.
Shall we connect on WeChat? I'll email you.
After using NumWorkerPartitioner here, the dataset was split into 8 parts, but the final summary cannot aggregate the metrics of the split parts back together. Do you run into this as well?
May I ask: can't the SizePartitioner provided by OpenCompass already split the dataset? Or is NumWorkerPartitioner's partitioning more efficient?
SizePartitioner and NumWorkerPartitioner are two different splitting strategies: one splits by a given task size, the other splits by the number of workers (i.e. GPUs).
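A rough illustration of the difference (parameter values are examples, not from this thread; runner settings omitted):

from opencompass.partitioners import SizePartitioner, NumWorkerPartitioner

# SizePartitioner: sub-tasks are capped at max_task_size samples, so the
# number of sub-tasks grows with the dataset size.
infer_by_size = dict(
    partitioner=dict(type=SizePartitioner, max_task_size=2000),
)

# NumWorkerPartitioner: the datasets are divided into num_worker chunks,
# matching the number of GPUs/workers you plan to run in parallel.
infer_by_worker = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
)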
When I use vLLM, I keep getting a timeout error for some reason. The model settings are above and the error is below. What could be going wrong?