Why does evaluate_from_model run so slowly on my side?
I am running with 8 A40 GPUs and I think it should be fast. I set up the environment and ran alpaca_eval evaluate_from_model --model_configs 'robin-v2-7b' --annotators_config 'claude'
and alpaca_eval evaluate_from_model --model_configs 'robin-v2-7b' --annotators_config 'alpaca_eval_gpt4'
but each takes a few days.
Also, it is surprising that I didn't provide any API key but it still runs. Why is that? Thank you so much for your help!
Hi @Peanuttttttttt, can you show the model_configs you are using? My guess is that all the time is spent in generation and not in eval, which would explain the slowness and why it runs despite not providing an API key. You haven't gotten any results at the end, right?
As to why it's slow, my guess is that it's not using your GPUs, but I need to check the configs to confirm!
Wanted to quickly chime in and say that the local model evaluation script isn't parallelized. The default uses device_map="auto", which splits the model across the 8 GPUs but runs in model parallel, so only 1 GPU is ever active at any given time. Given that this seems to be a 7B model, it can actually fit on 1 GPU, so the communication overhead here will slow down the results even further. I would suggest exposing only 1 GPU via CUDA_VISIBLE_DEVICES=0 and rerunning the command to see if that speeds things up.
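For example, assuming the same annotator config used above (this is just a sketch of the suggestion, not a tested command), prefixing the original invocation with the environment variable restricts generation to the first GPU:
CUDA_VISIBLE_DEVICES=0 alpaca_eval evaluate_from_model --model_configs 'robin-v2-7b' --annotators_config 'alpaca_eval_gpt4'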
@YannDubs I have checked again and I am using the GPU, and I haven't gotten the result in the end. Here is my model_config:
prompt_template: "guanaco-7b/prompt.txt"
fn_completions: "huggingface_local_completions"
completions_kwargs:
  model_name: "LMFlow/Full-Robin-7b-v2"
  model_kwargs:
    torch_dtype: 'bfloat16'
  max_new_tokens: 1800
  temperature: 0.7
  top_p: 1.0
  do_sample: True
pretty_name: "Robin 7b v2"
link: "https://huggingface.co/LMFlow/Full-Robin-7b-v2"
@rtaori Thanks! That sped things up, but it still takes about 10 hours.
That seems roughly in the right ballpark. I'm not sure how fast A40s are, but Alpaca 7B took around 3 hours on 1 A100 GPU. Also, Alpaca responses tend to be shorter, which reduces generation time; if Robin responses tend to be longer, that can significantly increase generation time. Your model config looks right, so I would suggest waiting it out and seeing.
I'm marking this issue as resolved now, but please open a new issue if you run into any more problems.
Hi @rtaori @YannDubs! I am wondering whether the data is parallelized when using evaluate_from_model. That is, with 8 GPUs, I want to generate 8 responses at the same time, each on a different GPU. From the discussion above, it seems that this is not supported yet, and the recommended way is to use 1 GPU whenever the whole model fits on it. Is this interpretation correct?
Hi @liutianlin0121, this is not currently supported. Also, the inference we provide is generally pretty slow compared to current standards. If it's too slow for you, I would actually suggest doing inference with [TGI](https://github.com/huggingface/text-generation-inference) to generate the outputs and then running alpaca_eval directly on the outputs.
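For example, once the TGI generations are saved in the JSON format that alpaca_eval expects (the file name below is just a placeholder), they can be evaluated directly with something like:
alpaca_eval --model_outputs 'outputs.json' --annotators_config 'alpaca_eval_gpt4'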