
Why does evaluate_from_model run so slowly on my side?

Open • hanningzhang opened this issue 1 year ago • 3 comments

I am running with 8 A40 GPUs, so I think it should be fast. I set up the environment and ran `alpaca_eval evaluate_from_model --model_configs 'robin-v2-7b' --annotators_config 'claude'` and `alpaca_eval evaluate_from_model --model_configs 'robin-v2-7b' --annotators_config 'alpaca_eval_gpt4'`, but it takes a few days. Also, it is surprising that it still runs even though I didn't provide any API key. Why is that? Thank you so much for your help!

hanningzhang avatar Jun 15 '23 17:06 hanningzhang

Hi @hanningzhang, can you show the model_configs you are using? My guess is that all the time is spent in generation rather than evaluation, which would explain both the slowness and why it runs despite no API key being provided. You haven't gotten the results at the end, right?

As to why it's slow, my guess is that it's not using your GPUs, but I need to check the configs for that!

YannDubs avatar Jun 16 '23 10:06 YannDubs

Wanted to quickly chime in and say that the local model evaluation script isn't parallelized. The default uses `device_map="auto"`, which splits the model across the 8 GPUs but runs it model-parallel, so only one GPU is ever active at any given time. Given that this appears to be a 7B model, it can actually fit on a single GPU, so the communication overhead here slows down the run even further. I would suggest exposing only one GPU via `CUDA_VISIBLE_DEVICES=0` and rerunning the command to see if that speeds things up.
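Concretely, the rerun would look something like this (same command as above, just with the GPU restriction; untested sketch):

```bash
# Expose a single GPU so device_map="auto" places the whole 7B model
# on one device instead of sharding it across all eight.
CUDA_VISIBLE_DEVICES=0 alpaca_eval evaluate_from_model \
    --model_configs 'robin-v2-7b' \
    --annotators_config 'alpaca_eval_gpt4'
```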

rtaori avatar Jun 16 '23 12:06 rtaori

@YannDubs I have checked again and I am using the GPUs, and I still haven't gotten the result. Here is my model_config:

```yaml
  prompt_template: "guanaco-7b/prompt.txt"
  fn_completions: "huggingface_local_completions"
  completions_kwargs:
    model_name: "LMFlow/Full-Robin-7b-v2"
    model_kwargs:
      torch_dtype: 'bfloat16'
    max_new_tokens: 1800
    temperature: 0.7
    top_p: 1.0
    do_sample: True
  pretty_name: "Robin 7b v2"
  link: "https://huggingface.co/LMFlow/Full-Robin-7b-v2"
```

@rtaori Thanks! That speeds things up, but it still takes about 10 hours.

hanningzhang avatar Jun 16 '23 17:06 hanningzhang

That seems roughly in the right ballpark. I'm not sure how fast A40s are, but Alpaca 7B took around 3 hours on one A100 GPU. Also, Alpaca responses tend to be shorter, which reduces generation time; if Robin responses tend to be longer, that can significantly increase generation time. Your model config looks right, so I would suggest waiting it out and seeing.

rtaori avatar Jun 16 '23 20:06 rtaori

I'm marking this issue as resolved now, but please open another issue if you run into any further problems.

rtaori avatar Jun 16 '23 20:06 rtaori

Hi! @rtaori @YannDubs I am wondering whether data is parallelized when using evaluate_from_model. That is, with 8 GPUs, I would like to generate 8 responses at the same time, each on a different GPU. From the discussion above, it seems this is not supported yet, and the recommended approach is to use 1 GPU whenever the whole model fits on it. Is this interpretation correct?

liutianlin0121 avatar Nov 23 '23 13:11 liutianlin0121

Hi @liutianlin0121, this is not currently supported. Also, the inference we provide is generally pretty slow compared to current standards. If it's too slow for you, I would suggest doing inference with [TGI](https://github.com/huggingface/text-generation-inference) to generate the outputs and then running alpaca_eval directly on those outputs.
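For example, once the generations are saved, the direct evaluation would look roughly like this (a minimal sketch; `outputs.json` is a placeholder filename, assumed to hold a list of records with `instruction` and `output` fields, the format alpaca_eval expects):

```bash
# Annotate precomputed model outputs directly, skipping local generation entirely.
alpaca_eval --model_outputs 'outputs.json' --annotators_config 'alpaca_eval_gpt4'
```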

YannDubs avatar Nov 24 '23 08:11 YannDubs