Why is the generation time so long?

Open wjy3326 opened this issue 2 years ago • 0 comments

I am using the OpenBuddy LLaMA 7B model (openbuddy-openllama-7b-v5-fp16) with vLLM for generation. The first generation (269 characters, as measured by len(generated_text)) took 5.9 seconds, and the second (550 characters) took 12.3 seconds, but the third attempt returned only whitespace. Why is generation so slow, and is there a problem with my parameter settings? Also, why did the third attempt produce no real output? My code and the generated result are below.

The code:

# encoding: utf-8
import time

from vllm import LLM, SamplingParams

prompts = [
    "介绍一下北京。",  # "Tell me about Beijing."
    # "The president of the United States is",
    # "The capital of France is",
    # "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=800)
print("sampling_params", sampling_params)

# gpu_memory_utilization: the fraction of GPU memory that the vLLM model may use.
llm = LLM(model="/media/odin/software/PycharmProjects/OpenBuddy-main/model/openbuddy-openllama-7b-v5-fp16/", gpu_memory_utilization=0.5)

# First run: pass the prompt list.
time1 = time.time()
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}", len(generated_text))
print("time1", time.time()-time1)

# Second run: pass the same prompt as a plain string.
time2 = time.time()
outputs2 = llm.generate("介绍一下北京。", sampling_params)
for output in outputs2:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}", len(generated_text))
time3 = time.time()
print("time1", time3-time2)

# Third run: identical call to the second run.
outputs2 = llm.generate("介绍一下北京。", sampling_params)
for output in outputs2:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}", len(generated_text))
time4 = time.time()
print("time1", time4-time3)
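The script above prints len(generated_text), which counts characters rather than tokens, and it does not show why a run stopped. A minimal sketch of a timing helper that reports both is given here. It assumes the result objects expose `.outputs[0].text`, `.outputs[0].token_ids` and `.outputs[0].finish_reason`, as vLLM's output classes do to the best of my knowledge; `timed_generate` and `fake_generate` are hypothetical names, and the stand-in data is invented so the sketch runs without a GPU.

```python
import time
from types import SimpleNamespace


def timed_generate(generate_fn, prompt):
    """Call generate_fn(prompt) and return elapsed time plus token statistics.

    Assumption: generate_fn returns a list of objects shaped like vLLM's
    request outputs (.outputs[0].text / .token_ids / .finish_reason).
    """
    start = time.perf_counter()
    results = generate_fn(prompt)
    elapsed = time.perf_counter() - start

    completion = results[0].outputs[0]
    num_tokens = len(completion.token_ids)
    return {
        "seconds": elapsed,
        "chars": len(completion.text),
        "tokens": num_tokens,
        "tokens_per_second": num_tokens / elapsed if elapsed > 0 else 0.0,
        # Typically distinguishes an early stop from hitting max_tokens.
        "finish_reason": completion.finish_reason,
    }


def fake_generate(prompt):
    """Hypothetical stand-in for llm.generate; the token ids are made up."""
    completion = SimpleNamespace(text=" \n", token_ids=[29871, 13], finish_reason="stop")
    return [SimpleNamespace(prompt=prompt, outputs=[completion])]


stats = timed_generate(fake_generate, "介绍一下北京。")
print(stats)
```

With the real model, the call would look something like `timed_generate(lambda p: llm.generate([p], sampling_params), "介绍一下北京。")`. If the third run reports a handful of tokens and a stop-type finish reason, the model probably sampled an end-of-sequence token early, which can happen with temperature=0.8.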

The generated result:

sampling_params SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, temperature=0.8, top_p=0.95, top_k=-1, use_beam_search=False, stop=[], ignore_eos=False, max_tokens=800, logprobs=None)
INFO 07-04 21:44:55 llm_engine.py:59] Initializing an LLM engine with config: model='/media/odin/software/PycharmProjects/OpenBuddy-main/model/openbuddy-openllama-7b-v5-fp16/', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 07-04 21:44:55 tokenizer_utils.py:22] OpenLLaMA models do not support the fast tokenizer. Using the slow tokenizer instead.
INFO 07-04 21:45:00 llm_engine.py:128] # GPU blocks: 1331, # CPU blocks: 512
Processed prompts: 100%|██████████| 1/1 [00:05<00:00, 5.91s/it]
Processed prompts: 0%| | 0/1 [00:00<?, ?it/s]Prompt: '介绍一下北京。', Generated text: ' \n\n北京是中华人民共和国的首都和最大城市，也是全球第二大城市。它位于中国中部，北京市位于北京市区，北京市区是中国的政治、文化和经济中心之一，也是全球重要的国际政治、经济和文化中心。北京市的总面积是 16,410 平方公里，其中城区面积为 6,863 平方公里。北京市拥有悠久的历史和丰富的文化遗产，包括故宫、天坛、颐和园、长城等。此外，北京还是中国的政治、文化、教育和科技中心，许多国内外知名的高校和科研机构都设在北京。\n\n北京市的气候属于温带大陆性气候，四季分明，春秋温和，夏季炎热，冬季寒冷。全年平均气温约为 14℃左右。\n' 269
time1 5.915788412094116
Prompt: '介绍一下北京。', Generated text: ' \n\n北京是中国的首都，也是全球人口最多的城市之一。它坐落在中国的中部，是华北地区的核心城市。北京有着悠久的历史和丰富的文化遗产，是中国的文化和政治中心。此外，北京还是一个现代化的城市，拥有现代化的交通、通讯、教育、医疗等基础设施。\n\n北京有着悠久的历史，可以追溯到公元前 1045 年，是中国古代的历史文化名城之一。在明清两代，北京成为中国的政治、文化和军事中心，是中国的古都之一。随着时间的推移，北京经历了许多变革和发展，成为现代化的城市。\n\n北京有着丰富的文化遗产和旅游资源。其中最著名的是故宫博物院，它是中国古代皇宫的遗址，也是世界上最大的古代宫殿建筑群之一。此外，北京还有很多历史遗迹和文化景点，如天安门广场、长城、颐和园、故宫博物院附近的古街道等。\n\n北京还是一个现代化的城市，拥有现代化的交通、通讯、教育、医疗等基础设施。它的高速公路和轨道交通系统使得交通便利，出行非常方便。此外，北京有许多著名的大学和高校，如清华大学、北京大学、中央医院等，是中国的知名高校之一。\n\n总之，北京是一个历史悠久、文化繁荣、现代化发达的城市，是中国的文化和政治中心。它不仅有着丰富的旅游资源，还拥有现代化的基础设施和高水平的教育和医疗机构。如果您有机会到访北京，一定不要错过这个充满魅力的城市。\n' 550
time1 12.32071328163147
Processed prompts: 100%|██████████| 1/1 [00:12<00:00, 12.32s/it]
Processed prompts: 100%|██████████| 1/1 [00:00<00:00, 20.44it/s]
Prompt: '介绍一下北京。', Generated text: ' \n' 6
time1 0.04918670654296875
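As a rough check on the numbers in the log, the two full runs can be converted into a generation rate. This is only a sketch: it uses the character counts printed by len(generated_text) as a stand-in for token counts, which is an approximation, and it leaves out the third run because that one stopped almost immediately.

```python
# (characters reported by len(generated_text), seconds reported as "time1")
runs = [
    (269, 5.915788412094116),
    (550, 12.32071328163147),
]

# Characters generated per second for each run.
rates = [chars / seconds for chars, seconds in runs]

for (chars, seconds), rate in zip(runs, rates):
    print(f"{chars} chars in {seconds:.2f} s -> {rate:.1f} chars/s")
# 269 chars in 5.92 s -> 45.5 chars/s
# 550 chars in 12.32 s -> 44.6 chars/s
```

Both runs come out at about 45 characters per second, so the second run took longer because its output was roughly twice as long, not because generation slowed down between calls.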

thanks!

Jul 04 '23 14:07 wjy3326