
[Bug]: Qwen1.5-14B-chat fails to run

Open hhbb979 opened this issue 1 year ago • 5 comments

Installation Method | 安装方法与平台

OneKeyInstall (one-click install script, Windows)

Version | 版本

Latest | 最新版

OS | 操作系统

Windows

Describe the bug | 简述

Traceback (most recent call last):
  File ".\request_llms\local_llm_class.py", line 158, in run
    for response_full in self.llm_stream_generator(**kwargs):
  File ".\request_llms\bridge_qwen_local.py", line 46, in llm_stream_generator
    for response in self._model.chat_stream(self._tokenizer, query, history=history):
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\anaconda3\envs\GPT_academic371\Lib\site-packages\torch\nn\modules\module.py", line 1688, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'Qwen2ForCausalLM' object has no attribute 'chat_stream'

Screen Shot | 有帮助的截图


Terminal Traceback & Material to Help Reproduce Bugs | 终端traceback(如有) + 帮助我们复现的测试材料样本(如有)

No response

hhbb979 avatar Feb 09 '24 01:02 hhbb979

Qwen1.5 removed chat and chat_stream. See https://qwen.readthedocs.io/en/latest/inference/chat.html for the new usage; it is enough to modify llm_stream_generator in bridge_qwen_local.py:

device = get_conf('LOCAL_MODEL_DEVICE')      # read from gpt_academic's config
system_prompt = get_conf('INIT_SYS_PROMPT')  # read from gpt_academic's config

def llm_stream_generator(self, **kwargs):
    def adaptor(kwargs):
        query = kwargs['query']
        max_length = kwargs['max_length']
        top_p = kwargs['top_p']
        temperature = kwargs['temperature']
        history = kwargs['history']
        return query, max_length, top_p, temperature, history

    query, max_length, top_p, temperature, history = adaptor(kwargs)

    # Qwen1.5 no longer ships chat_stream, so build the prompt via the tokenizer's chat template
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query}
    ]
    text = self._tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    model_inputs = self._tokenizer([text], return_tensors="pt").to(device)

    # Run generate() in a background thread and stream tokens through TextIteratorStreamer
    from transformers import TextIteratorStreamer
    from threading import Thread
    streamer = TextIteratorStreamer(self._tokenizer, skip_prompt=True, skip_special_tokens=True)
    generation_kwargs = dict(model_inputs, streamer=streamer, max_new_tokens=512)
    thread = Thread(target=self._model.generate, kwargs=generation_kwargs)
    thread.start()

    # Yield the accumulated response as each new chunk of text arrives
    response = ""
    for new_text in streamer:
        response += new_text
        yield response
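
One caveat with the patch above: adaptor() pulls out max_length, top_p, temperature and history, but none of them are actually used, so generate() always runs with default sampling and a fixed max_new_tokens=512, and previous turns are not folded into messages. A minimal sketch of forwarding the sampling parameters (an assumption on my part, not part of the original patch), using the standard transformers generation arguments and treating gpt_academic's max_length as the new-token budget:

    # Hedged sketch: replace the generation_kwargs line above with this to forward
    # the extracted sampling parameters into generate().
    generation_kwargs = dict(
        model_inputs,
        streamer=streamer,
        max_new_tokens=max_length,   # assumed: interpret max_length as a cap on new tokens
        do_sample=True,
        top_p=top_p,
        temperature=temperature,
    )
    thread = Thread(target=self._model.generate, kwargs=generation_kwargs)
    thread.start()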

kaltsit33 avatar Mar 10 '24 14:03 kaltsit33

With this code, qwen1.5-14b-chat is extremely slow on a single V100: roughly one character per second of output, with GPU utilization at 100%. I am not sure whether the model or the code is to blame. qwen-14b-chat runs quite fast, with low GPU utilization and smooth streaming output.
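
One thing worth ruling out is the precision and placement of the loaded weights: a checkpoint that ends up in float32, or partly on the CPU, can easily produce roughly one token per second on a V100. A hedged diagnostic sketch (meant to be dropped into the bridge code, where self._model is the already-loaded model; it only inspects state, it changes nothing):

    # Hedged diagnostic: check how the local Qwen model was actually loaded.
    import torch

    param = next(self._model.parameters())
    print("dtype:", param.dtype)     # expect torch.float16, not torch.float32
    print("device:", param.device)   # expect cuda:0, not cpu
    print("cuda available:", torch.cuda.is_available())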

zerotoone01 avatar Mar 21 '24 12:03 zerotoone01

Same here: about one character per second, very slow, with the GPU fully occupied. Strange.

ZH-007 avatar Apr 14 '24 05:04 ZH-007

I have tested the 14B, 32B, and 72B variants of Qwen1.5; inference with the official transformers code is very slow for all of them. I recommend deploying with vLLM or llama.cpp instead.
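
For reference, a minimal sketch of running Qwen1.5-14B-Chat through vLLM's offline Python API instead of the transformers generate() path (assumptions: vllm is installed, the model is the public Hugging Face repo Qwen/Qwen1.5-14B-Chat, and there is enough GPU memory for fp16 weights):

    # Hedged sketch: Qwen1.5-14B-Chat served by vLLM instead of transformers.generate().
    from vllm import LLM, SamplingParams

    llm = LLM(model="Qwen/Qwen1.5-14B-Chat", dtype="float16")
    sampling = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=512)

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello"},
    ]
    # Reuse the tokenizer's chat template, as in the patch earlier in the thread.
    prompt = llm.get_tokenizer().apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    outputs = llm.generate([prompt], sampling)
    print(outputs[0].outputs[0].text)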

kaltsit33 avatar Apr 20 '24 05:04 kaltsit33

About 10 seconds per word, with all 24 GB of VRAM occupied. Both Qwen2 and Qwen1.5 are very slow in my tests; I don't know why.

hejian41 avatar Jun 08 '24 11:06 hejian41