ytxiong
This looks like flash_attn was not installed correctly. Please double-check whether flash_attn is actually installed in your current environment.
Does `import flash_attn` succeed? `rotary_emb` is a CUDA extension operator that ships with flash_attn, so this should not be a version issue. In principle, once flash_attn is installed successfully, this package can be imported.
For installing flash_attn, you can refer to [this guide](https://github.com/InternLM/InternLM/blob/main/doc/en/install.md).
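A minimal sanity check, assuming flash_attn was installed following the InternLM guide above (which also builds the rotary CUDA extension); run it in the same environment that raises the error:

```python
# Sketch of a quick check; module names follow the thread above.
import flash_attn
print(flash_attn.__version__)  # should print the installed version

# rotary_emb is the CUDA extension built alongside flash_attn; if this import
# fails, the extension was not built for the current environment.
import rotary_emb
print(rotary_emb.__file__)
```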
@zucchini-nlp thank you very much. I see that in verl, `position_ids[0]` is passed to flash attention; I am not sure that is correct.
> Minimal code snippet:
>
> from vllm import LLM
> llm = LLM(
>     model=YOUR_MODEL_PATH,
>     pipeline_parallel_size=2,
> )

@jeejeelee Thank you. You mean [this](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/llm.py#L52)? I didn't see the...
> See: https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py#L112

OK, thank you very much, I will give it a try.
@jeejeelee I have tried to use PP with the LLM API; however, I ran into this error:

> raise NotImplementedError(
> NotImplementedError: Pipeline parallelism is only supported through AsyncLLMEngine as performance will be severely...
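For reference, a minimal sketch of the AsyncLLMEngine route that error points to (assuming a vLLM version where pipeline parallelism is only supported through the async engine; `YOUR_MODEL_PATH` is the same placeholder as in the snippet above):

```python
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Build the async engine with pipeline parallelism enabled.
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(
        model="YOUR_MODEL_PATH",   # placeholder from the snippet above
        pipeline_parallel_size=2,
    )
)

async def run(prompt: str) -> str:
    params = SamplingParams(max_tokens=64)
    final = None
    # generate() yields RequestOutput objects as tokens stream in;
    # keep the last one for the completed text.
    async for output in engine.generate(prompt, params, request_id="req-0"):
        final = output
    return final.outputs[0].text

print(asyncio.run(run("Hello, my name is")))
```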
Thank you. I am new to vLLM; is this online inference?
So, what about offline inference? Can the async engine be used for offline inference?
@jeejeelee Okay, thank you.