Lyu Han
PyTorch has to be installed.
The architecture of InternLM2 differs from InternLM's: it adopts GQA and has no attention bias. Unlike other GQA models, it packs the q, k, and v weights into one tensor....
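For reference, here is a minimal sketch of how such a packed qkv tensor could be split back into separate q, k, and v projections. The layout (each kv group storing its query heads followed by one key and one value head) and all sizes below are illustrative assumptions, not the verified InternLM2 format:

```python
import torch

# Assumed illustrative sizes; the real InternLM2 configs may differ.
hidden_size = 4096
num_heads = 32        # query heads
num_kv_heads = 8      # shared key/value heads (GQA)
head_dim = hidden_size // num_heads
q_per_kv = num_heads // num_kv_heads  # query heads per kv group

# Packed projection: per kv group, q_per_kv query heads plus one k and one v head.
wqkv = torch.randn((q_per_kv + 2) * num_kv_heads * head_dim, hidden_size)

# View as (kv groups, heads within a group, head_dim, hidden_size), then slice.
grouped = wqkv.view(num_kv_heads, q_per_kv + 2, head_dim, hidden_size)
wq = grouped[:, :q_per_kv].reshape(num_heads * head_dim, hidden_size)
wk = grouped[:, -2].reshape(num_kv_heads * head_dim, hidden_size)
wv = grouped[:, -1].reshape(num_kv_heads * head_dim, hidden_size)

assert wq.shape == (num_heads * head_dim, hidden_size)
assert wk.shape == (num_kv_heads * head_dim, hidden_size)
assert wv.shape == (num_kv_heads * head_dim, hidden_size)
```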
Hi @hyhhyh402, have you submitted a PR?
Based on FasterTransformer, we have implemented an efficient inference engine, [TurboMind](https://github.com/InternLM/lmdeploy#introduction):
- It supports llama and llama-2
- It models the inference of a conversational LLM as a persistently...
GQA has been supported by [LMDeploy](https://github.com/InternLM/lmdeploy), which is developed on top of FasterTransformer.
Only the 70B model uses GQA.
@AnyangAngus GQA in LMDeploy/TurboMind doesn't distinguish between the 7B, 13B, and 70B models. But as far as I know, [llama-2-7b/13b](https://huggingface.co/meta-llama/Llama-2-70b) doesn't have a GQA block.
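As a quick illustration of what GQA means computationally (this is a generic sketch with made-up shapes, not LMDeploy's implementation): each group of query heads shares one key/value head, which can be realized by repeating the kv heads before a standard multi-head attention:

```python
import torch
import torch.nn.functional as F

batch, seq_len = 2, 16
num_heads, num_kv_heads, head_dim = 32, 8, 128  # assumed sizes

q = torch.randn(batch, num_heads, seq_len, head_dim)
k = torch.randn(batch, num_kv_heads, seq_len, head_dim)
v = torch.randn(batch, num_kv_heads, seq_len, head_dim)

# Each group of num_heads // num_kv_heads query heads shares one kv head:
# expand the kv heads so ordinary multi-head attention can be applied.
repeat = num_heads // num_kv_heads
k = k.repeat_interleave(repeat, dim=1)
v = v.repeat_interleave(repeat, dim=1)

out = F.scaled_dot_product_attention(q, k, v)
# out: (batch, num_heads, seq_len, head_dim)
```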
LMDeploy v0.4.1 can help deploy InternVL. Here is a guide: https://github.com/OpenGVLab/InternVL/pull/152
When deploying VLMs on GPUs with limited memory, such as the V100 and T4, it is typically necessary to use multiple GPUs because of the model's large size....
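For example, with LMDeploy's pipeline API, tensor parallelism across two GPUs can be requested through the engine config. This is a minimal sketch assuming a recent LMDeploy version whose `TurbomindEngineConfig` accepts a `tp` argument; the model id and image URL are placeholders:

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

# Split the model across 2 GPUs (e.g., two V100s or T4s) via tensor parallelism.
pipe = pipeline(
    'OpenGVLab/InternVL-Chat-V1-5',  # example model id
    backend_config=TurbomindEngineConfig(tp=2),
)

image = load_image('https://example.com/sample.jpg')  # placeholder URL
response = pipe(('Describe this image.', image))
print(response.text)
```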