lullabies777

Results: 7 issues by lullabies777

Gemma 7B: https://huggingface.co/google/gemma-7b
Gemma 2B: https://huggingface.co/google/gemma-2b
Blog: https://blog.google/technology/developers/gemma-open-models/
Paper: https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf

I found that the Gemma scripts do not support Gemma 2. Is there any plan to add support for Gemma 2?

feature request
new model

I quantized the Qwen2-0.5B model, which is approximately 800 MB on disk. However, inference needs about 6 GB of GPU memory, likely due to the KV cache. Can I disable the KV cache...

duplicate
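The KV-cache question above can be sanity-checked with a back-of-the-envelope estimate. The sketch below is illustrative only: the architecture numbers (24 layers, 2 KV heads with head dim 64 under grouped-query attention, fp16 storage) are assumptions for a Qwen2-0.5B-like model, not read from the actual checkpoint's config.json.

```python
# Rough KV-cache size estimate. All architecture numbers are
# assumptions for illustration; check the model's config.json.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Bytes used by the K and V tensors across all layers."""
    # factor 2 = one K tensor plus one V tensor per layer
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size

# One 4096-token sequence for the assumed 0.5B-scale config:
size = kv_cache_bytes(num_layers=24, num_kv_heads=2, head_dim=64,
                      seq_len=4096, batch_size=1)
print(f"{size / 2**20:.1f} MiB")  # -> 48.0 MiB
```

Under these assumptions the cache itself is tens of MiB, so a 6 GB footprint more likely comes from the serving engine pre-allocating a large KV-cache pool up front (many engines reserve a fixed fraction of GPU memory by default) rather than from the cache this one request actually fills.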

### Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [X] 2. The bug has not been fixed in the latest version.
- ...

Following the tutorial at https://modelscope-agent.readthedocs.io/en/stable/llms/qwen2_tool_calling.html, I served qwen2-7b with vLLM. It told me to add --enable-auto-tool-choice to the vLLM launch command, but --enable-auto-tool-choice requires a matching --tool-call-parser, and vLLM currently only ships mistral and hermes parsers. How can this be done? Error: "auto" tool choice requires --enable-auto-tool-choice and --tool-call-parser to be set

Work in Progress
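One possible answer to the question above: Qwen2's chat template emits Hermes-style tool calls, so the existing hermes parser may already work. A minimal launch sketch, assuming a recent vLLM build that ships the hermes parser; the model name is illustrative.

```shell
# Sketch of a vLLM launch for Qwen2 tool calling.
# Assumes the hermes parser handles Qwen2's Hermes-style tool-call
# format; verify against the vLLM version you are running.
vllm serve Qwen/Qwen2-7B-Instruct \
    --enable-auto-tool-choice \
    --tool-call-parser hermes
```

If the installed version rejects the parser, the alternative is implementing a custom tool-call parser plugin, which newer vLLM releases support.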

I tried running the Nemo-12B 4-bit model on one T4 GPU, but inference is very slow. Additionally, the 'forward' function takes much longer than 'generate'. Is there a speedup...
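One reason 'forward' can cost far more than 'generate' is the KV cache: 'generate' reuses cached keys and values, while repeatedly calling 'forward' on the growing sequence recomputes attention over the full prefix every step. A toy cost model (unit = one query-key dot product, single layer), not a real benchmark:

```python
# Toy attention-cost model: total query-key dot products to decode
# seq_len tokens, with and without a KV cache. Illustrative only.

def attn_dot_products(seq_len, cached):
    if cached:
        # each new token attends once to every token so far: O(n^2) total
        return sum(t for t in range(1, seq_len + 1))
    # re-running attention over the whole prefix at every step: O(n^3) total
    return sum(t * t for t in range(1, seq_len + 1))

n = 512
print(attn_dot_products(n, cached=True))   # -> 131328
print(attn_dot_products(n, cached=False))  # -> 44870400
```

Under this model, uncached decoding of 512 tokens does roughly 340x the attention work, which is consistent with 'forward' loops feeling much slower than 'generate' with use_cache enabled.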

### Your current environment INFO 06-20 10:55:55 [__init__.py:244] Automatically detected platform cuda. Collecting environment information... ============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04)...

bug