施庆章

Results: 25 comments by 施庆章

Hello! For T5 inference on a GPU right now, are onnxruntime and pytorch roughly the same speed? During decoding, the past key/values produced at every decode step are very large; how can I reduce the IO when running inference with onnxruntime?
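A minimal sketch of the usual approach to that IO question, assuming a T5 decoder exported to ONNX with past key/values: bind the present/past tensors to the GPU with onnxruntime's IO binding so they are never copied back to the host between decode steps. The model path and input name below are placeholders, not from the thread.

```python
import numpy as np
import onnxruntime as ort

# Placeholder path; input/output names depend on how the T5 decoder was exported.
sess = ort.InferenceSession("t5_decoder_with_past.onnx",
                            providers=["CUDAExecutionProvider"])

binding = sess.io_binding()

# Small inputs can still be copied from host memory each step.
binding.bind_cpu_input("input_ids", np.array([[0]], dtype=np.int64))

# Bind every output (logits and present key/values) to CUDA so ORT allocates them
# on the GPU instead of copying them to the host after each decode step.
for out in sess.get_outputs():
    binding.bind_output(out.name, device_type="cuda", device_id=0)

sess.run_with_iobinding(binding)

# Outputs stay on the GPU as OrtValues; feed them back with bind_ortvalue_input
# as the past key/values of the next decode step.
present = binding.get_outputs()
```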

> @byshiue Hi, I tested here. > > https://github.com/NVIDIA/TensorRT-LLM/blob/3d56a445e8ebf888e78be638faf6beec0a78f3c2/tensorrt_llm/runtime/model_runner_cpp.py#L166 > > and I added `enable_block_reuse=True,` to the `KvCacheConfig` building process, but the profiles show the same results for `enable_block_reuse=True` and `enable_block_reuse=False`....
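For context, the change being tested reduces to something like the sketch below. Only `enable_block_reuse=True` comes from the comment above; the import path and any other constructor arguments vary across TensorRT-LLM versions and are assumptions here.

```python
# Sketch of the patch inside model_runner_cpp.py; import path and defaults are assumed.
from tensorrt_llm.bindings import KvCacheConfig

kv_cache_config = KvCacheConfig(
    enable_block_reuse=True,  # the flag whose effect is being profiled
)
```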

I guess the chat_input (`tokenizer.build_chat_input("你是谁?", history=[], role="user").input_ids.cuda()`) is different from the input_ids (`tokenizer("你是谁?", return_tensors="pt").input_ids.cuda()`).
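A quick way to confirm that, assuming a ChatGLM-style tokenizer that exposes `build_chat_input` (the model name below is an assumption), is to print both id sequences side by side:

```python
from transformers import AutoTokenizer

# Assumed model; build_chat_input is the chat API referenced in the comment above.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)

chat_ids = tokenizer.build_chat_input("你是谁?", history=[], role="user").input_ids
plain_ids = tokenizer("你是谁?", return_tensors="pt").input_ids

# The chat variant wraps the text in role/special tokens, so the two sequences differ.
print(chat_ids)
print(plain_ids)
```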

> Trying gptq and awq quantization of Mixtral-8x7B-Instruct-v0.1 gave quite different performance. > > ## GPU > A40 48G VRAM > > ## vLLM version > `0.2.6` > > The...

I tried gptq, squeezellm, and awq quantization of llama7b and got quite different performance.

GPU: A100 80G VRAM
vLLM version: new version

AWQ
![image](https://github.com/vllm-project/vllm/assets/44834482/78b00a29-88d6-4391-ad43-60a24da11a32)

gptq
![image](https://github.com/vllm-project/vllm/assets/44834482/cfe25dfb-95d6-49f6-832a-e89eebee790a)

squeezellm
![image](https://github.com/vllm-project/vllm/assets/44834482/5e1d321d-11ad-4440-9b79-eab1d5893abc)

fp16
![image](https://github.com/vllm-project/vllm/assets/44834482/4dc0077a-a649-4f10-9685-a9d1b2ca9ed6)
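For reference, a comparison like the one above can be scripted roughly as follows; the model paths, prompt, and sampling settings are placeholders, and in practice each engine would be benchmarked in its own process rather than in one loop:

```python
from vllm import LLM, SamplingParams

prompts = ["Explain the difference between AWQ and GPTQ in one sentence."]
sampling = SamplingParams(temperature=0.0, max_tokens=128)

# Paths are placeholders; each checkpoint must already be quantized in the matching format.
configs = [
    ("awq",        "llama-7b-awq",        "awq"),
    ("gptq",       "llama-7b-gptq",       "gptq"),
    ("squeezellm", "llama-7b-squeezellm", "squeezellm"),
    ("fp16",       "llama-7b",            None),  # unquantized baseline
]

for name, path, quant in configs:
    llm = LLM(model=path, quantization=quant)
    outputs = llm.generate(prompts, sampling)
    print(name, outputs[0].outputs[0].text)
```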

> In @shiqingzhangCSU's bench AWQ is also faster (though a bit less so, which might be understandable given it's a smaller model). I wonder why @shiqingzhangCSU sees worse throughput for...

> Hi, would you like to post your calling code (the calling code for fastertransformer and huggingface transformers)? I'll refer to it and check my script again, thank you very...
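For what it's worth, the huggingface transformers side of such calling code is typically no more than the following greedy-decoding sketch (model name and prompt are illustrative, not the script under discussion):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model name; any causal LM checkpoint works the same way.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b", torch_dtype=torch.float16, device_map="cuda"
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
# Greedy decoding makes the output directly comparable with a FasterTransformer run.
output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```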

> How large is the model? It seems that for generative models HF and FT will just differ. Do you see many differing tokens?