施庆章

Results: 25 comments by 施庆章

Hello! For T5 inference on a GPU right now, are onnxruntime and pytorch roughly the same speed? During decoding, the past key/values produced at every decode step are very large; how can I reduce the IO when running inference with onnxruntime?
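A minimal sketch of the usual approach to that IO question, assuming a T5 decoder exported to ONNX with past key/values: bind the present/past tensors to the GPU with onnxruntime's IO binding so they are never copied back to the host between decode steps. The model path and input name below are placeholders, not from the thread.

```python
import numpy as np
import onnxruntime as ort

# Placeholder path; input/output names depend on how the T5 decoder was exported.
sess = ort.InferenceSession("t5_decoder_with_past.onnx",
                            providers=["CUDAExecutionProvider"])

binding = sess.io_binding()

# Small inputs can still be copied from host memory each step.
binding.bind_cpu_input("input_ids", np.array([[0]], dtype=np.int64))

# Bind every output (logits and present key/values) to CUDA so ORT allocates them
# on the GPU instead of copying them to the host after each decode step.
for out in sess.get_outputs():
    binding.bind_output(out.name, device_type="cuda", device_id=0)

sess.run_with_iobinding(binding)

# Outputs stay on the GPU as OrtValues; feed them back with bind_ortvalue_input
# as the past key/values of the next decode step.
present = binding.get_outputs()
```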

> @byshiue Hi, I tested here. > > https://github.com/NVIDIA/TensorRT-LLM/blob/3d56a445e8ebf888e78be638faf6beec0a78f3c2/tensorrt_llm/runtime/model_runner_cpp.py#L166 > > and I added `enable_block_reuse=True,` to the `KvCacheConfig` building process, but the profiles show the same results for `enable_block_reuse=True` and `enable_block_reuse=False`....
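For context, the change being tested reduces to something like the sketch below. Only `enable_block_reuse=True` comes from the comment above; the import path and any other constructor arguments vary across TensorRT-LLM versions and are assumptions here.

```python
# Sketch of the patch inside model_runner_cpp.py; import path and defaults are assumed.
from tensorrt_llm.bindings import KvCacheConfig

kv_cache_config = KvCacheConfig(
    enable_block_reuse=True,  # the flag whose effect is being profiled
)
```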

I guess the chat_input (`tokenizer.build_chat_input("你是谁?", history=[], role="user").input_ids.cuda()`) is different from the input_ids (`tokenizer("你是谁?", return_tensors="pt").input_ids.cuda()`).
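A quick way to confirm that, assuming a ChatGLM-style tokenizer that exposes `build_chat_input` (the model name below is an assumption), is to print both id sequences side by side:

```python
from transformers import AutoTokenizer

# Assumed model; build_chat_input is the chat API referenced in the comment above.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)

chat_ids = tokenizer.build_chat_input("你是谁?", history=[], role="user").input_ids
plain_ids = tokenizer("你是谁?", return_tensors="pt").input_ids

# The chat variant wraps the text in role/special tokens, so the two sequences differ.
print(chat_ids)
print(plain_ids)
```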

> Trying gptq and awq quantization of Mixtral-8x7B-Instruct-v0.1 gave quite different performance. > > ## GPU > A40 48G VRAM > > ## vLLM version > `0.2.6` > > The...

I tried gptq, squeezellm, and awq quantization of llama7b and got quite different performance.

GPU: A100 80G VRAM
vLLM version: new version

AWQ
![image](https://github.com/vllm-project/vllm/assets/44834482/78b00a29-88d6-4391-ad43-60a24da11a32)

gptq
![image](https://github.com/vllm-project/vllm/assets/44834482/cfe25dfb-95d6-49f6-832a-e89eebee790a)

squeezellm
![image](https://github.com/vllm-project/vllm/assets/44834482/5e1d321d-11ad-4440-9b79-eab1d5893abc)

fp16
![image](https://github.com/vllm-project/vllm/assets/44834482/4dc0077a-a649-4f10-9685-a9d1b2ca9ed6)
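For reference, a comparison like the one above can be scripted roughly as follows; the model paths, prompt, and sampling settings are placeholders, and in practice each engine would be benchmarked in its own process rather than in one loop:

```python
from vllm import LLM, SamplingParams

prompts = ["Explain the difference between AWQ and GPTQ in one sentence."]
sampling = SamplingParams(temperature=0.0, max_tokens=128)

# Paths are placeholders; each checkpoint must already be quantized in the matching format.
configs = [
    ("awq",        "llama-7b-awq",        "awq"),
    ("gptq",       "llama-7b-gptq",       "gptq"),
    ("squeezellm", "llama-7b-squeezellm", "squeezellm"),
    ("fp16",       "llama-7b",            None),  # unquantized baseline
]

for name, path, quant in configs:
    llm = LLM(model=path, quantization=quant)
    outputs = llm.generate(prompts, sampling)
    print(name, outputs[0].outputs[0].text)
```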

> In @shiqingzhangCSU's bench AWQ is also faster (though a bit less so, which might be understandable given it's a smaller model). I wonder why @shiqingzhangCSU sees worse throughput for...

> Hi, would you like to post your calling code (the calling code for fastertransformer and huggingface transformers)? I'll refer to it and check my script again, thank you very...
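For what it's worth, the huggingface transformers side of such calling code is typically no more than the following greedy-decoding sketch (model name and prompt are illustrative, not the script under discussion):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model name; any causal LM checkpoint works the same way.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b", torch_dtype=torch.float16, device_map="cuda"
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
# Greedy decoding makes the output directly comparable with a FasterTransformer run.
output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```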

> How large is the model? It seems that for generative models HF and FT will just differ. Do you see many differing tokens?