renne444 comments

Results 4 comments of renne444

INT8 Quantization Performance Issue with BERT-like Model

```
XLMRobertaModel(
  (embeddings): XLMRobertaEmbeddings(
    (word_embeddings): Embedding(250002, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (token_type_embeddings): Embedding(1, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): XLMRobertaEncoder(
    (layer): ModuleList(
      (0): XLMRobertaLayer(
        (attention):...
```

[Bug]: Qwen2.5 72B GPTQ-Int8 inference on Nvidia L20 does not behave as expected

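@jklj077 Thank you very much for the support! Following the version information in the [Dockerfile](https://github.com/QwenLM/Qwen2.5/blob/main/docker/Dockerfile-cu121), I installed the matching versions of the dependencies, including torch 2.2.2. On both the A100 and the L20, I ran the 0.5B, 7B, and 72B versions of the Qwen2.5 model with the example mentioned earlier. In every case the model runs under vllm but fails with the provided [transformers example](https://huggingface.co/Qwen/Qwen2-72B-Instruct-GPTQ-Int8#quickstart). This happens in both single-GPU and multi-GPU setups. Our A100 machine has driver version 525.105.17 and our L20 machine has driver version 535.161.07. In our cloud service environment, upgrading the driver could be very costly. Is it possible to run the provided example on the Ampere or Ada Lovelace architecture with the existing driver versions? Environment:

```
Package                  Version
------------------------ -----------
accelerate               1.0.0
aiohappyeyeballs         2.4.3
aiohttp                  3.10.9
aiosignal                1.3.1
async-timeout            4.0.3
attrs                    24.2.0
auto_gptq                0.7.1
autoawq                  0.2.5
autoawq_kernels          0.0.6...
```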

[Bug]: Qwen2.5 72B GPTQ-Int8 inference on Nvidia L20 does not behave as expected

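@jklj077 Thanks for your comment. I followed the method you mentioned and tested the **qwen2.5 0.5B-GPTQ-Int8** model with the official `qwenllm/qwen:2-cu121` image. After loading the model with `AutoGPTQForCausalLM` and enabling `use_triton`, the output was abnormal text such as `!!!!!!!!!!!!!!!!!!!!!!!!!!!`. In the tests with the Hugging Face framework, the models all produce normal output, except that the GPTQ-Int8 model prints the message `CUDA extension not installed.` during inference and runs slowly, so the problem is still unresolved. Below is my detailed code along with the output of each step. All of it was run in the `qwenllm/qwen:2-cu121` image, and the operating system and hardware environment are the same as described above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM
import os

# Absolute paths to the models
int8_model_path = "/Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int8"
bf16_model_path = "/Qwen/Qwen2.5-0.5B-Instruct"
os.environ['CUDA_VISIBLE_DEVICES'] =...
```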

[Bug]: Qwen2.5 72B GPTQ-Int8 inference on Nvidia L20 does not behave as expected

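@jklj077 Running with vllm==0.4.3 inside the docker image behaves as expected, and I also tried vllm==0.6.1 post1 without any problems. Both the 72B and the 0.5B models produce the expected results. However, because many open-source tools depend on the Hugging Face implementation, I still hope this issue can be resolved. Below is the output from inside the docker image, run with the official [offline inference code](https://qwen.readthedocs.io/zh-cn/latest/deployment/vllm.html#offline-batched-inference); the log looks normal:

```text
root@0539ee11880b:/code/llm_deploy/demo# python vllm_generate.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 11-08 06:44:27...
```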