Is there an existing issue / discussion for this?

[X] I have searched the existing issues / discussions

Is there an existing answer for this in FAQ?

[X] I have searched FAQ

Current Behavior

How can I get concurrent requests working? I currently have a single physical machine with one 24 GB GPU. That is not a lot of resources, but I would like to support at least 2 concurrent requests.

[Two screenshots attached: 微信截图_20240619180359, 微信截图_20240619180246]

Expected Behavior

No response

Steps To Reproduce

No response

Environment

- OS:
- Python: 3.10
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 12.1

Anything else?

No response

Jun 19 '24 10:06 geminizyz

Do you know which quantization method was used to produce minicpm-llama3-v-2_5(int4)?

Jun 20 '24 03:06 1SingleFeng

> Do you know which quantization method was used to produce minicpm-llama3-v-2_5(int4)?

BnB

Jun 20 '24 09:06 weiminw

> > Do you know which quantization method was used to produce minicpm-llama3-v-2_5(int4)?
>
> BnB

Do you mean BitsAndBytes?

Jun 20 '24 09:06 1SingleFeng

I am running into the same problem.

Jun 20 '24 09:06 weiminw

How can this be done?

Jul 04 '24 06:07 chuangzhidan

It is described in this document: https://modelbest.feishu.cn/wiki/O0KTwQV5piUPzTkRXl9cSFyHnQb?from=from_copylink

```python
import os
import time

import GPUtil
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

model_path = '/root/ld/ld_model_pretrain/MiniCPM-Llama3-V-2_5'  # path to the downloaded model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
save_path = '/root/ld/ld_model_pretrain/MiniCPM-Llama3-V-2_5_int4'  # where the quantized model is saved
image_path = '/root/ld/ld_project/MiniCPM-V/assets/airplane.jpeg'

# Create a config object that specifies the quantization parameters
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,  # use 4-bit quantization
    load_in_8bit=False,  # do not use 8-bit quantization
    bnb_4bit_compute_dtype=torch.float16,  # compute precision
    bnb_4bit_quant_storage=torch.uint8,  # storage format of the quantized weights
    bnb_4bit_quant_type="nf4",  # quantization format: NF4, the normal-distribution 4-bit type
    bnb_4bit_use_double_quant=True,  # double quantization: also quantize the zero-point and scaling parameters
    llm_int8_enable_fp32_cpu_offload=False,  # with int8, whether parameters kept on the CPU stay in fp32
    llm_int8_has_fp16_weight=False,  # whether to enable mixed precision (16-bit main weights)
    llm_int8_skip_modules=["out_proj", "kv_proj", "lm_head"],  # modules that are not quantized
    llm_int8_threshold=6.0,  # outlier threshold in the LLM.int8() algorithm; decides which values get quantized
)

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    device_map="cuda:0",  # place the model on GPU 0
    quantization_config=quantization_config,
    trust_remote_code=True,
)
gpu_usage = GPUtil.getGPUs()[0].memoryUsed  # GPU memory in MB, read right after loading

start = time.time()
response = model.chat(
    image=Image.open(image_path).convert("RGB"),
    msgs=[{"role": "user", "content": "What is in this picture?"}],
    tokenizer=tokenizer,
)  # run inference
print('Output after quantization:', response)
print('Time after quantization:', time.time() - start)
print(f"GPU memory used after quantization: {round(gpu_usage / 1024, 2)}GB")

# Save the model and the tokenizer
os.makedirs(save_path, exist_ok=True)
model.save_pretrained(save_path, safe_serialization=True)
tokenizer.save_pretrained(save_path)
```
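To put a number on what `bnb_4bit_use_double_quant` saves, here is a back-of-the-envelope sketch, not a measurement. It assumes the block sizes described in the QLoRA paper: one 32-bit scaling constant per block of 64 weights, and, with double quantization, those constants stored in 8 bits plus one extra 32-bit constant per 256 of them. The function name and its parameters are made up for illustration and are not part of any library.

```python
def overhead_bits_per_param(block_size=64, const_bits=32,
                            double_quant=False, dq_bits=8, dq_block_size=256):
    """Extra bits per weight spent on quantization constants (illustrative only)."""
    if not double_quant:
        # One full-precision constant shared by every block of weights.
        return const_bits / block_size
    # First-level constants shrink to dq_bits each; a second-level
    # full-precision constant is shared by dq_block_size of them.
    return dq_bits / block_size + const_bits / (block_size * dq_block_size)


plain = overhead_bits_per_param()
double = overhead_bits_per_param(double_quant=True)

print(f"overhead without double quantization: {plain} bits/param")           # 0.5
print(f"overhead with double quantization:    {round(double, 3)} bits/param")  # 0.127
print(f"total bits per weight: {4 + plain} vs {round(4 + double, 3)}")
```

Under these assumptions the saving is about 0.37 bits per weight, which is small next to the 4 bits of the weight itself but adds up across billions of parameters.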

Jul 12 '24 12:07 LDLINGLINGLING

Does running minicpm-llama3-v-2_5(int4) support concurrent API calls? With 2 or more concurrent calls it raises an error.
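For the concurrency error itself, a minimal sketch of one common workaround follows. It assumes the error comes from two requests calling `model.chat` on the same model object at the same time, which the thread does not confirm. The idea is to guard the model with a lock so that concurrent requests wait their turn instead of failing. `FakeModel`, `SerializedModel`, and `run_concurrent` are hypothetical names; `FakeModel` only stands in for the real model so the sketch runs without a GPU.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class FakeModel:
    """Hypothetical stand-in for the real model; tracks overlapping chat() calls."""

    def __init__(self):
        self._active = 0
        self.max_active = 0
        self._counter_lock = threading.Lock()

    def chat(self, question):
        with self._counter_lock:
            self._active += 1
            self.max_active = max(self.max_active, self._active)
        time.sleep(0.01)  # pretend to run inference
        with self._counter_lock:
            self._active -= 1
        return f"answer to: {question}"


class SerializedModel:
    """Wraps a model so that only one chat() call runs at a time."""

    def __init__(self, model):
        self._model = model
        self._lock = threading.Lock()

    def chat(self, *args, **kwargs):
        with self._lock:  # other callers block here until the current call finishes
            return self._model.chat(*args, **kwargs)


def run_concurrent(questions, workers=2):
    """Send all questions from `workers` threads; return answers and peak overlap."""
    fake = FakeModel()
    served = SerializedModel(fake)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        answers = list(pool.map(served.chat, questions))
    return answers, fake.max_active


if __name__ == "__main__":
    answers, peak = run_concurrent(["q1", "q2", "q3", "q4"], workers=4)
    print(answers)
    print("calls running at the same time (peak):", peak)
```

This keeps a server responsive to several clients, but requests are still processed one after another, so it does not raise throughput. Real parallelism would need batching or a dedicated inference server, which is outside what this thread covers.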
