Garbled replies when running model inference on multiple GPUs
Problem Description: when the model's VRAM is shared across multiple GPUs, the replies come out garbled.
Environment Information
- langchain-chatchat version / commit: 0.2.10
- Deployed with Docker (yes/no): no
- Model(s) used (ChatGLM2-6B / Qwen-7B, etc.): ChatGLM3-6B-32k, Qwen-72B-Int4-Chat
- Embedding model used (moka-ai/m3e-base, etc.): bge-large-zh-v1.5
- Vector store type (faiss / milvus / pg_vector, etc.): faiss
- Operating system and version: tried both Ubuntu 20.04 and openEuler 22.04; neither works
- Python version: tried 3.11, 3.10, and 3.9
- Other relevant environment information: 3x RTX 4090, CUDA 12.1
The chat UI looks like this: (screenshot)
The console output looks like this: (screenshot)
The server_config.py configuration is: (screenshot)
This is the only part I changed.
Could you please take a look at what is going wrong and which settings need to be changed?
A single GPU works fine.
Based on the information you've provided and the similar issues I found in the Langchain-Chatchat repository, it seems like the problem you're experiencing might be related to the configuration for multi-GPU support, particularly in the handling of tensor parallelism and memory allocation.
Here are some steps you can take to resolve this issue:
- **Tensor Parallel Size and Number of GPUs**: Ensure that `args.tensor_parallel_size` matches the number of GPUs you intend to use (`args.num_gpus`). This setting ensures that model parallelism is correctly configured to distribute the workload evenly across the GPUs.

  ```python
  if args.num_gpus > 1:
      args.tensor_parallel_size = args.num_gpus
  ```

- **Memory Management**: Check settings such as `args.swap_space`, `args.gpu_memory_utilization`, and `args.max_gpu_memory`. These are critical for managing how memory is allocated and used across GPUs. Ensure they are set to values that match your system's capabilities and your model's requirements, to prevent memory overflow or inefficient memory usage.

  ```python
  args.swap_space = 4                 # GiB
  args.gpu_memory_utilization = 0.90
  args.max_gpu_memory = "22GiB"
  ```

- **Concurrency and Batch Sizes**: The `args.limit_worker_concurrency` and `args.max_num_batched_tokens` settings control how many concurrent operations can run and how large batches can be. Adjusting these values can help manage the load on each GPU and prevent garbled responses due to overloading.

  ```python
  args.limit_worker_concurrency = 5
  args.max_num_batched_tokens = None  # Adjust based on your GPU capabilities
  ```
You can find these configurations in the startup.py file in the Langchain-Chatchat repository.
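Putting the three adjustments together, here is a minimal, self-contained sketch. It assumes the worker arguments are exposed as a simple namespace (as they are in startup.py); the attribute names are taken from the snippets above, but the helper `tune_multi_gpu_args` and the concrete values are illustrative, not part of the project.

```python
# Sketch only: combines the three multi-GPU tweaks discussed above.
# `tune_multi_gpu_args` is a hypothetical helper; the attribute names mirror
# the worker args used in startup.py, the values are examples for 24 GB cards.
from argparse import Namespace


def tune_multi_gpu_args(args: Namespace) -> Namespace:
    # 1. Spread tensor parallelism across every GPU given to this worker.
    if getattr(args, "num_gpus", 1) and args.num_gpus > 1:
        args.tensor_parallel_size = args.num_gpus

    # 2. Conservative memory settings (e.g. for 3x RTX 4090, 24 GB each).
    args.swap_space = 4                  # GiB of CPU swap space per GPU
    args.gpu_memory_utilization = 0.90   # leave ~10% VRAM headroom
    args.max_gpu_memory = "22GiB"        # per-GPU cap

    # 3. Keep concurrency modest while debugging garbled output.
    args.limit_worker_concurrency = 5
    args.max_num_batched_tokens = None   # let the backend choose a default
    return args


if __name__ == "__main__":
    # Example: a worker that was handed three GPUs.
    args = tune_multi_gpu_args(Namespace(num_gpus=3, tensor_parallel_size=1))
    print(vars(args))
```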
Regarding the `infer_turbo` parameter in the model worker configuration: it specifies which inference acceleration framework to use. When set to `'vllm'`, inference is routed through vLLM, a high-throughput inference engine. However, the configuration comments note that using vLLM can lead to communication issues with HuggingFace, and that vLLM support for some models is not yet mature, which is why it is disabled (`False`) by default. How effective `infer_turbo` is in a multi-GPU setup depends on its compatibility with the specific models you are using and on how well it can distribute computation across the available GPUs.
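For reference, this is roughly what the relevant worker entry in configs/server_config.py looks like when several GPUs are assigned. The key names follow the 0.2.x server_config.py.example linked below; the GPU indices, memory cap, and the `"vllm"` toggle shown here are placeholders illustrating the switch the later comments mention, not recommended defaults.

```python
# Sketch of the relevant part of configs/server_config.py; key names follow
# the 0.2.x server_config.py.example, values are assumptions for a
# 3x RTX 4090 machine.
FSCHAT_MODEL_WORKERS = {
    "default": {
        "host": "127.0.0.1",
        "port": 20002,
        "device": "cuda",
        # False disables acceleration; "vllm" routes inference through vLLM.
        # Several commenters below report multi-GPU only works for them
        # with the vLLM backend enabled.
        "infer_turbo": "vllm",
        "gpus": "0,1,2",           # GPU indices visible to this worker
        "num_gpus": 3,             # number of GPUs to use
        "max_gpu_memory": "22GiB", # per-GPU memory cap
    },
}
```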
Please try these suggestions and let me know if they help resolve the issue. If the problem persists, it would be helpful to have more detailed error logs or screenshots to better understand and address the issue.
Sources
- [BUG] Garbled Q&A replies; the project starts normally but nothing is matched from the knowledge base
- [BUG] Runs fine on a single GPU; multi-GPU run fails with TypeError: Caught exception: object of type 'NoneType' has no len()
- After updating to the latest code, I also hit this problem: TypeError: 'NoneType' object is not iterable
- [BUG] vllm multi-GPU startup reports success, but only one GPU is actually used
- Using Qwen-72B-Chat, after starting the service, Q&A fails with: API communication error
- [BUG] Concise description of the issue
- startup.py
- configs/server_config.py.example
Has anyone else run into this? Multi-GPU works with vLLM, but with fschat multi-GPU the output is garbled.
Same here, +1. With a multi-GPU deployment I either get garbled output or backend errors.
Same situation here, using the official Docker deployment.
But inside WSL2 with CUDA 11.8, fschat multi-GPU works fine.
Use vLLM as the backend.