Garbled replies when running model inference on multiple GPUs
Problem Description: when the model's VRAM is shared across multiple GPUs, the replies come out garbled.
Environment Information
- langchain-chatchat version / commit: 0.2.10
- Deployed with Docker (yes/no): no
- Model(s) used (ChatGLM2-6B / Qwen-7B, etc.): ChatGLM3-6B-32k, Qwen-72B-Int4-Chat
- Embedding model used (moka-ai/m3e-base, etc.): bge-large-zh-v1.5
- Vector store type (faiss / milvus / pg_vector, etc.): faiss
- Operating system and version: tried both Ubuntu 20.04 and openEuler 22.04; neither works
- Python version: tried 3.11, 3.10, and 3.9
- Other relevant environment information: 3x RTX 4090, CUDA 12.1
The chat UI looks like this: (screenshot)
The console output looks like this: (screenshot)
The server_config.py configuration is: (screenshot)
This is the only part I changed.
Could you please take a look at what is going wrong and which settings need to be changed?
A single GPU works fine.
Based on the information you've provided and the similar issues I found in the Langchain-Chatchat repository, it seems like the problem you're experiencing might be related to the configuration for multi-GPU support, particularly in the handling of tensor parallelism and memory allocation.
Here are some steps you can take to resolve this issue:
- **Tensor Parallel Size and Number of GPUs**: Ensure that `args.tensor_parallel_size` matches the number of GPUs you intend to use (`args.num_gpus`). This setting ensures that model parallelism is correctly configured to distribute the workload evenly across the GPUs.

  ```python
  if args.num_gpus > 1:
      args.tensor_parallel_size = args.num_gpus
  ```

- **Memory Management**: Check settings such as `args.swap_space`, `args.gpu_memory_utilization`, and `args.max_gpu_memory`. These are critical for managing how memory is allocated and used across GPUs. Ensure they are set to values that match your system's capabilities and your model's requirements, to prevent memory overflow or inefficient memory usage.

  ```python
  args.swap_space = 4                 # GiB
  args.gpu_memory_utilization = 0.90
  args.max_gpu_memory = "22GiB"
  ```

- **Concurrency and Batch Sizes**: The `args.limit_worker_concurrency` and `args.max_num_batched_tokens` settings control how many concurrent operations can run and how large batches can be. Adjusting these values can help manage the load on each GPU and prevent garbled responses due to overloading.

  ```python
  args.limit_worker_concurrency = 5
  args.max_num_batched_tokens = None  # Adjust based on your GPU capabilities
  ```
You can find these configurations in the startup.py file in the Langchain-Chatchat repository.
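Putting the three adjustments together, here is a minimal, self-contained sketch. It assumes the worker arguments are exposed as a simple namespace (as they are in startup.py); the attribute names are taken from the snippets above, but the helper `tune_multi_gpu_args` and the concrete values are illustrative, not part of the project.

```python
# Sketch only: combines the three multi-GPU tweaks discussed above.
# `tune_multi_gpu_args` is a hypothetical helper; the attribute names mirror
# the worker args used in startup.py, the values are examples for 24 GB cards.
from argparse import Namespace


def tune_multi_gpu_args(args: Namespace) -> Namespace:
    # 1. Spread tensor parallelism across every GPU given to this worker.
    if getattr(args, "num_gpus", 1) and args.num_gpus > 1:
        args.tensor_parallel_size = args.num_gpus

    # 2. Conservative memory settings (e.g. for 3x RTX 4090, 24 GB each).
    args.swap_space = 4                  # GiB of CPU swap space per GPU
    args.gpu_memory_utilization = 0.90   # leave ~10% VRAM headroom
    args.max_gpu_memory = "22GiB"        # per-GPU cap

    # 3. Keep concurrency modest while debugging garbled output.
    args.limit_worker_concurrency = 5
    args.max_num_batched_tokens = None   # let the backend choose a default
    return args


if __name__ == "__main__":
    # Example: a worker that was handed three GPUs.
    args = tune_multi_gpu_args(Namespace(num_gpus=3, tensor_parallel_size=1))
    print(vars(args))
```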
Regarding the `infer_turbo` parameter in the model worker configuration: it specifies which inference acceleration framework to use. When set to `'vllm'`, inference is routed through vLLM, a high-throughput inference engine. However, the configuration comments note that using vLLM can lead to communication issues with HuggingFace, and that vLLM support for some models is not yet mature, which is why it is disabled (`False`) by default. How effective `infer_turbo` is in a multi-GPU setup depends on its compatibility with the specific models you are using and on how well it can distribute computation across the available GPUs.
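For reference, this is roughly what the relevant worker entry in configs/server_config.py looks like when several GPUs are assigned. The key names follow the 0.2.x server_config.py.example linked below; the GPU indices, memory cap, and the `"vllm"` toggle shown here are placeholders illustrating the switch the later comments mention, not recommended defaults.

```python
# Sketch of the relevant part of configs/server_config.py; key names follow
# the 0.2.x server_config.py.example, values are assumptions for a
# 3x RTX 4090 machine.
FSCHAT_MODEL_WORKERS = {
    "default": {
        "host": "127.0.0.1",
        "port": 20002,
        "device": "cuda",
        # False disables acceleration; "vllm" routes inference through vLLM.
        # Several commenters below report multi-GPU only works for them
        # with the vLLM backend enabled.
        "infer_turbo": "vllm",
        "gpus": "0,1,2",           # GPU indices visible to this worker
        "num_gpus": 3,             # number of GPUs to use
        "max_gpu_memory": "22GiB", # per-GPU memory cap
    },
}
```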
Please try these suggestions and let me know if they help resolve the issue. If the problem persists, it would be helpful to have more detailed error logs or screenshots to better understand and address the issue.
Sources
- [BUG] Garbled Q&A replies; the project starts normally but nothing is matched from the knowledge base
- [BUG] Runs fine on a single GPU; multi-GPU run fails with TypeError: Caught exception: object of type 'NoneType' has no len()
- After updating to the latest code, I also hit this problem: TypeError: 'NoneType' object is not iterable
- [BUG] vllm multi-GPU startup reports success, but only one GPU is actually used
- Using Qwen-72B-Chat, after starting the service, Q&A fails with: API communication error
- [BUG] Concise description of the issue
- startup.py
- configs/server_config.py.example
Has anyone else run into this? Multi-GPU works with vLLM, but with fschat multi-GPU the output is garbled.
Same here, +1. With a multi-GPU deployment I either get garbled output or backend errors.
Same situation here, using the official Docker deployment.
But inside WSL2 with CUDA 11.8, fschat multi-GPU works fine.
Use vLLM as the backend.