
Garbled model replies with multi-GPU inference

Open · Wimeremce7 opened this issue 1 year ago · 3 comments

Problem Description: When splitting the model's VRAM across multiple GPUs, the replies are garbled.

Environment Information

  • langchain-chatchat version / commit: 0.2.10
  • Deployed with Docker (yes/no): no
  • Model(s) used (ChatGLM2-6B / Qwen-7B, etc.): ChatGLM3-6B-32k, Qwen-72B-Int4-Chat
  • Embedding model used (moka-ai/m3e-base, etc.): bge-large-zh-v1.5
  • Vector store type (faiss / milvus / pg_vector, etc.): faiss
  • Operating system and version: tried both Ubuntu 20.04 and openEuler 22.04; neither works
  • Python version: tried 3.11, 3.10, and 3.9
  • Other relevant environment information: 3x RTX 4090, CUDA 12.1

The reply in the web UI looks like this: [screenshot] The console output looks like this: [screenshot]

The server_config.py configuration: [screenshot] Only this part was changed: [screenshot]

Could someone help me figure out what is going wrong and which settings need to change?
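For context, the multi-GPU knobs sit in the FSCHAT_MODEL_WORKERS section of configs/server_config.py in 0.2.x. A sketch with illustrative values for a 3x4090 machine (key names follow the 0.2.x defaults; the values here are examples, not recommendations):

    FSCHAT_MODEL_WORKERS = {
        "default": {
            "device": "cuda",
            "gpus": "0,1,2",            # comma-separated CUDA device ids
            "num_gpus": 3,              # should match the number of ids in "gpus"
            "max_gpu_memory": "20GiB",  # per-GPU cap; leave headroom below 24 GiB
        },
    }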

Wimeremce7 · Mar 07 '24

A single GPU works fine.

Wimeremce7 · Mar 07 '24

Based on the information you've provided and similar issues in the Langchain-Chatchat repository, the problem is likely related to the multi-GPU configuration, particularly tensor parallelism and memory allocation.

Here are some steps you can take to resolve this issue:

  1. Tensor Parallel Size and Number of GPUs: Make sure args.tensor_parallel_size matches the number of GPUs you intend to use (args.num_gpus), so the model is sharded evenly across them:

    # Shard the model across every available GPU.
    if args.num_gpus > 1:
        args.tensor_parallel_size = args.num_gpus
    
  2. Memory Management: Check args.swap_space, args.gpu_memory_utilization, and args.max_gpu_memory; these govern how memory is allocated and used across GPUs. Set them to values that match your hardware and your model's requirements to prevent memory overflow or inefficient usage:

    args.swap_space = 4  # CPU swap space per GPU, in GiB
    args.gpu_memory_utilization = 0.90  # fraction of each GPU's VRAM to claim
    args.max_gpu_memory = "22GiB"  # per-GPU cap; leaves headroom on a 24 GiB 4090
    
  3. Concurrency and Batch Sizes: args.limit_worker_concurrency and args.max_num_batched_tokens control how many requests run concurrently and how large batches may grow. Tuning these down keeps each GPU from being overloaded, which can surface as garbled responses:

    args.limit_worker_concurrency = 5  # max simultaneous requests per worker
    args.max_num_batched_tokens = None  # None lets the backend choose; lower it if batches overflow VRAM
    

You can find these configurations in the startup.py file in the Langchain-Chatchat repository.
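For a sense of where those arguments end up: with infer_turbo set to 'vllm', the worker forwards them to vLLM's engine arguments. A minimal sketch, assuming vLLM's Python API (the model path and values are illustrative):

    from vllm.engine.arg_utils import AsyncEngineArgs
    from vllm.engine.async_llm_engine import AsyncLLMEngine

    engine_args = AsyncEngineArgs(
        model="Qwen/Qwen-72B-Chat-Int4",  # illustrative model path
        tensor_parallel_size=2,           # must evenly divide the model's attention head count
        gpu_memory_utilization=0.90,      # fraction of each GPU's VRAM vLLM may claim
        swap_space=4,                     # CPU swap space per GPU, in GiB
    )
    engine = AsyncLLMEngine.from_engine_args(engine_args)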

Regarding the infer_turbo parameter in the FastAPI server configuration: it selects the inference acceleration framework. When set to 'vllm', inference is routed through vLLM, a high-throughput LLM inference engine. The configuration comments note that vLLM may have communication issues with HuggingFace and that its support for some models is not yet mature, which is why it is disabled (False) by default. Whether infer_turbo helps in a multi-GPU context depends on vLLM's compatibility with the specific model and on how well it distributes the computation across the available GPUs.
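For reference, the flag lives in the same worker config sketched earlier; a minimal illustration, assuming the 0.2.x key names (comments paraphrased):

    FSCHAT_MODEL_WORKERS = {
        "default": {
            # False = plain fschat/transformers inference;
            # "vllm" = route inference through the vLLM engine.
            "infer_turbo": False,
        },
    }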

Please try these suggestions and let me know if they help resolve the issue. If the problem persists, it would be helpful to have more detailed error logs or screenshots to better understand and address the issue.


About Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

dosubot[bot] · Mar 07 '24

Has anyone else run into this? Multi-GPU works with vllm, but multi-GPU with fschat produces garbled output.

Wimeremce7 · Mar 13 '24

Running into the same thing, +1. With multi-GPU deployment it's either garbled output or backend errors.

hyxrpk · Mar 20 '24

Same situation here, with the official Docker deployment.

But fschat multi-GPU works fine under WSL2 + CUDA 11.8.

Zephyr69 · Apr 15 '24

Use vllm as the backend.
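In the config sketched above, that is a one-line change (vLLM requires Linux and a model vLLM supports):

    "infer_turbo": "vllm",  # was False; routes this worker's inference through vLLM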

zRzRzRzRzRzRzR · May 05 '24