Langchain-Chatchat Knowledge Base Q&A

Problem Description: For knowledge base Q&A, I deployed qwen-1.5-7B-chat and qwen-1.5-14B-chat locally, but the answer quality is poor, and qwen-1.5-72B-chat runs too slowly.

Could anyone recommend an LLM that works well for this?

Apr 25 '24 07:04 zixiaotan21

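When choosing an LLM for knowledge base Q&A, there are several options besides the qwen series you have already tried, and some may suit your needs better in terms of answer quality and speed. The following are worth trying:

"internlm2-chat-20b" from the internlm project.
"Orion-14B-Chat-Plugin" from OrionStarAI.
"Yi-34B-Chat" from 01-ai.
"falcon-40b" from tiiuae.
"agentlm-70b" from THUDM.
"baichuan2-13b-chat" from baichuan-inc.
"open_llama_13b" from openlm-research.
"vicuna-13b-v1.5" from lmsys.
"mpt-30b" from mosaicml.
"opt-66b" and "opt-iml-max-30b" from Facebook.

To improve inference speed, consider tuning the configuration. For example, setting LLM_DEVICE to "auto" or "cuda" (if you have an NVIDIA GPU) lets the model run on the best available hardware. In addition, adjust MAX_TOKENS to match your needs and hardware capacity; a smaller token limit shortens response time.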
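As a rough illustration of what an "auto" device setting usually amounts to, the sketch below picks the first available backend. The helper `resolve_device` and its boolean arguments are hypothetical and written only for this example; Langchain-Chatchat's own detection logic may differ, and the MAX_TOKENS value shown is an assumed number, not a recommendation.

```python
def resolve_device(requested: str, cuda_available: bool, mps_available: bool = False) -> str:
    """Map a requested device string to a concrete device name.

    Hypothetical helper for illustration only; not Langchain-Chatchat code.
    """
    if requested != "auto":
        return requested  # an explicit choice such as "cuda" or "cpu" is kept as-is
    if cuda_available:
        return "cuda"     # prefer an NVIDIA GPU when one is present
    if mps_available:
        return "mps"      # Apple Silicon fallback
    return "cpu"


# Settings in the spirit of the advice above (values are assumptions).
LLM_DEVICE = "auto"
MAX_TOKENS = 1024  # a lower cap on generated tokens gives shorter, faster replies

print(resolve_device(LLM_DEVICE, cuda_available=True))
```

If PyTorch is installed, `torch.cuda.is_available()` can supply the `cuda_available` argument.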

I hope these suggestions help you find an LLM that fits your needs.

Apr 25 '24 07:04 dosubot[bot]

@dosu

How should the MAX_TOKENS parameter be set?

Apr 25 '24 07:04 zixiaotan21

To set the MAX_TOKENS parameter effectively, balancing between response time and answer quality, consider the following:

- Model and Hardware Limits: Choose a MAX_TOKENS value that your hardware can handle efficiently, especially if using larger models like qwen-1.5-14B-chat. Smaller models or better hardware allow for higher MAX_TOKENS.
- Response Time vs. Quality:
  - For quicker responses, reduce MAX_TOKENS. This decreases computation but may limit response depth.
  - For more detailed answers, increase MAX_TOKENS. This requires more processing time.
- Optimization Techniques: Use model optimization strategies like quantization and pruning to improve processing times without greatly affecting quality.
- Dynamic Adjustment: Implement logic to adjust MAX_TOKENS based on query complexity or system load, optimizing for performance or detail as needed.
- Monitoring and Adjustments: Continuously monitor performance and adjust MAX_TOKENS based on feedback to find the optimal balance for your application.

Start with a MAX_TOKENS value that aligns with your hardware capabilities and adjust based on performance observations and application needs.
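One way to make the "dynamic adjustment" idea concrete is to cap generation by whatever room is left in the context window after the prompt. In a knowledge base setup the prompt also carries the retrieved chunks, so it can be long. The sketch below is only an illustration: `choose_max_tokens`, the safety margin, and the window sizes are assumptions for this example, not values taken from Langchain-Chatchat or from any specific model.

```python
def choose_max_tokens(context_window: int, prompt_tokens: int,
                      desired: int, safety_margin: int = 64) -> int:
    """Return a generation cap that fits in the remaining context window.

    Hypothetical helper: all sizes are token counts supplied by the caller.
    """
    if min(context_window, prompt_tokens, desired, safety_margin) < 0:
        raise ValueError("token counts must be non-negative")
    # Room left after the prompt (question + retrieved chunks) and a small buffer.
    remaining = context_window - prompt_tokens - safety_margin
    # Never ask for more than is wanted, and never return a negative cap.
    return max(0, min(desired, remaining))


# Assumed numbers for illustration only.
print(choose_max_tokens(context_window=8192, prompt_tokens=3000, desired=2048))
print(choose_max_tokens(context_window=4096, prompt_tokens=3500, desired=2048))
```

A returned value of 0 signals that the prompt already fills the window, in which case retrieving fewer or shorter chunks is likely the better fix.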

Apr 25 '24 08:04 dosubot[bot]

This is most likely not a problem with the model; it is a problem with how your knowledge base is structured and configured.

Apr 25 '24 08:04 Zephyr69

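My knowledge base looks like this:

# Question 1

## Question
XXX

## Answer
XXX

# Question 2

## Question
XXX

## Answer
XXX

For the text splitter I use MarkdownHeaderTextSplitter, which separates the individual Q&A pairs so that each pair (QA1, QA2) ends up as its own vector.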
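To show the kind of chunking being described, here is a minimal plain-Python stand-in that starts a new chunk at every top-level `# ` header. It is a simplified sketch and does not call LangChain's `MarkdownHeaderTextSplitter`; the function name `split_by_top_header` and the sample text are made up for this example.

```python
from typing import Dict, List


def split_by_top_header(markdown_text: str) -> List[Dict[str, str]]:
    """Split markdown into one chunk per level-1 header ("# " prefix).

    Simplified stand-in for a header-based splitter; illustration only.
    """
    chunks: List[Dict[str, object]] = []
    current = None
    for line in markdown_text.splitlines():
        if line.startswith("# "):
            # A new top-level header opens a new Q&A chunk.
            current = {"header": line[2:].strip(), "lines": []}
            chunks.append(current)
        elif current is not None:
            # "## Question" / "## Answer" and body text stay inside the chunk.
            current["lines"].append(line)
    return [
        {"header": c["header"], "content": "\n".join(c["lines"]).strip()}
        for c in chunks
    ]


sample = """# Question 1

## Question
How do I reset my password?

## Answer
Use the reset link on the login page.

# Question 2

## Question
How do I change my email?

## Answer
Edit it under account settings.
"""

for chunk in split_by_top_header(sample):
    print(chunk["header"], "->", len(chunk["content"]), "characters")
```

Each resulting chunk keeps its question and answer together, which is what lets a query for Question 1 retrieve Answer 1 as a unit.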

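When I actually ask Question 1, Answer 1 is indeed retrieved, but the generated reply does not stick to Answer 1 and adds some fabricated content. Is that because my model's reasoning ability is insufficient?

As for my setup, I am running on a 4090 GPU, and the model I am currently using is qwen-1.5-14B-chat.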

Apr 25 '24 08:04 zixiaotan21

If the model is adding content of its own, you can lower the temperature for this mode, to something like 0.01 or 0.1.
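For intuition on why a low temperature helps, the sketch below applies temperature scaling to a softmax over a few candidate-token scores. The logits are made-up numbers and the function name is chosen for this example; this is the standard scaling technique, not code from any particular inference backend.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert raw scores to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, 0.5]  # made-up scores for three candidate tokens
for t in (1.0, 0.1, 0.01):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature drops, probability mass concentrates on the highest-scoring token, so sampling becomes close to deterministic. That reduces random drift away from the retrieved answer, though it does not by itself guarantee that the model will stay faithful to it.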

Apr 25 '24 08:04 Zephyr69

Thanks for the suggestion! Lowering the temperature does indeed make the results a little better.

Apr 25 '24 08:04 zixiaotan21

Knowledge Base Q&A: LLM Model Recommendations Wanted
