graphrag [Feature Request]: Add a question optimizer that runs before querying

Do you need to file an issue?

[X] I have searched the existing issues and this feature is not already filed.
[ ] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
[X] I believe this is a legitimate feature request, not just a question. If this is a question, please use the Discussions area.

Is your feature request related to a problem? Please describe.

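** Preface: Thank you to the GraphRAG team for open-sourcing this project. It has given my work a leap in efficiency, including on development process documents and regulatory documents. Thanks to GraphRAG's capabilities, not only I personally but our whole team has genuinely benefited.

** Reason for the suggestion: After running GraphRAG in production for a while, we found that many relationships stated explicitly in a question are not extracted correctly, as in the following example: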

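Flight 4499, an originating flight on an A330, is carrying a VIP passenger, the UN Secretary-General; estimated departure time 14:00; crew details {Captain: Zhang San, position: pilot in command, grade: instructor; Purser: Li Si, grade: section chief; Air security team leader: Wang Wu, grade: section chief}

In this example, I had already used prompt_tune to extract the relevant entities from my documents: 4499 as the flight number, originating flight as the flight type, A330 as the aircraft type, VIP as the passenger type, UN Secretary-General as the VIP level, and so on. At query time, however, both local and global search ignored three entities: originating flight, UN Secretary-General, and air security team leader. This was unexpected in production.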

Describe the solution you'd like

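Further experiments showed that when a question contains more than 5 entity types (or is longer than about 100 characters), GraphRAG is very prone to dropping entities, typically 2 to 4 of them. Borrowing the approach of the entity_extraction prompt used to build entities, I therefore validated a new idea: optimize the user's question before querying to get better and more accurate results. The optimization looks roughly like this: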

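-Examples-
######################

Example 1:

user question: Flight 4499 is an originating flight on an A330, estimated departure time 14:00, current crew arrival time: 13:10. Does this meet the flight support standard requirements?

output: Please determine whether the following meets the flight support standard requirements: the flight is an originating flight, flight number 4499, aircraft type A330, estimated departure time 14:00, crew arrival time 13:10. Please provide a detailed assessment and conclusion.
######################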

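I have already deployed these optimizations at work. On long queries they improved the hit rate by at least 31% and brought the responses much closer to what we expect.

I am sharing this practice as a feature request in the hope that it benefits more people.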
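To make the idea concrete, here is a minimal sketch of such a pre-query optimizer, not the implementation I run in production. The names `needs_rewrite`, `build_rewrite_prompt`, and `optimize_question` are hypothetical, the instruction sentence at the top of the prompt is an assumption, and `llm` stands for any callable that takes a prompt string and returns the model's text. The 100-character gate mirrors the threshold observed above and would need tuning per language.

```python
from typing import Callable, Sequence, Tuple

# Few-shot pair taken from the example in this issue.
EXAMPLES: Sequence[Tuple[str, str]] = (
    (
        "Flight 4499 is an originating flight on an A330, estimated departure "
        "time 14:00, current crew arrival time: 13:10. Does this meet the "
        "flight support standard requirements?",
        "Please determine whether the following meets the flight support "
        "standard requirements: the flight is an originating flight, flight "
        "number 4499, aircraft type A330, estimated departure time 14:00, "
        "crew arrival time 13:10. Please provide a detailed assessment and "
        "conclusion.",
    ),
)

DELIMITER = "#" * 22


def needs_rewrite(question: str, max_chars: int = 100) -> bool:
    """Hypothetical gate: only rewrite questions of max_chars characters or more."""
    return len(question) >= max_chars


def build_rewrite_prompt(
    question: str, examples: Sequence[Tuple[str, str]] = EXAMPLES
) -> str:
    """Assemble a few-shot prompt asking the model to restate every entity explicitly."""
    parts = [
        # Assumed instruction wording; adapt it to your domain.
        "Rewrite the user question so that every entity and its type is "
        "stated explicitly. Do not add or drop facts.",
        "-Examples-",
        DELIMITER,
    ]
    for i, (q, out) in enumerate(examples, start=1):
        parts += [f"Example {i}:", f"user question: {q}", f"output: {out}", DELIMITER]
    parts += ["-Real Data-", f"user question: {question}", "output:"]
    return "\n".join(parts)


def optimize_question(question: str, llm: Callable[[str], str]) -> str:
    """Return the rewritten question, or the original if it is short or the model returns nothing."""
    if not needs_rewrite(question):
        return question
    rewritten = llm(build_rewrite_prompt(question)).strip()
    return rewritten or question
```

The string returned by `optimize_question` would then be passed to local or global search in place of the raw question. No GraphRAG API is called in this sketch.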

Additional context

No response

Aug 19 '24 07:08 shaoqing404

I have more detailed experimental findings and am happy to share them if you are interested.

Aug 19 '24 07:08 shaoqing404

Does your summarized_description.txt contain relevant examples, and does the generated create_summarized_entities.parquet have any content?

Aug 19 '24 07:08 yangxue-1

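I have more detailed experimental findings and am happy to share them if you are interested.

Could you list one or two more optimization examples?

Also, did you design scenario-specific prompts for the optimization?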

Aug 19 '24 08:08 yangxue-1

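Does your summarized_description.txt contain relevant examples, and does the generated create_summarized_entities.parquet have any content?

I don't quite understand what you mean. I edited summarized_description.txt by hand and added a small number of examples (3). create_summarized_entities.parquet does have content.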

Aug 20 '24 02:08 shaoqing404

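Does your summarized_description.txt contain relevant examples, and does the generated create_summarized_entities.parquet have any content?

I don't quite understand what you mean. I edited summarized_description.txt by hand and added a small number of examples (3). create_summarized_entities.parquet does have content.

Did you manually modify all four templates generated by prompt_tune? Could you share the modified templates for reference?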

Aug 20 '24 04:08 KDD2018

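Does your summarized_description.txt contain relevant examples, and does the generated create_summarized_entities.parquet have any content?

I don't quite understand what you mean. I edited summarized_description.txt by hand and added a small number of examples (3). create_summarized_entities.parquet does have content.

Could you share the format of the examples?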

Aug 20 '24 06:08 yangxue-1

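Does your summarized_description.txt contain relevant examples, and does the generated create_summarized_entities.parquet have any content?

I don't quite understand what you mean. I edited summarized_description.txt by hand and added a small number of examples (3). create_summarized_entities.parquet does have content.

Did you manually modify all four templates generated by prompt_tune? Could you share the modified templates for reference?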

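There is nothing universal to share. You need to modify the entities and summarize prompts and rewrite them for your own situation. The goal of this step is to build entity extraction that fits your documents. Taking entities as an example:

A: Modify the entity_type in step 1 and add the types you think apply, for example: - entity_type: Suggest several labels or categories for the entity. The categories should not be specific, but should be as general as possible. These are some reference entity types: [ORGANIZATION, PERSON, GEO, EVENT, EQUIPMENT, FACILITIES, DEPARTMENT, ROLE, POSITION, OPERATION, TASK or more]

B: Copy the whole prompt, together with the generated Example 1, and give it to deepseek or gpt4o (do not use 4omini; it performs poorly). Pick a few passages from your documents that you consider typical or good, have the model extract entities and relationships from them, and check by hand whether the output meets your needs and whether it finds the implicit relationships you want.

C: Fill the examples you built back into the prompt text, roughly like this.

D: Start building the index.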
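Step A can be scripted so the type list lives in one place. The following is only a sketch: `entity_type_instruction` is a hypothetical helper, not part of GraphRAG, and it simply regenerates the instruction line quoted in step A from a list of type names.

```python
from typing import Iterable

# Reference types copied from step A above; extend this list for your own documents.
REFERENCE_TYPES = [
    "ORGANIZATION", "PERSON", "GEO", "EVENT", "EQUIPMENT", "FACILITIES",
    "DEPARTMENT", "ROLE", "POSITION", "OPERATION", "TASK",
]


def entity_type_instruction(types: Iterable[str]) -> str:
    """Build the entity_type instruction line from a list of type names."""
    # Normalise to upper case and drop blanks and duplicates, keeping first-seen order.
    unique = dict.fromkeys(t.strip().upper() for t in types if t.strip())
    return (
        "- entity_type: Suggest several labels or categories for the entity. "
        "The categories should not be specific, but should be as general as possible. "
        f"These are some reference entity types: [{', '.join(unique)} or more]"
    )


# Example: add two domain-specific types (illustrative names) to the reference list.
line = entity_type_instruction(REFERENCE_TYPES + ["flight_type", "aircraft_type"])
```

The resulting line would be pasted into the entity extraction prompt by hand before running steps B to D.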

Aug 20 '24 07:08 shaoqing404

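Does your summarized_description.txt contain relevant examples, and does the generated create_summarized_entities.parquet have any content?

I don't quite understand what you mean. I edited summarized_description.txt by hand and added a small number of examples (3). create_summarized_entities.parquet does have content.

Could you share the format of the examples?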

You are an expert in aviation operations and regulatory compliance. You are skilled at interpreting and analyzing complex normative documents to understand operational frameworks and community structures. You are adept at helping people with identifying the relations and structure of the community of interest within domains such as flight operations, procedures, and management, particularly in the context of airline operations centers (AOC). Using your expertise, you're asked to generate a comprehensive summary of the data provided below. Given one or two entities, and a list of descriptions, all related to the same entity or group of entities. Please concatenate all of these into a single, concise description in Chinese. Make sure to include information collected from all the descriptions. If the provided descriptions are contradictory, please resolve the contradictions and provide a single, coherent summary. Make sure it is written in third person, and include the entity names so we have the full context.

Enrich it as much as you can with relevant information from the nearby text, this is very important.

If no answer is possible, or the description is empty, only convey information that is provided within the text.

-Examples-
######################

Example 1:

Entities: {entity_name}
Description List: {description_list}
######################

#######
-Data-
Entities: {entity_name}
Description List: {description_list}
#######
Output:
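For anyone who wants to preview what the model receives, the two placeholders in the template above can be filled in by hand. This is a sketch only: `render_summarize_prompt` is a hypothetical helper, the JSON encoding of the values is an illustrative choice rather than a statement about what GraphRAG does internally, and `DATA_SECTION` reproduces just the data section of the template, laid out one field per line.

```python
import json
from typing import Iterable

DATA_SECTION = """#######
-Data-
Entities: {entity_name}
Description List: {description_list}
#######
Output:"""


def render_summarize_prompt(
    template: str, entity_name: str, descriptions: Iterable[str]
) -> str:
    """Substitute the two placeholders without touching any other braces in the prompt."""
    # str.replace is used instead of str.format so that literal braces in
    # hand-written examples (for instance JSON snippets) do not raise errors.
    return template.replace(
        "{entity_name}", json.dumps(entity_name, ensure_ascii=False)
    ).replace(
        "{description_list}", json.dumps(list(descriptions), ensure_ascii=False)
    )


preview = render_summarize_prompt(
    DATA_SECTION, "A330", ["wide-body aircraft", "used on flight 4499"]
)
```

Printing `preview` makes it easy to check that a hand-edited template still has both placeholders in the right place before indexing.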

Aug 20 '24 07:08 shaoqing404

@shaoqing404 Thanks for the guidance. Did you modify the claim and community templates?

Aug 20 '24 08:08 KDD2018

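Does your summarized_description.txt contain relevant examples, and does the generated create_summarized_entities.parquet have any content?

I don't quite understand what you mean. I edited summarized_description.txt by hand and added a small number of examples (3). create_summarized_entities.parquet does have content.

Could you share the format of the examples?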


Thank you very much.

Aug 21 '24 01:08 yangxue-1

@shaoqing404 Thanks for the guidance. Did you modify the claim and community templates?

No, I left them unchanged. Those two are related to LCC.

Aug 21 '24 02:08 shaoqing404

This issue has been marked stale due to inactivity after repo maintainer or community member responses that request more information or suggest a solution. It will be closed after five additional days.

Aug 30 '24 01:08 github-actions[bot]

This issue has been closed after being marked as stale for five days. Please reopen if needed.

Sep 04 '24 01:09 github-actions[bot]
