[Feature Request]: Prompt Tuning with given entities
Do you need to file an issue?
- [X] I have searched the existing issues and this feature is not already filed.
- [X] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
- [X] I believe this is a legitimate feature request, not just a question. If this is a question, please use the Discussions area.
Is your feature request related to a problem? Please describe.
Prompt auto-tuning identifies too many entity types on a diverse document corpus such as news.
Describe the solution you'd like
The solution would be to provide a set of desired entity types to the auto-tuning step, as is already possible with "domain". An alternative would be to present the user with a list of candidate entity types that can be edited before the final prompt templates, including the few-shot examples, are created.
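For illustration, a hypothetical invocation could look like the following; note that `--entity-types` does not exist today and is sketched here only to show the requested behavior next to the existing `--domain` option:

```bash
# --domain exists today; --entity-types is the *proposed* (hypothetical) flag
python -m graphrag.prompt_tune \
  --root . \
  --config ./settings.yaml \
  --domain "news" \
  --entity-types "person,organization,geo,event"
```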
Additional context
No response
I face a similar issue with the entities extracted in the indexing step. Even though I specify a set of entity types in the config, more entity types show up when the graph is created. I know this happens because the LLM treats the types more as a suggestion than as a hard constraint. At the moment I have a band-aid fix in place that filters the entities before adding them to the graph, but it would be nice if there were a field in the config telling GraphRAG to extract only the specified entity types.
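For what it's worth, a minimal sketch of such a post-extraction filter, assuming the extracted entities sit in a pandas DataFrame with a `type` column (the column name and the casing handling are assumptions, not GraphRAG's actual schema):

```python
import pandas as pd

# Entity types we actually want in the graph (assumed to mirror
# entity_extraction.entity_types in settings.yaml).
ALLOWED_TYPES = {"organization", "person", "geo", "event"}

def filter_entities(entities: pd.DataFrame) -> pd.DataFrame:
    """Drop rows whose entity type is outside the allowed set.

    Comparison is case-insensitive because LLM output casing varies.
    """
    mask = entities["type"].str.lower().isin(ALLOWED_TYPES)
    return entities[mask].reset_index(drop=True)
```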
How did you specify the entity set in the config file? Do I need to specify some base entities before the prompt extracts entities on its own? How do I do that?
You can manually prompt-tune using a model such as gpt-4o.
In settings.yaml there is an entity_extraction section containing the entity_types field, where you can specify the types of entities you want the LLM to extract. However, they are treated more as a suggestion during indexing and are ignored entirely during prompt tuning.
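For reference, the relevant section of the generated settings.yaml looks roughly like this (defaults as produced by init; double-check against your own file):

```yaml
entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization, person, geo, event]
  max_gleanings: 1
```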
Thanks for updating this information!
Thanks for the discussion!
By default, the prompt for entity extraction is this one: https://github.com/microsoft/graphrag/blob/main/graphrag/index/graph/extractors/graph/prompts.py. As you can see, the `entity_types` from settings.yaml are injected at line 13. Since it is just a prompt, there is always a chance that the LLM ignores its instructions. I would really love it if you could contribute your post-extraction filter, @andreiionut1411, since I think that would solve the fact that the types are only a suggestion!
However, the original problem in this issue is that the entity types in settings.yaml are completely ignored when prompt tuning. Prompt tuning like this:
`python -m graphrag.prompt_tune --root . --no-entity-types --config ./settings.yaml`
with `--no-entity-types` removes specific entity types from the prompt completely, while prompt tuning without that flag activates another step that tries to guess relevant entity types in your data and injects them into the prompt. You can run the prompt tuning with the different settings and inspect the resulting extraction prompt, which is written to disk. I found that `--no-entity-types` performs really badly on diverse data, which is in contrast to the recommendation to use it. And without that flag, far too many entity types are created.
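To compare the two behaviors side by side, you can run both variants and diff the tuned prompts (the output location of the tuned prompts may vary by version; check your prompts folder):

```bash
# Variant 1: no flag - an extra step guesses entity types from the data
# and injects them into the prompt (far too many on diverse corpora).
python -m graphrag.prompt_tune --root . --config ./settings.yaml

# Variant 2: --no-entity-types - entity types are stripped from the
# prompt entirely (recommended by the docs, but weak on diverse data).
python -m graphrag.prompt_tune --root . --config ./settings.yaml --no-entity-types

# Inspect the tuned extraction prompt that was written to disk:
cat prompts/entity_extraction.txt
```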
In my opinion, the most common use case is to extend the very reasonable default entity types (geo, person, event, ...) with only a few selected domain-specific ones.
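Concretely, that would mean keeping the defaults in settings.yaml and appending a handful of domain-specific types, e.g. (the added types here are purely illustrative):

```yaml
entity_extraction:
  entity_types:
    # GraphRAG defaults
    - organization
    - person
    - geo
    - event
    # domain-specific additions (illustrative)
    - drug
    - clinical_trial
```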
I have a similar issue. I would like to use Microsoft GraphRAG to quickly spin up a chat application where entities come both from this extraction process (the raw body of the documents) and from additional document metadata I already have in place.
I would then like the query function to consider entities regardless of whether they were found and created by the LLM or were already present in the document metadata.
Is there an easy workaround for this?
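Not an official API, but one heavy-handed workaround might be to append metadata-derived rows to the entity table the indexer writes before you query; the file name and column names below are assumptions based on the parquet artifacts in the output folder, so verify them against your own run:

```python
import pandas as pd

# Assumed artifact path and schema; verify against your output folder.
ENTITIES_PARQUET = "output/artifacts/create_final_entities.parquet"

def merge_metadata_entities(metadata_entities: pd.DataFrame) -> None:
    """Append metadata-derived entities to the LLM-extracted ones.

    metadata_entities must already match the parquet's schema; any
    missing columns will show up as NaN and may break querying.
    """
    extracted = pd.read_parquet(ENTITIES_PARQUET)
    combined = pd.concat([extracted, metadata_entities], ignore_index=True)
    # Column names here ("name", "type") are assumptions.
    combined = combined.drop_duplicates(subset=["name", "type"])
    combined.to_parquet(ENTITIES_PARQUET, index=False)
```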
I am using
`!python -m graphrag.prompt_tune --root . --config ./settings.yaml --domain "lorem ipsum" --language English --no-entity-types`
where "lorem ipsum" is clearly a placeholder; I described the domain of my documents generically.
Can someone confirm that the prompt tuning command actually corrupts community_report.txt rather than improving it?
This here is the default one created by init.
After "Goal" and before "Report Structure" I get a "Domain" section added, and it simply pastes one of my documents there. Isn't that lousy? My documents are mostly raw conversations.
More issues:
- claim_extraction.txt is not tuned
- entity_extraction.txt ignores my `entity_types` listed in `settings.yaml` under `entity_extraction` (this is mentioned above, I think), and I don't see much difference removing `--no-entity-types`
- the docs have an error: the option `--selection-method` is instead listed as `--method` in its description
- using `auto` on default settings leads to this error:
  `Loading Input (InputFileType.csv). Process failed to invoke LLM 1/10 attempts. Cause: rate limit exceeded, will retry. Recommended sleep for 86400 seconds. Follow recommendation? True`
isn't that 24 hours? XD
Hi, I was wondering if there are any updates or plans regarding this feature request? Thanks