[Feature Request]: Prompt Tuning with given entities
Do you need to file an issue?
- [X] I have searched the existing issues and this feature is not already filed.
- [X] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
- [X] I believe this is a legitimate feature request, not just a question. If this is a question, please use the Discussions area.
Is your feature request related to a problem? Please describe.
Prompt auto-tuning identifies too many entity types on a diverse document corpus such as news.
Describe the solution you'd like
The solution would be to provide a set of desired entity types to the auto-tuning step, as is already possible with "domain". An alternative would be to present the user with a list of candidate entity types that can be edited before the final prompt templates, including the few-shot examples, are created.
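For illustration, a hypothetical invocation could look like the following; note that `--entity-types` does not exist today and is sketched here only to show the requested behavior next to the existing `--domain` option:

```bash
# --domain exists today; --entity-types is the *proposed* (hypothetical) flag
python -m graphrag.prompt_tune \
  --root . \
  --config ./settings.yaml \
  --domain "news" \
  --entity-types "person,organization,geo,event"
```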
Additional context
No response
I face a similar issue with the entities extracted in the indexing step. Even though I specify a set of entity types in the config, more entity types show up when the graph is created. I know this happens because the LLM treats the types more as a suggestion than as a hard constraint. At the moment I have a band-aid fix in place that filters the entities before adding them to the graph, but it would be nice if there were a field in the config telling GraphRAG to extract only the specified entity types.
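For what it's worth, a minimal sketch of such a post-extraction filter, assuming the extracted entities sit in a pandas DataFrame with a `type` column (the column name and the casing handling are assumptions, not GraphRAG's actual schema):

```python
import pandas as pd

# Entity types we actually want in the graph (assumed to mirror
# entity_extraction.entity_types in settings.yaml).
ALLOWED_TYPES = {"organization", "person", "geo", "event"}

def filter_entities(entities: pd.DataFrame) -> pd.DataFrame:
    """Drop rows whose entity type is outside the allowed set.

    Comparison is case-insensitive because LLM output casing varies.
    """
    mask = entities["type"].str.lower().isin(ALLOWED_TYPES)
    return entities[mask].reset_index(drop=True)
```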
How did you specify the entity set in the config file? Do I need to specify some base entities before the prompt extracts entities on its own? How do I do that?
You can manually prompt-tune using a model such as gpt-4o.
In settings.yaml there is an entity_extraction section containing the entity_types field, where you can specify the types of entities you want the LLM to extract. However, they are treated more as a suggestion during indexing and are ignored entirely during prompt tuning.
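For reference, the relevant section of the generated settings.yaml looks roughly like this (defaults as produced by init; double-check against your own file):

```yaml
entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization, person, geo, event]
  max_gleanings: 1
```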
Thanks for updating this information!
Thanks for the discussion!
By default, the prompt for entity extraction is this one: https://github.com/microsoft/graphrag/blob/main/graphrag/index/graph/extractors/graph/prompts.py. As you can see, the `entity_types` from settings.yaml are injected at line 13. Since it is just a prompt, there is always a chance that the LLM ignores its instructions. I would really love it if you could contribute your post-extraction filter, @andreiionut1411, since I think that would solve the fact that the types are only a suggestion!
However, the original problem in this issue is that the entity types in settings.yaml are completely ignored when prompt tuning. Prompt tuning like this:
`python -m graphrag.prompt_tune --root . --no-entity-types --config ./settings.yaml`
with `--no-entity-types` removes specific entity types from the prompt completely, while prompt tuning without that flag activates another step that tries to guess relevant entity types in your data and injects them into the prompt. You can run the prompt tuning with the different settings and inspect the resulting extraction prompt, which is written to disk. I found that `--no-entity-types` performs really badly on diverse data, which is in contrast to the recommendation to use it. And without that flag, far too many entity types are created.
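To compare the two behaviors side by side, you can run both variants and diff the tuned prompts (the output location of the tuned prompts may vary by version; check your prompts folder):

```bash
# Variant 1: no flag - an extra step guesses entity types from the data
# and injects them into the prompt (far too many on diverse corpora).
python -m graphrag.prompt_tune --root . --config ./settings.yaml

# Variant 2: --no-entity-types - entity types are stripped from the
# prompt entirely (recommended by the docs, but weak on diverse data).
python -m graphrag.prompt_tune --root . --config ./settings.yaml --no-entity-types

# Inspect the tuned extraction prompt that was written to disk:
cat prompts/entity_extraction.txt
```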
In my opinion, the most common use case is to extend the very reasonable default entity types (geo, person, event, ...) with only a few selected domain-specific ones.
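Concretely, that would mean keeping the defaults in settings.yaml and appending a handful of domain-specific types, e.g. (the added types here are purely illustrative):

```yaml
entity_extraction:
  entity_types:
    # GraphRAG defaults
    - organization
    - person
    - geo
    - event
    # domain-specific additions (illustrative)
    - drug
    - clinical_trial
```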
I have a similar issue. I would like to use Microsoft GraphRAG to quickly spin up a chat application where entities come both from this extraction process (the raw body of the documents) and from additional document metadata I already have in place.
I would then like the query function to consider entities regardless of whether they were found and created by the LLM or were already present in the document metadata.
Is there an easy workaround for this?
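Not an official API, but one heavy-handed workaround might be to append metadata-derived rows to the entity table the indexer writes before you query; the file name and column names below are assumptions based on the parquet artifacts in the output folder, so verify them against your own run:

```python
import pandas as pd

# Assumed artifact path and schema; verify against your output folder.
ENTITIES_PARQUET = "output/artifacts/create_final_entities.parquet"

def merge_metadata_entities(metadata_entities: pd.DataFrame) -> None:
    """Append metadata-derived entities to the LLM-extracted ones.

    metadata_entities must already match the parquet's schema; any
    missing columns will show up as NaN and may break querying.
    """
    extracted = pd.read_parquet(ENTITIES_PARQUET)
    combined = pd.concat([extracted, metadata_entities], ignore_index=True)
    # Column names here ("name", "type") are assumptions.
    combined = combined.drop_duplicates(subset=["name", "type"])
    combined.to_parquet(ENTITIES_PARQUET, index=False)
```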
I am using
`!python -m graphrag.prompt_tune --root . --config ./settings.yaml --domain "lorem ipsum" --language English --no-entity-types`
where "lorem ipsum" is clearly a placeholder; I described the domain of my documents generically.
Can someone confirm that the prompt tuning command actually corrupts community_report.txt rather than improving it?
This here is the default one created by init.
After "Goal" and before "Report Structure" I get a "Domain" section added, and it simply pastes one of my documents there. Isn't that lousy? My documents are mostly raw conversations.
More issues:
- claim_extraction.txt is not tuned
- entity_extraction.txt ignores my `entity_types` listed in `settings.yaml` under `entity_extraction` (this is mentioned above, I think), and I don't see much difference removing `--no-entity-types`
- the docs have an error: the option `--selection-method` is instead listed as `--method` in its description
- using `auto` on default settings leads to this error:
  `Loading Input (InputFileType.csv). Process failed to invoke LLM 1/10 attempts. Cause: rate limit exceeded, will retry. Recommended sleep for 86400 seconds. Follow recommendation? True`
isn't that 24 hours? XD
Hi, I was wondering if there are any updates or plans regarding this feature request? Thanks