How to enforce an ontology in KG extraction

Open MichaelJeffersonCook opened this issue 9 months ago • 2 comments

Hi,

Love the work!

Is there a way to enforce a specific ontology when extracting data from PDF documents? I'm working on an application where I want to search by specific fields, but ingestion appears to assign different categories to the same thing across documents. What I'd like is for ingestion to use a template that specifies which fields to extract and what to name them.

For example, say we have 3 insurance policy documents. After ingestion, I found that one had the category "Policy" and another had the category "Policy Number". Curiously, both had the policy number in the description field. I'd like to be able to specify that the category "Policy Number" is used in both cases.

Thank you! Mike

MichaelJeffersonCook, Feb 09 '25 15:02

After starting up R2R in a container, I added "Policy Number" and "Insurance Company" as strings to the entity_types array in the .toml file. In "Settings" I could see that these entity types had been read in.

I'm not sure what to expect when defining these. Entities were extracted, but the two types above were only a subset of the entity types that showed up in the results. I wasn't sure, but it seemed logical to me that extraction would use only the entity_types I had defined.
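To make the expected behaviour concrete, here is a minimal client-side sketch of applying the restriction after extraction. It does not use any R2R API: the `Entity` shape, the `ONTOLOGY` table, and the alias lists (including treating "Policy" as an alias of "Policy Number") are all assumptions made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical ontology: canonical entity type -> aliases seen in extraction output.
# The aliases are illustrative assumptions, not something R2R produces or defines.
ONTOLOGY = {
    "Policy Number": {"policy", "policy number", "policy no."},
    "Insurance Company": {"insurance company", "insurer"},
}


@dataclass(frozen=True)
class Entity:
    """Assumed shape of an extracted entity (name, category, description)."""
    name: str
    category: str
    description: str = ""


def build_alias_index(ontology):
    """Map every lower-cased canonical name and alias to its canonical type."""
    index = {}
    for canonical, aliases in ontology.items():
        index[canonical.casefold()] = canonical
        for alias in aliases:
            index[alias.casefold()] = canonical
    return index


def enforce_ontology(entities, ontology=ONTOLOGY):
    """Rename known categories to their canonical type and drop all others."""
    index = build_alias_index(ontology)
    kept = []
    for entity in entities:
        canonical = index.get(entity.category.strip().casefold())
        if canonical is None:
            continue  # category is outside the ontology, so discard the entity
        kept.append(Entity(entity.name, canonical, entity.description))
    return kept


extracted = [
    Entity("PN-001", "Policy", "policy number"),
    Entity("PN-002", "Policy Number", "policy number"),
    Entity("Sedan", "Vehicle", "insured vehicle"),
]
normalized = enforce_ontology(extracted)
print([e.category for e in normalized])
```

With this, both policy entities end up under "Policy Number" and the "Vehicle" entity is dropped. It only cleans up results after the fact; it does not change what the extraction step itself produces.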

How is this feature supposed to work?

MichaelJeffersonCook, Feb 12 '25 23:02

Sorry for the delayed response. You're correct that both entity_types and relation_types should be configurable. It seems that we are not properly propagating these during the extraction process. I will add it to my short list to reimplement; otherwise, we'd definitely welcome a PR for this!

NolanTrem, Feb 14 '25 18:02