Use elasticsearch's annotated-text field type for entities?

Open markharwood opened this issue 3 years ago • 4 comments

I'm new to datashare and was poking around the elasticsearch mapping and was surprised to see the use of parent/child docs for the extracted entities. Did you ever consider using the annotated text field type for entities? It's a feature I added to support this sort of use case. Am I missing something important with regards to why parent/child mappings were used in datashare? Sorry if I've overlooked something obvious.

markharwood avatar Jul 13 '22 14:07 markharwood

Hi Mark, you raised an excellent point!

ES told us about this last year, but we haven't yet invested the time to see how it could improve our current mapping. To be honest, we did not know about this feature when we started to index Named Entities, and I'm not sure it even existed yet (in 2017 if I recall correctly).

How do you think it could improve our indexes? In terms of performance, maybe?

pirhoo avatar Jul 13 '22 14:07 pirhoo

> I'm not sure it even existed yet (in 2017 if I recall correctly).

Fair enough.

> How do you think it could improve our indexes? In terms of performance, maybe?

I'd expect search to be faster and capable of doing more things. By weaving structured data into the text you can do proximity queries, like finding a mention of any entity name known to be a politician placed next to free text about lobbying/bribing/meeting etc. I illustrated some of that in this demo.
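As a rough illustration of such a proximity query, here is a minimal sketch that builds an Elasticsearch `span_near` request body in Python. The helper name `entities_near_terms`, the field name `content`, and the default slop are assumptions for illustration only. It also assumes that annotation values are matched verbatim (so entity names keep their case) while free-text terms must be given in their analyzed form (lowercase with the standard analyzer).

```python
def entities_near_terms(field, entities, terms, slop=5):
    """Build a span_near body: any listed entity within `slop` positions of any listed term.

    Hypothetical helper, not part of Datashare or of any Elasticsearch client.
    """
    def any_of(values):
        # span_or matches when at least one of the span_term clauses matches
        return {"span_or": {"clauses": [{"span_term": {field: v}} for v in values]}}

    return {
        "query": {
            "span_near": {
                "clauses": [any_of(entities), any_of(terms)],
                "slop": slop,
                "in_order": False,  # the entity may appear before or after the term
            }
        }
    }


# Example: known politicians mentioned close to lobbying-related words
body = entities_near_terms(
    "content",
    entities=["Emmanuel Macron"],
    terms=["met", "lobbying", "bribing"],
)
```

The resulting dictionary could be sent as the body of a `_search` request with whichever client is in use.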

Before proposing you switch I did want to check if there was a good reason you'd need parent/child e.g. securing certain entity-mention docs? Is performance currently an issue?

markharwood avatar Jul 13 '22 14:07 markharwood

I'm not even sure the parent/child relationship really matters anymore. At the time, we modelled that relationship as people coming from the relational database world. I don't think we use it any more.

The only change we made to our NamedEntity structure was to optimize disk usage. We create one NamedEntity per file, which then contains the list of the indexes of all its occurrences in that file.

Performance has been a big issue with indexes where we had hundreds of millions of Named Entities and were trying to run aggregations (to count them).

pirhoo avatar Jul 13 '22 15:07 pirhoo

I'd be tempted to model docs as follows:

  • Structured keyword fields e.g. "Person" or "Organisation" with arrays of values. Would be fast for aggregations.
  • Unstructured text fields using annotated-text type to hold text and structured values
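A mapping along these lines might look like the sketch below, written as a Python dictionary that could be passed as the body of an index-creation request. The field names `person`, `organisation` and `content` are taken from the example in this thread; everything else is an assumption. Note that the `annotated_text` type is provided by the `mapper-annotated-text` plugin, which has to be installed on the cluster.

```python
# Hypothetical mapping sketch: keyword arrays for fast aggregations,
# plus one annotated_text field carrying the text and its entity annotations.
ENTITY_FIELDS = ["person", "organisation"]

mapping = {
    "mappings": {
        "properties": {
            # keyword fields accept a single value or an array of values
            **{name: {"type": "keyword"} for name in ENTITY_FIELDS},
            # requires the mapper-annotated-text plugin
            "content": {"type": "annotated_text"},
        }
    }
}
```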

Example JSON doc:

{  
    "person" : ["Mark MacGann", "Emmanuel Macron"],
    "organisation" : ["Uber"],
    "content" : "[Uber](Uber)'s [MacGann](Mark%20MacGann) met with [Macron](Emmanuel%20Macron) later that day"
}
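A document of this shape could be assembled from raw text plus entity offsets, roughly as sketched below. The function names and the `(start, end, canonical_name, entity_type)` tuple layout are assumptions for illustration, not Datashare's actual NamedEntity structure. The sketch also assumes that mentions do not overlap and that the raw text contains no square brackets that would clash with the annotation syntax.

```python
from urllib.parse import quote


def annotate(text, mentions):
    """Wrap each mention as [surface text](URL-encoded canonical name)."""
    parts, cursor = [], 0
    for start, end, name, _entity_type in sorted(mentions):
        parts.append(text[cursor:start])  # untouched text before the mention
        parts.append(f"[{text[start:end]}]({quote(name, safe='')})")
        cursor = end
    parts.append(text[cursor:])  # trailing text after the last mention
    return "".join(parts)


def build_doc(text, mentions):
    """Return one document holding the annotated text and per-type entity arrays."""
    doc = {"content": annotate(text, mentions)}
    for _start, _end, name, entity_type in mentions:
        values = doc.setdefault(entity_type, [])
        if name not in values:  # keep each entity once per document
            values.append(name)
    return doc


text = "Uber's MacGann met with Macron later that day"
mentions = [
    (0, 4, "Uber", "organisation"),
    (7, 14, "Mark MacGann", "person"),
    (24, 30, "Emmanuel Macron", "person"),
]
doc = build_doc(text, mentions)
```

With the sample text and offsets above, `doc` has the same three fields and values as the example JSON document.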

The problem with parent/child is that when you search on the docs (e.g. "Uber AND met") and try to aggregate on the related entities like person, Elasticsearch has to join on the basis of potentially millions of shared doc IDs. This is never going to be as fast as when the search text and the related person entities are held in the same Lucene document.
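With entities stored in the same document as the text, the search-then-count pattern needs no join. A minimal sketch of such a request body follows; the helper name `search_with_entity_counts`, the aggregation name, and the default bucket count are assumptions, and the field names come from the example document above.

```python
def search_with_entity_counts(text_query, entity_field="person", buckets=10):
    """Build a request that filters on free text and counts co-occurring entities."""
    return {
        "size": 0,  # only the aggregation is needed, not the hits themselves
        "query": {
            "query_string": {"query": text_query, "default_field": "content"}
        },
        "aggs": {
            # terms aggregation over the keyword array held in the same document
            f"top_{entity_field}": {
                "terms": {"field": entity_field, "size": buckets}
            }
        },
    }


body = search_with_entity_counts("Uber AND met")
```

The same helper could be pointed at `organisation` to count organisations instead of people.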

markharwood avatar Jul 13 '22 16:07 markharwood

This issue is stale because it has been open for 40 days with no activity.

github-actions[bot] avatar Nov 11 '22 00:11 github-actions[bot]

We should do a POC to ensure this is a good fit for Datashare. Planned for Q1/2023.

pirhoo avatar Nov 14 '22 10:11 pirhoo

This issue is stale because it has been open for 40 days with no activity.

github-actions[bot] avatar Dec 25 '22 00:12 github-actions[bot]

This issue was closed because it has been inactive for 20 days since being marked as stale.

github-actions[bot] avatar Jan 15 '23 00:01 github-actions[bot]