Highlighting of search terms does not work on Documents
I have set ALEPH_RESULT_HIGHLIGHT=true in aleph.env and highlighting of the search term seems to work on entity types other than Documents. For example, searching for a term that results in hits in CourtCase entities will display a highlight of the search term if the search term is in, e.g., the Summary field of CourtCase.
For Documents highlighting works if the search term is in the title attribute of the Document entity (or other attributes), but not if the search term is in the indexed text of the document.
I can see from the text tab in the detail view of the document that the text has been indexed correctly, but still no highlight is displayed.
A search limited to CourtCase entities returns the highlight attribute. Limiting the search to Document (Pages) entities does not return the highlight attribute on the results.


The strange thing is that highlighting seems to work if the entity is of type Image which has been OCRed into text. Then the search term (if found in OCRed text) will appear as highlighted.
If I do a search within the document when viewing the Document entity itself (the scope is only that Document), then highlighting works and the Pages returned contain the highlight attribute
Any hints on where to start debugging this @sunu? This is a bit of a show stopper for our users, and I'd like to get it sorted out before we start using Aleph. I am happy to dig in more to see if I can find the bug. I can now add that ingesting pure txt-documents does not generate highlights either. Very strange.
Hey @anderser, I don't have any good guesses for the potential cause of this issue. But I would start by inspecting the Elasticsearch queries that don't return a highlight and their raw result from Elasticsearch before they get serialized by Aleph. This should help us look for any serialization bug if there's one.
Digging into this, it seems that entities of type Image or PlainText return a highlight property on the ES search result if the highlight_query is removed from the ES query.
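To make that inspection a bit more systematic, here is a minimal sketch that tallies, per schema, how many raw hits carry a `highlight` key. It assumes the usual Elasticsearch response shape (`hits.hits[]` with `_source` and an optional `highlight`); the function name `highlight_coverage` and the sample response are made up for illustration and are not part of Aleph.

```python
from collections import defaultdict


def highlight_coverage(es_response):
    """Count, per schema, how many hits do and do not carry a highlight."""
    coverage = defaultdict(lambda: {"with": 0, "without": 0})
    for hit in es_response.get("hits", {}).get("hits", []):
        schema = hit.get("_source", {}).get("schema", "unknown")
        # An absent or empty highlight dict both count as "without".
        key = "with" if hit.get("highlight") else "without"
        coverage[schema][key] += 1
    return dict(coverage)


# Hypothetical raw response, trimmed to the keys the function reads.
sample = {
    "hits": {
        "hits": [
            {
                "_source": {"schema": "CourtCase"},
                "highlight": {"properties.summary": ["<em>Sorin</em> ..."]},
            },
            {"_source": {"schema": "PlainText"}},
        ]
    }
}

coverage = highlight_coverage(sample)
print(coverage)
```

Running this against the raw response of a failing search should show at a glance whether the highlight is missing already in Elasticsearch or is lost later during serialization.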
The default query when searching top level in Aleph (http://localhost:8080/search?limit=30&q=Sorin&sort=caption%3Aasc) seems to look like this
{
  "query": {
    "bool": {
      "should": [
        {
          "query_string": {
            "query": "Sorin",
            "lenient": true,
            "fields": [
              "fingerprints.text^3",
              "text"
            ],
            "default_operator": "AND",
            "minimum_should_match": "66%"
          }
        }
      ],
      "must": [],
      "must_not": [],
      "filter": [],
      "minimum_should_match": 1
    }
  },
  "post_filter": {
    "bool": {
      "filter": []
    }
  },
  "from": 0,
  "size": 30,
  "aggregations": {},
  "sort": [],
  "highlight": {
    "encoder": "html",
    "fields": {
      "properties.*": {
        "highlight_query": {
          "query_string": {
            "query": "Sorin",
            "lenient": true,
            "fields": [
              "properties.*"
            ],
            "default_operator": "AND",
            "minimum_should_match": "66%"
          }
        },
        "require_field_match": false,
        "number_of_fragments": 3,
        "fragment_size": 120
      }
    }
  },
  "_source": {
    "includes": [
      "schema",
      "properties",
      "collection_id",
      "profile_id",
      "role_id",
      "mutable",
      "created_at",
      "updated_at"
    ]
  }
}
Dumping get_body() from here: https://github.com/alephdata/aleph/blob/529858ec27eb77b1b050169c0a977176235f4c8e/aleph/search/query.py#L281
As you can see the highlight_query is limited to the fields properties.* and that seems to stop it from generating highlights. But if I set the fields value of the highlight_query to all the fields used in the main query like this:
"fields": [
"properties.*",
"fingerprints.text^3",
"text"
],
The results will return a highlight property if the entity is of type Image or PlainText. The same applies if the highlight_query property is removed totally, commenting out this line: https://github.com/alephdata/aleph/blob/main/aleph/search/query.py#L227.
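For anyone who wants to reproduce the experiment, here is a rough sketch of building the highlight block with the widened field list. It only mirrors the JSON dumped above; `build_highlight` and its parameters are names I made up, not Aleph's actual query code.

```python
def build_highlight(text, main_fields, count=3, length=120):
    """Build a highlight block whose highlight_query also searches the
    fields used by the main query (hypothetical helper, not Aleph code)."""
    # Keep properties.* first, then append the main query fields once each.
    fields = ["properties.*"]
    for field in main_fields:
        if field not in fields:
            fields.append(field)
    return {
        "encoder": "html",
        "fields": {
            "properties.*": {
                "highlight_query": {
                    "query_string": {
                        "query": text,
                        "lenient": True,
                        "fields": fields,
                        "default_operator": "AND",
                        "minimum_should_match": "66%",
                    }
                },
                "require_field_match": False,
                "number_of_fragments": count,
                "fragment_size": length,
            }
        },
    }


highlight = build_highlight("Sorin", ["fingerprints.text^3", "text"])
spec = highlight["fields"]["properties.*"]
print(spec["highlight_query"]["query_string"]["fields"])
```

The resulting dict can be dropped into the "highlight" key of a manual Elasticsearch request to compare results against the default query.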
The ES docs seem to indicate that one should take care when combining boolean queries and highlighting: https://www.elastic.co/guide/en/elasticsearch/reference/7.16/highlighting.html#highlighting
I don't know if that applies here. And I am not that deep into the inner workings of Aleph query construction to really see what is failing here.
PDFs/Pages
In addition, removing highlight_query does not help with getting highlights on Pages (PDF documents).
But I see that the index aleph-entity-page-v1 is not included in the search. If I include that in manual ES searches, entities of type Page are returned and they do have highlighting if making the above adjustments to highlight_query. I must say this confused me even more, but I suppose Page entities are not returned in top level searches.
I'll happily discuss this more with someone with deep knowledge of Aleph search logic.
Seems like this is caused by Page not extending Thing and the main query including filter:schemata=Thing: https://github.com/alephdata/aleph/blob/5663e30d9138c2ff3137caf40a9afbdca896223d/ui/src/queries.js#L16
When I change this query to /entities?filter:schemata=Thing,Page&highlight=true&limit=30&q=test it returns Thing + Page entities with highlights. Though this does not feel like the most elegant solution.
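To illustrate why the schemata filter drops Page entities, here is a toy model of the matching logic. The parent map is a deliberate simplification that I wrote for this example (it is not the real followthemoney model); the only fact taken from this thread is that Page does not extend Thing.

```python
# Toy parent map, simplified for illustration only.
PARENTS = {
    "Thing": [],
    "Document": ["Thing"],
    "Pages": ["Document"],
    "PlainText": ["Document"],
    "Image": ["Document"],
    "Page": [],  # per this thread: Page does not extend Thing
}


def schemata(schema):
    """Return the schema itself plus all of its ancestors."""
    seen = []
    stack = [schema]
    while stack:
        name = stack.pop()
        if name not in seen:
            seen.append(name)
            stack.extend(PARENTS.get(name, []))
    return seen


def matches_filter(schema, wanted):
    """True if the schema or any of its ancestors is in the filter set."""
    return any(name in wanted for name in schemata(schema))


print(matches_filter("PlainText", {"Thing"}))
print(matches_filter("Page", {"Thing"}))
print(matches_filter("Page", {"Thing", "Page"}))
```

Under this model a filter on Thing alone keeps PlainText and Image but never Page, which matches what the searches above return.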
I also tried including &exclude:schemata=Pages in the query to filter out the multi-page documents, but it seems that in that case, the Document entities themselves are still returned.
Ah, i was mistaken, the Document entities that are returned have the search string in their title, as expected.
So changing the query from /entities?filter:schemata=Thing&highlight=true&limit=30&q=test to /entities?filter:schemata=Thing,Page&exclude:schemata=Pages&highlight=true&limit=30&q=test would return pages with highlights, while filtering out the corresponding documents that contain the search string in their pages.
Would this be an acceptable change?
I don't think that is the best solution, no. Wouldn't that return a lot of results if you search for a term that is present on many pages in a document? I think the users expect to get one result per document in the result list, but with one or more highlights from that document (in reality from the Pages in that document). Returning pages as separate results would probably make more noise than before.
Hey @anderser, @monneyboi! Thanks for looking into the highlighting issue. I spent some time on it this morning and here's my thoughts:
- I kind of agree with @anderser here that returning pages directly in the search results page has the potential to pollute the results if we show them individually.
- I tried a variation of highlight_query as mentioned in https://github.com/alephdata/aleph/issues/2138#issuecomment-1071096093 and for Documents I always end up with highlight results that highlighted a completely irrelevant part of the text. I need to dig deeper to find out why that's happening.
- Re: the query change @monneyboi mentioned in https://github.com/alephdata/aleph/issues/2138#issuecomment-1117290523, the query is slightly malformed. The correct syntax would be something like: /entities?filter:schemata=Thing&filter:schemata=Page&exclude:schemata=Pages&highlight=true&limit=30&q=test. This would actually not return any Pages at all and I don't think that's a great option.
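As a small aside on the repeated-parameter syntax: a sketch of building such a URL in Python, where each schema gets its own `filter:schemata` pair instead of a comma-separated value. The helper name `entities_url` is made up for this example; only the parameter names come from the URLs in this thread.

```python
from urllib.parse import urlencode


def entities_url(text, include, exclude=(), limit=30):
    """Build an /entities query string with one key=value pair per schema."""
    params = [("filter:schemata", schema) for schema in include]
    params += [("exclude:schemata", schema) for schema in exclude]
    params += [("highlight", "true"), ("limit", str(limit)), ("q", text)]
    # safe=":" keeps the colon in the parameter names unescaped.
    return "/entities?" + urlencode(params, safe=":")


url = entities_url("test", ["Thing", "Page"], ["Pages"])
print(url)
```

Passing a list of tuples to `urlencode` is what allows the same key to appear more than once; a dict would collapse the repeated keys.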
I wonder if we should make an extra request to Elasticsearch while serializing a batch of results to fetch highlighted pages for the documents in that result batch. Then we can show the matched pages with highlights by nesting them inside document results. Something like this:

This would take more work to implement but the end result might be more useful than showing text snippets only?
Let me know what you think!
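A very rough sketch of what the extra request and the nesting could look like, to make the idea concrete. Everything here is an assumption for discussion: the field names `properties.document` and `properties.index`, the `text` highlight field, the `max_pages` cap and both function names are hypothetical, and none of this is existing Aleph code.

```python
from collections import defaultdict


def build_page_query(document_ids, text, size=100):
    """Second query: Page hits for the documents in one result batch.
    Field names are assumptions, not taken from the Aleph mapping."""
    return {
        "query": {
            "bool": {
                "must": [
                    {
                        "query_string": {
                            "query": text,
                            "fields": ["text"],
                            "default_operator": "AND",
                        }
                    }
                ],
                "filter": [
                    {"terms": {"properties.document": list(document_ids)}}
                ],
            }
        },
        "size": size,
        "highlight": {
            "encoder": "html",
            "fields": {"text": {"number_of_fragments": 1, "fragment_size": 120}},
        },
        "_source": {"includes": ["properties.document", "properties.index"]},
    }


def nest_pages(results, page_hits, max_pages=3):
    """Attach up to max_pages highlighted pages to each document result,
    keeping the order in which the page hits were returned."""
    by_document = defaultdict(list)
    for hit in page_hits:
        props = hit.get("_source", {}).get("properties", {})
        snippets = hit.get("highlight", {}).get("text", [])
        for doc_id in props.get("document", []):
            for index in props.get("index", []):
                by_document[doc_id].append(
                    {"page": int(index), "snippets": snippets}
                )
    return [
        {**result, "pages": by_document.get(result["id"], [])[:max_pages]}
        for result in results
    ]


# Hypothetical batch: two documents, two page hits for the first one.
results = [{"id": "doc1"}, {"id": "doc2"}]
page_hits = [
    {
        "_source": {"properties": {"document": ["doc1"], "index": ["4"]}},
        "highlight": {"text": ["a <em>test</em> snippet"]},
    },
    {
        "_source": {"properties": {"document": ["doc1"], "index": ["2"]}},
        "highlight": {"text": ["another <em>test</em>"]},
    },
]
nested = nest_pages(results, page_hits, max_pages=1)
print(nested)
```

Capping the nested pages per document keeps the result list scannable even when a long document matches on every page.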
Thanks for investigating @sunu!
As for showing the pages inlined in the results I think this also might pollute the result. Usually the journalists do a search and then scan the results fast visually to find relevant documents. I think a small snippet of some of the hits in the document with surrounding words is enough.
Knowing which page the hit is on is not relevant in this context, in my opinion. You just want to see if this document is relevant or not. If it seems relevant, you can dig in and do a search inside the document to find out more about the pages etc. But that might differ for other users and use cases of course!
I see that DocumentCloud uses that way of showing it. It might of course work, but then you would have to limit the number of highlights/pages shown at least so that your eyes can scan forward to the next document hit fast without needing to scroll through several pages (think of 100 page doc with hit on every page).
It is not easy to care for both fast scanning of search results, and giving the user enough context to decide if this is a relevant doc or not...
Yes, if we decide to show the pages like DocumentCloud does, we definitely need to limit it to 2-3 pages max.
But if we keep the current behaviour, I guess the next step is to figure out why the Pages index is returning inaccurate highlights. I'll try to dig more into that when I get the time. If you have the time to craft some raw Elasticsearch queries and test highlight queries against the Pages index, please feel free to have a go.
> Knowing which page the hit is on is not relevant in this context, in my opinion. You just want to see if this document is relevant or not. If it seems relevant, you can dig in and do a search inside the document to find out more about the pages etc. But that might differ for other users and use cases of course!
I agree that showing pages isn't necessary, showing a snippet with surrounding words would be enough.
hi @anderser , @sunu how are you? ^^ @monneyboi said I should come to the party :)
I think knowing the page number is relevant because you might want to search for more context on the pages - the context might be a lot larger than the snippet. Or you can see a search term concentrated in a certain part of the document. It is valuable info imho
Hmm, after looking at this more closely, i think we're actually discussing 2 separate things here.
1. The search highlights are currently not working for Pages entities, which is an issue.
2. How to improve the search for documents to provide more context about what page a search term is found on, which is more like a feature request.
Would it be an idea to open a separate issue to track point 2, the page context in search?
Also, when i've got some time next week, i'll see if i can investigate the missing highlights.
Yes, let's focus on fixing point 1 first. For point 2, I have created a new issue: https://github.com/alephdata/aleph/issues/2270 and linked to our opinions in this thread.
The search matches on a text field that is excluded from search results by default. Also, this text field is not part of the properties.* field that is used in the entities highlight query, so won't be included in the highlight.
After including this field in results and highlights, the highlight results are weird: Elasticsearch highlights the wrong parts of the text. This seems to be caused by the term_vector configuration; when I comment out that line, I get proper search highlights.
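For reference, a minimal sketch of the two changes described above: adding the text field to the highlight block, and a text mapping with and without term vectors. The helper names are hypothetical, and the term_vector value shown (`with_positions_offsets`) is only one valid Elasticsearch option used as an example; I am not claiming it is the value in the Aleph mapping.

```python
def highlight_with_text(text, count=3, length=120):
    """Highlight block covering both properties.* and the text field."""

    def field_spec(field):
        return {
            "highlight_query": {
                "query_string": {
                    "query": text,
                    "lenient": True,
                    "fields": [field],
                    "default_operator": "AND",
                }
            },
            "require_field_match": False,
            "number_of_fragments": count,
            "fragment_size": length,
        }

    return {
        "encoder": "html",
        "fields": {
            "properties.*": field_spec("properties.*"),
            "text": field_spec("text"),
        },
    }


def text_mapping(term_vector=None):
    """Text field mapping; term_vector is left out unless one is given."""
    mapping = {"type": "text"}
    if term_vector is not None:
        mapping["term_vector"] = term_vector
    return mapping


highlight = highlight_with_text("Sorin")
with_vectors = text_mapping("with_positions_offsets")  # example value only
without_vectors = text_mapping()
print(sorted(highlight["fields"]), with_vectors, without_vectors)
```

Comparing highlights from two test indexes built with these two mappings would be one way to confirm whether the term vectors are really what throws the highlighter off.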
I've asked @pudo for a bit of explanation before i start tackling this, as i'd love to know a bit more about the reasoning behind excluding these text fields etc.
Should be fixed with https://github.com/alephdata/aleph/pull/2416