aleph
BUG: Highlights not working for more than 1 token but I am completely lost here
Describe the bug When searching for "lorem", in the highlight context preview the token "lorem" is highlighted. When searching for "lorem ipsum", nothing is displayed.
To Reproduce Steps to reproduce the behavior:
- Check out the current main or develop branch (both affected) on a local machine
- Expose port 9200 from Elasticsearch to localhost for debugging
- Don't change anything else, just start a fresh Aleph with docker compose up
- Create a new investigation
- Upload a text document highlight.txt with this content: Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.
- In the Aleph UI, search for "lorem ipsum" vs. "lorem" (refresh the browser between searches as highlights might be cached in the UI)
- Check the highlight query Aleph is sending to Elasticsearch:
curl -X GET "localhost:9200/aleph-entity-plaintext-v1/_search?pretty" -H 'Content-Type: application/json' -d'{"query":{"bool":{"should":[{"query_string":{"query":"lorem ipsum","lenient":true,"fields":["fingerprints.text^3","text"],"default_operator":"AND","minimum_should_match":"66%"}}],"must":[],"must_not":[],"filter":[],"minimum_should_match":1}},"post_filter":{"bool":{"filter":[]}},"from":0,"size":30,"aggregations":{},"sort":[],"highlight":{"encoder":"html","fields":{"properties.*":{"highlight_query":{"query_string":{"query":"lorem ipsum","lenient":true,"fields":["properties.*"],"default_operator":"AND","minimum_should_match":"66%"}},"require_field_match":false,"number_of_fragments":3,"fragment_size":120,"max_analyzed_offset":999999}}},"_source":{"includes":["schema","properties","collection_id","profile_id","role_id","mutable","created_at","updated_at"]}}'
-> No highlights are returned for "lorem ipsum", only for "lorem".
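For anyone who wants to script the comparison instead of editing the curl command by hand, here is a minimal Python sketch that rebuilds the request body from the curl command above for an arbitrary search string. The function name build_highlight_request is made up for illustration and is not part of Aleph; the empty clauses, aggregations, sort and _source includes are left out for brevity. The printed JSON can be passed to curl via -d.

```python
import json


def build_highlight_request(query_text):
    """Return a search body shaped like the one in the curl command above.

    Hypothetical helper for debugging, not part of the Aleph codebase.
    """
    return {
        "query": {
            "bool": {
                "should": [
                    {
                        "query_string": {
                            "query": query_text,
                            "lenient": True,
                            "fields": ["fingerprints.text^3", "text"],
                            "default_operator": "AND",
                            "minimum_should_match": "66%",
                        }
                    }
                ],
                "minimum_should_match": 1,
            }
        },
        "from": 0,
        "size": 30,
        "highlight": {
            "encoder": "html",
            "fields": {
                "properties.*": {
                    # Same query string, but restricted to properties.*
                    "highlight_query": {
                        "query_string": {
                            "query": query_text,
                            "lenient": True,
                            "fields": ["properties.*"],
                            "default_operator": "AND",
                            "minimum_should_match": "66%",
                        }
                    },
                    "require_field_match": False,
                    "number_of_fragments": 3,
                    "fragment_size": 120,
                    "max_analyzed_offset": 999999,
                }
            },
        },
    }


if __name__ == "__main__":
    # Print one body per search string so both cases can be compared.
    for text in ("lorem", "lorem ipsum"):
        print(json.dumps(build_highlight_request(text)))
```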
Expected behavior When searching for "lorem ipsum", the two tokens "lorem ipsum" should be highlighted.
Aleph version
This happens (locally) on the current main and develop heads. But it's really strange: on the OCCRP instance, for example, I tested it and it behaves as expected. On one production instance I am running it works, on another it does not. I am completely lost trying to track this down. The instances I checked are running ingest-file 3.19.1, except the OCCRP one, which is running 3.19.2-rc1. I checked that version locally as well, same bug. I guess this has nothing to do with ingest-file but with aleph.index, but I cannot see any relevant changes there within the last months that could affect this.
Screenshots
Searching for "lorem ipsum" on current main:
Searching for "lorem ipsum" on OCCRP aleph instance:
Additional context I am completely lost trying to track this down. So my first question would be: can anyone reproduce this (locally)?
The only difference I could spot between machines with and without this bug is the underlying host system:
- Bug: Linux 6.1.0-10-amd64
- No bug: Linux 5.10.0-23-amd64
But I thought that's what docker is for...
Playing around with this, I found that removing the:
  "fields": [
    "properties.*"
  ],
from:
"highlight": {
  "encoder": "html",
  "fields": {
    "properties.*": {
      "highlight_query": {
        "query_string": {
          "query": "lorem ipsum",
          "lenient": true,
          "fields": [
            "properties.*"
          ],
          "default_operator": "AND",
          "minimum_should_match": "66%"
        }
      },
      "require_field_match": false,
      "number_of_fragments": 3,
      "fragment_size": 120,
      "max_analyzed_offset": 999999
    }
  }
},
will return highlights for both lorem & lorem ipsum. I'm not sure why.
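To try this workaround against a captured request body without hand-editing the JSON, something like the following Python sketch could be used. The helper name drop_highlight_query_fields is hypothetical and not taken from the Aleph codebase; it only removes the "fields" list inside each highlight_query and leaves the rest of the body untouched.

```python
import copy


def drop_highlight_query_fields(body):
    """Return a copy of a search body without "fields" in any highlight_query.

    Hypothetical debugging helper; the input body is not modified.
    """
    patched = copy.deepcopy(body)
    for spec in patched.get("highlight", {}).get("fields", {}).values():
        query_string = spec.get("highlight_query", {}).get("query_string", {})
        # The field is already named as the key of highlight.fields,
        # so only the inner restriction is dropped here.
        query_string.pop("fields", None)
    return patched


# The highlight section quoted above, as a Python dict.
original = {
    "highlight": {
        "encoder": "html",
        "fields": {
            "properties.*": {
                "highlight_query": {
                    "query_string": {
                        "query": "lorem ipsum",
                        "lenient": True,
                        "fields": ["properties.*"],
                        "default_operator": "AND",
                        "minimum_should_match": "66%",
                    }
                },
                "require_field_match": False,
                "number_of_fragments": 3,
                "fragment_size": 120,
                "max_analyzed_offset": 999999,
            }
        },
    }
}

patched = drop_highlight_query_fields(original)
```

Sending both the original and the patched body to the same index makes it easy to see whether the inner "fields" list is the only thing that changes the highlight output.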
Thanks for debugging with me. So you can confirm you have this same behaviour? It is super weird, as I cannot spot any relevant changes in the codebase that could produce this bug. And indeed, removing the field array within the highlight query would display the expected highlights...
So you can confirm you have this same behaviour?
Yes, here I also only got highlights for lorem and not for lorem ipsum.
And indeed, removing the field array within the highlight query would display the expected highlights...
Yeah, as the field is already specified in the fields dict, I don't see a reason for including it here again. Why would it be specified here again?
for this specific issue, @Rosencrantz would you like this (and only exactly this ;)) commit as a PR: https://github.com/investigativedata/aleph/commit/c1546c99411f3f318bcca53bb62d012b8bfc3ab8
The weird thing is, apparently we cannot really say why this bug exists and why this one-liner would fix it, so I am hesitant to just contribute a fix without knowing more about it 🙈
So you can confirm you have this same behaviour?
Yes, here I also only got highlights for lorem and not for lorem ipsum.
On what host system are you running? As described above, I noticed a difference in the Linux kernel on machines that have this behaviour, but again, it would be totally crazy if this is related to the host system...
So. We have an issue that we think we can fix, but we are not sure why this change fixes the issue. By extension I assume we also don't know what other ramifications may exist by introducing this change.
I think the best course of action would be to open a PR to develop and then investigate the implications of this change in staging and then on aleph.occrp.org. This should shake out any potential issues and allow us to make an informed decision on whether to keep or discard the change.
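One possible way to check the implications is to run the same search with and without the change and compare the highlight fragments per document. The Python sketch below assumes the standard Elasticsearch response layout (hits.hits[] entries with an optional highlight object mapping field names to fragment lists). The function names and the sample responses are made up for illustration and are not real output from any instance.

```python
def collect_highlights(response):
    """Map each hit's _id to a flat list of its highlight fragments."""
    result = {}
    for hit in response.get("hits", {}).get("hits", []):
        fragments = []
        for field_fragments in hit.get("highlight", {}).values():
            fragments.extend(field_fragments)
        result[hit["_id"]] = fragments
    return result


def diff_highlights(before, after):
    """Return {doc_id: (before_fragments, after_fragments)} where they differ."""
    old = collect_highlights(before)
    new = collect_highlights(after)
    changed = {}
    for doc_id in sorted(set(old) | set(new)):
        pair = (old.get(doc_id, []), new.get(doc_id, []))
        if pair[0] != pair[1]:
            changed[doc_id] = pair
    return changed


# Hypothetical sample responses, for illustration only.
response_before = {"hits": {"hits": [{"_id": "doc-1"}]}}
response_after = {
    "hits": {
        "hits": [
            {
                "_id": "doc-1",
                "highlight": {
                    "properties.bodyText": ["<em>Lorem</em> <em>ipsum</em> dolor"]
                },
            }
        ]
    }
}

changes = diff_highlights(response_before, response_after)
```

An empty result for a set of representative queries would suggest the change does not alter highlights that already work; any non-empty entries are the cases worth looking at by hand.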
@simonwoerpel Can you open a PR to develop?
https://github.com/alephdata/aleph/pull/3278