BUG: Highlights not working for more than 1 token, but I am completely lost here

Open simonwoerpel opened this issue 1 year ago • 8 comments

Describe the bug When searching for "lorem", in the highlight context preview the token "lorem" is highlighted. When searching for "lorem ipsum", nothing is displayed.

To Reproduce Steps to reproduce the behavior:

  1. Check out the current main or develop branch (both affected) on a local machine
  2. Expose port 9200 from Elasticsearch to localhost for debugging
  3. Don't change anything else, just start a fresh aleph with docker compose up
  4. Create a new investigation
  5. Upload a text document highlight.txt with this content: Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.
  6. In the Aleph UI, search for "lorem ipsum" vs. "lorem" (refresh the browser between searches as highlights might be cached in the UI)
  7. Check the highlight query Aleph is producing against Elasticsearch:
curl -X GET "localhost:9200/aleph-entity-plaintext-v1/_search?pretty" -H 'Content-Type: application/json' -d'{"query":{"bool":{"should":[{"query_string":{"query":"lorem ipsum","lenient":true,"fields":["fingerprints.text^3","text"],"default_operator":"AND","minimum_should_match":"66%"}}],"must":[],"must_not":[],"filter":[],"minimum_should_match":1}},"post_filter":{"bool":{"filter":[]}},"from":0,"size":30,"aggregations":{},"sort":[],"highlight":{"encoder":"html","fields":{"properties.*":{"highlight_query":{"query_string":{"query":"lorem ipsum","lenient":true,"fields":["properties.*"],"default_operator":"AND","minimum_should_match":"66%"}},"require_field_match":false,"number_of_fragments":3,"fragment_size":120,"max_analyzed_offset":999999}}},"_source":{"includes":["schema","properties","collection_id","profile_id","role_id","mutable","created_at","updated_at"]}}'

-> no highlights for "lorem ipsum", only for "lorem".
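To compare the two searches without retyping the curl command, the request from step 7 can be sketched in Python. This is a minimal sketch, not Aleph's own code: `build_body`, `count_highlighted` and `search` are hypothetical helper names, the body is a trimmed version of the one above (empty clauses and most `_source` fields dropped), and it sends a POST, which the `_search` endpoint accepts as well as GET. It assumes Elasticsearch is reachable on localhost:9200 as in step 2.

```python
import json
import urllib.request

# Assumption: Elasticsearch is exposed on localhost:9200 as in step 2.
ES_URL = "http://localhost:9200/aleph-entity-plaintext-v1/_search"


def build_body(text):
    """Return a trimmed version of the search body from step 7 for `text`."""
    # Options shared by the main query and the highlight query.
    common = {
        "query": text,
        "lenient": True,
        "default_operator": "AND",
        "minimum_should_match": "66%",
    }
    return {
        "query": {
            "bool": {
                "should": [
                    {"query_string": {**common, "fields": ["fingerprints.text^3", "text"]}}
                ],
                "minimum_should_match": 1,
            }
        },
        "from": 0,
        "size": 30,
        "highlight": {
            "encoder": "html",
            "fields": {
                "properties.*": {
                    "highlight_query": {
                        "query_string": {**common, "fields": ["properties.*"]}
                    },
                    "require_field_match": False,
                    "number_of_fragments": 3,
                    "fragment_size": 120,
                    "max_analyzed_offset": 999999,
                }
            },
        },
        "_source": {"includes": ["schema", "properties"]},
    }


def count_highlighted(response):
    """Count hits in a search response that carry a non-empty `highlight`."""
    hits = response.get("hits", {}).get("hits", [])
    return sum(1 for hit in hits if hit.get("highlight"))


def search(text, url=ES_URL):
    """Send the body to Elasticsearch and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_body(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

With the instance running, comparing `count_highlighted(search("lorem"))` against `count_highlighted(search("lorem ipsum"))` should show whether a given setup is affected.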

Expected behavior When searching for "lorem ipsum", the two tokens "lorem ipsum" should be highlighted.

Aleph version This happens (locally) on the current main and develop heads. Strangely, the OCCRP instance behaved as expected when I tested it. Of the production instances I run, one works and another does not. I am completely lost trying to track this down. The instances I checked run ingest-file 3.19.1, except the OCCRP one, which runs 3.19.2-rc1; I checked that version locally as well and got the same bug. I guess this has nothing to do with ingest-file but with aleph.index, but I cannot see any relevant changes there within the last months that could affect this.

Screenshots Searching for "lorem ipsum" on current main: Screen Shot 2023-08-17 at 10 54 55 Searching for "lorem ipsum" on OCCRP aleph instance: Screen Shot 2023-08-17 at 10 50 20

Additional context I am completely lost trying to track this down. So my first question would be: can anyone reproduce this (locally)?

simonwoerpel avatar Aug 17 '23 09:08 simonwoerpel

The only difference I could spot between machines with and without this bug is the underlying host system.

Bug: Linux 6.1.0-10-amd64
No bug: Linux 5.10.0-23-amd64

But I thought that's what Docker is for...

simonwoerpel avatar Aug 17 '23 09:08 simonwoerpel

Playing around with this, I found that removing the:

            "fields": [
              "properties.*"
            ],

from:

  "highlight": {
    "encoder": "html",
    "fields": {
      "properties.*": {
        "highlight_query": {
          "query_string": {
            "query": "lorem ipsum",
            "lenient": true,
            "fields": [
              "properties.*"
            ],
            "default_operator": "AND",
            "minimum_should_match": "66%"
          }
        },
        "require_field_match": false,
        "number_of_fragments": 3,
        "fragment_size": 120,
        "max_analyzed_offset": 999999
      }
    }
  },

Will return highlights for both lorem & lorem ipsum. I'm not sure why.
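The difference between the two variants can be sketched as a small Python helper. This is only an illustration under stated assumptions: `highlight_clause` and its `restrict_fields` flag are hypothetical names, not the actual code in aleph.index, and the sketch does not explain why dropping the inner `fields` list changes the result.

```python
def highlight_clause(text, restrict_fields=True):
    """Build the `highlight` section of the search body for `text`.

    With restrict_fields=True the highlight query repeats the
    "properties.*" field list (the variant that shows the bug in this
    thread); with False the inner list is left out.
    """
    query_string = {
        "query": text,
        "lenient": True,
        "default_operator": "AND",
        "minimum_should_match": "66%",
    }
    if restrict_fields:
        query_string["fields"] = ["properties.*"]
    return {
        "encoder": "html",
        "fields": {
            # The highlighted fields are still selected by this outer key.
            "properties.*": {
                "highlight_query": {"query_string": query_string},
                "require_field_match": False,
                "number_of_fragments": 3,
                "fragment_size": 120,
                "max_analyzed_offset": 999999,
            }
        },
    }


with_fields = highlight_clause("lorem ipsum")
without_fields = highlight_clause("lorem ipsum", restrict_fields=False)
```

Both variants still target `properties.*` through the outer `fields` key; only the field list inside `highlight_query` differs.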

monneyboi avatar Aug 23 '23 13:08 monneyboi

Thanks for debugging with me. So you can confirm you have this same behaviour? It is super weird, as I cannot spot any relevant changes in the codebase that could produce this bug. And indeed, removing the field array within the highlight query would display the expected highlights...

simonwoerpel avatar Aug 28 '23 08:08 simonwoerpel

So you can confirm you have this same behaviour?

Yes, here I also only got highlights for lorem and not for lorem ipsum.

And indeed, removing the field array within the highlight query would display the expected highlights...

Yeah, as the field is already specified in the fields dict, I don't see a reason for including it here again. Why would it be specified here again?

monneyboi avatar Aug 28 '23 12:08 monneyboi

For this specific issue, @Rosencrantz would you like this (and only exactly this ;)) commit as a PR: https://github.com/investigativedata/aleph/commit/c1546c99411f3f318bcca53bb62d012b8bfc3ab8

The weird thing is, apparently we cannot really say why this bug exists and why this one-liner would fix it, so I am hesitating on just contributing a fix without knowing too much about it 🙈

simonwoerpel avatar Aug 29 '23 08:08 simonwoerpel

So you can confirm you have this same behaviour?

Yes, here I also only got highlights for lorem and not for lorem ipsum.

On what host system are you running? As described above, I noticed a difference in the Linux kernel on machines that have this behaviour, but again, it would be totally crazy if this were related to the host system...

simonwoerpel avatar Aug 29 '23 08:08 simonwoerpel

For this specific issue, @Rosencrantz would you like this (and only exactly this ;)) commit as a PR: investigativedata@c1546c9

The weird thing is, apparently we cannot really say why this bug exists and why this one-liner would fix it, so I am hesitating on just contributing a fix without knowing too much about it 🙈

So. We have an issue that we think we can fix, but we are not sure why this change fixes the issue. By extension, I assume we also don't know what other ramifications introducing this change may have.

I think the best course of action would be to open a PR to develop and then investigate the implications of this change in staging and then on aleph.occrp.org. This should shake out any potential issues and allow us to make an informed decision on whether to keep or discard the change.

@simonwoerpel Can you open a PR to develop?

Rosencrantz avatar Aug 29 '23 08:08 Rosencrantz

https://github.com/alephdata/aleph/pull/3278

simonwoerpel avatar Aug 29 '23 12:08 simonwoerpel