corpus-frontend

Multivalued annotation displaying/sorting/grouping/statistics

Open JessedeDoes opened this issue 6 years ago • 2 comments

Not sure whether this has been submitted before, but, for example, only one of multiple lemmata is displayed:

(http://svotmc10.ivdnt.loc/corpus-frontend/Gysseling/search/hits?first=0&number=20&patt=%5Bword%3D%22tantwordene%22%26lemma%3D%22antwoorden%22%5D&interface=%7B%22form%22%3A%22search%22%2C%22patternMode%22%3A%22extended%22%7D)

JessedeDoes avatar Jun 13 '19 10:06 JessedeDoes

This could be tricky to do as the concordances are created from the forward index, which at present only stores the first value indexed at every position.

One option is to create the concordances from the content store (this used to be how we did it, and should still be possible with the parameter usecontent=orig). This basically returns part of the original XML, so all the values should be in there. But getting this to work with corpus-frontend might be a challenge, as the concordances would have a project-specific XML structure.
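As a rough illustration of the content-store option, the sketch below builds a hits request that asks for concordances from the original content. Only the `usecontent=orig` parameter and the pattern come from this thread; the base URL, the corpus name used as a path segment, and the `buildHitsUrl` helper are assumptions made for the example.

```typescript
// Hypothetical helper: builds a hits URL that requests concordances from the
// content store instead of the forward index (usecontent=orig).
function buildHitsUrl(baseUrl: string, corpus: string, patt: string): string {
  const params = new URLSearchParams({
    patt,               // the CQL pattern to search for
    usecontent: "orig", // ask for snippets of the original XML
  });
  return `${baseUrl}/${encodeURIComponent(corpus)}/hits?${params.toString()}`;
}

// Assumed server location and corpus name; adjust to the actual deployment.
const url = buildHitsUrl(
  "http://localhost:8080/blacklab-server",
  "Gysseling",
  '[word="tantwordene"&lemma="antwoorden"]',
);
console.log(url);
```

The response would then contain project-specific XML, which is exactly the part the frontend would need to learn how to render.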

jan-niestadt avatar Jun 13 '19 12:06 jan-niestadt

There are several related problems:

  • While searching on multivalued annotations works fine, we cannot display more than one value (due to the aforementioned forward-index issue).
  • Because the forward index only stores the first value, grouping and sorting will not work on the second and subsequent values of a token.
  • Additionally: we must decide whether we want to include a token with multiple values in multiple groups, or not. There is also ambiguity in counting such hits. When a token has lemma=a and lemma=b, does the query lemma=a|b produce one hit or two?
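To make the grouping and counting ambiguity concrete, here is a toy model, not the actual indexing code. The `Token` type and both functions are invented for illustration: one mimics a forward index that keeps only the first value, the other places a token in every group it has a value for.

```typescript
// Toy model of a token with a multivalued lemma annotation.
type Token = { word: string; lemmas: string[] };

// Policy A: only the first value per position is visible (forward-index-like).
function groupByFirstValue(tokens: Token[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const t of tokens) {
    const key = t.lemmas[0];
    if (key === undefined) continue; // token without a lemma
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

// Policy B: a token is counted once in each group it has a value for.
function groupByEveryValue(tokens: Token[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const t of tokens) {
    for (const key of Array.from(new Set(t.lemmas))) {
      counts.set(key, (counts.get(key) ?? 0) + 1);
    }
  }
  return counts;
}

const tokens: Token[] = [
  { word: "x", lemmas: ["a", "b"] }, // lemma=a and lemma=b on one token
  { word: "y", lemmas: ["b"] },
];

const byFirst = groupByFirstValue(tokens);
const byEvery = groupByEveryValue(tokens);
console.log(Array.from(byFirst.entries()), Array.from(byEvery.entries()));
```

With these two tokens, policy A yields a: 1, b: 1 (the second lemma of the first token is invisible), while policy B yields a: 1, b: 2, so the group sizes add up to three for only two tokens. That is the same question as whether lemma=a|b on the first token counts as one hit or two.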

KCMertens avatar Oct 09 '19 15:10 KCMertens

The common workaround we landed on is to index the value twice, in two different annotations:

  • once tokenized
  • once concatenated

Then:

  • give both annotations the same name
  • configure the frontend using custom js
    • display the concatenated version in the results (using concordanceAnnotationId)
    • use the tokenized one for searching (setting searchAnnotationId)
    • use the concatenated one for grouping and sorting operations, hide the other one from those options
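The idea behind the two annotations can be sketched as follows. This is only one reading of "tokenized" versus "concatenated": the `deriveAnnotations` helper, the `|` separator, and the shape of the result are assumptions for illustration, not the actual indexing configuration.

```typescript
// Hypothetical: derive both indexed forms from one raw multivalued string.
interface DerivedAnnotations {
  tokenized: string[];  // separate values, so a search for any single value matches
  concatenated: string; // one combined value, used for display, grouping and sorting
}

function deriveAnnotations(raw: string, separator = "|"): DerivedAnnotations {
  const values = raw
    .split(separator)
    .map((v) => v.trim())
    .filter((v) => v.length > 0);
  return { tokenized: values, concatenated: values.join(separator) };
}

const derived = deriveAnnotations("a | b");
console.log(derived.tokenized, derived.concatenated);
```

Because the concatenated form is a single value per token, it survives the first-value-only limitation of the forward index, and each token lands in exactly one group.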

KCMertens avatar Apr 07 '25 14:04 KCMertens