Jan Niestadt
Start (and end) position would also be useful; the frontend could use this, for example, to jump to a specific hit in a fragment (@KCMertens in INL/corpus-frontend#527).
I've made a [small start](https://github.com/INL/BlackLab/tree/experiment/highlight-hitnumbers) with this, but complex nesting makes this nontrivial.
After making sure BlackLab40Codec uses Lucene87 as the delegate codec (previously it requested the default codec from Lucene, which is obviously different in Lucene 9), we still have a problem...
Everything seems to work now. There's a new codec (BlackLab50Codec) that is essentially identical to the previous one, except it's based on Lucene99 instead of Lucene87, so you can write indexes...
I've created a new branch `jakarta` where we will continue developing this version. Closing this PR, opening a new one.
Ah, hadn't realized that, nice. Documenting that (if it isn't already) is probably enough then.
It might even be better to freeze the application while the export happens, otherwise the user might think it's okay to continue using the application and accidentally cancel the download.
Attempted fix in 981b8a5de, but this causes a StackOverflowError in the regex evaluation (even though a small-scale test succeeds). Possibly the document is too large to handle this way. See https://stackoverflow.com/a/7510006
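For reference, a minimal sketch of why a small test can pass while a large document fails. It assumes the cause is the commonly reported one: `java.util.regex` matches a repeated group containing an alternation recursively, so stack depth grows with input length. The patterns and the class name `RegexStackDepth` are hypothetical stand-ins, not the expressions from the commit above.

```java
import java.util.regex.Pattern;

public class RegexStackDepth {
    // Hypothetical patterns for illustration only; not the ones BlackLab uses.
    // A repeated group with an alternation is typically matched recursively.
    public static final Pattern GROUPED = Pattern.compile("(?:a|b)*");
    // An equivalent character class is matched in a loop, without deep recursion.
    public static final Pattern CHAR_CLASS = Pattern.compile("[ab]*");

    // Build a string of n copies of c (avoids String.repeat, which needs Java 11+).
    public static String repeat(char c, int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append(c);
        }
        return sb.toString();
    }

    // Returns true if matching finished, false if it blew the stack.
    public static boolean completes(Pattern p, String input) {
        try {
            p.matcher(input).matches();
            return true;
        } catch (StackOverflowError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String small = repeat('a', 100);
        String large = repeat('a', 1_000_000);
        System.out.println("grouped, small:    " + completes(GROUPED, small));
        // May report false, depending on the JVM version and thread stack size (-Xss).
        System.out.println("grouped, large:    " + completes(GROUPED, large));
        System.out.println("char class, large: " + completes(CHAR_CLASS, large));
    }
}
```

If this is indeed what happens here, rewriting the repeated alternation as a character class or using possessive quantifiers might avoid the overflow; raising the stack size would only move the limit.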
It might be time to consider rewriting highlighting of document fragments using something like https://jsoup.org/. If that doesn't seem practical for whatever reason, another alternative is to loop through the...
@KCMertens mentioned that Saxon's parsing can be customized as well, including how to deal with unbalanced tags; maybe this could be a good solution.