kitodo-presentation [FUND] Update and improve Solr compatibility

Description

On high-traffic installations with many full-text documents (>200,000), the performance of the Solr index degrades. The cause is continuous indexing of new documents in parallel with heavy search usage. This affects not only the search plugin but also the collection and OAI plugins.

Some research has already been done and tasks are identified in #454.

The goal of this proposal is to update all Solr-related code and configuration in order to use the newest version of Apache Solr and make installation and configuration as easy and well-documented as possible.

Expected benefits of this development

Clients using Kitodo.Presentation
- speed up full-text search
- speed up collection view (multiple collections)
Administrators
- faster indexing
- faster re-indexing of huge collections or "all"
- simplify update and migration of Solr instances
- possibly lower requirements for RAM and/or CPU resources

Estimated Costs and Complexity

This issue has high complexity and medium cost.

Related Issues

#454
#396
#825
#860

Feb 18 '21 19:02 albig

Reintroducing this for the development fund 2023. The issue has recently become more urgent, both because new features regularly require reindexing all documents and because current versions of Solr no longer merely deprecate index-time boosting but have dropped support for it entirely. This effectively prevents us from using an up-to-date Solr version.

Feb 06 '23 09:02 sebastian-meyer

Votes: 12

Mar 20 '23 13:03 sebastian-meyer

We at the SUB Hamburg have already made several adjustments to our Solr instances (configurations, schemata and some tweaks in the Kitodo.Presentation sources). We are currently running Solr 8.11.1 on our live systems and experimenting with Solr 9.1.1 on our dev systems (which works fine, by the way, after some smaller adjustments). Some of those changes and insights resulted in PRs improving search; others are still in the works.

If this topic gets further traction, I would happily offer my help and would like to join the discussion. For me it's an important topic for improving overall performance (indexing, retrieval and maintenance). And of course some things we are working on could be negatively affected if development on the Solr side took unexpected turns.

Mar 20 '23 14:03 michaelkubina

Hello Uli, as promised I am sharing the analyzer chain that we applied to the standard field type and to the text_ocr field type. All filters (except for the OCR highlighting filter) are part of Solr itself (as you already know, they are documented here: https://solr.apache.org/guide/solr/latest/indexing-guide/filters.html ). In the Solr admin UI you can inspect how the filters get applied by using the "Analysis" tab that you find within your core.

Foremost: we have added a _version_ field to the index. It keeps track of document versions, which does no harm, and it enables partial document updates such as atomic updates (https://solr.apache.org/guide/solr/latest/indexing-guide/partial-document-updates.html#atomic-updates). This is useful when only small corrections should be applied to indexed documents (like changing a URL prefix in a specific field), instead of re-indexing a whole document with all its full texts and logical structure. I believe it's useful to have this option, even if one does not make use of it. Without the _version_ field, partial document updates won't be possible.
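
To illustrate the point above, here is a minimal sketch of what the schema entry and an atomic update message could look like. The field type name plong follows Solr's default configset and may be named differently in the Kitodo.Presentation schema; the document id and the field name example_url are purely hypothetical. Note also that, according to the Solr documentation, atomic updates only work if the fields of the document are stored or have docValues (copyField destinations excepted), which may not hold for every field in an existing schema.

```xml
<!-- Schema: the version field as it appears in Solr's default configset.
     Assumes a fieldType named "plong" (solr.LongPointField) exists in the schema. -->
<field name="_version_" type="plong" indexed="false" stored="false"/>

<!-- Update message in Solr's XML update format (sent to the core's /update handler).
     "example-document-1" and "example_url" are hypothetical names for illustration.
     update="set" replaces only this field and leaves the rest of the document as is. -->
<add>
	<doc>
		<field name="id">example-document-1</field>
		<field name="example_url" update="set">https://example.org/new-prefix/object/1</field>
	</doc>
</add>
```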

The sorting-related change of the field type for *_sorting is already in place due to a past commit to this branch, so no changes are needed here. Just remember that a new Solr core must be indexed for it to work; otherwise one will run into exceptions.

The standard field type has seen several changes, most notably:

We apply an ICU transformation (https://icu4c-demos.unicode.org/icu-bin/translit), so that we can potentially match tokens in other scripts as well. A search for "platon" would also find "Πλάτων" or "Платон" in Greek and Cyrillic, and vice versa, because the transliteration into Latin yields the same string representation. Of course this does not work in all cases...
We have also applied a keyword-repeat-filter, which duplicates each token and sets a keyword flag on one copy, meaning that the flagged copy does not get stemmed but keeps the original position increment. When searching, an exact match gets ranked higher than a stemmed token that matches as well. Since we keep the position increment, this works in phrase search too and ranks exactly matching phrases higher. The remove-duplicates-token-filter removes remaining duplicate tokens afterwards.
We removed the stop-word-filter, because we had several valid complaints about it.
The word-delimiter-graph-filter now also catenates hyphenated words, so that a word like "universitäts-bibliothek" gets indexed as "universitätsbibliothek", "universitäts" and "bibliothek". Previously we only got "universitäts" and "bibliothek", for example.
The documentation states that a flatten-graph-filter should be applied after the word-delimiter-graph-filter (only at index time).

<fieldType name="standard" class="solr.TextField" positionIncrementGap="100">
	<analyzer type="index">
		<!-- michaelkubina: tokenize at whitespace -->
		<tokenizer class="solr.WhitespaceTokenizerFactory"/>
		<!-- michaelkubina: transliterate according to ICU Transformation to latin script -->
		<filter class="solr.ICUTransformFilterFactory" id="Any-Latin"/>
		<!-- michaelkubina: apply ICU Folding on latin script (basically like ascii folding) -->
		<filter name="icuFolding"/>
		<!-- michaelkubina: lowercase tokens as soon as possible -->
		<filter class="solr.LowerCaseFilterFactory"/>
		<!-- michaelkubina: catenate hyphenated words or combinations of alphanumericals; camelcase splitting won't happen due to the lowercase filter before it; removes all non-alphanumericals as well -->
		<filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" preserveOriginal="0" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
		<!-- michaelkubina: flatten word graph (only needed in the index-analyzer) -->
		<filter class="solr.FlattenGraphFilterFactory"/>
		<!-- michaelkubina: keep keywords as duplicate tokens and prevent them from getting stemmed -->
		<filter class="solr.KeywordRepeatFilterFactory"/>
		<!-- michaelkubina: do the stemming -->
		<filter class="solr.SnowballPorterFilterFactory" language="German" protected="protwords.txt"/>
		<!-- michaelkubina: remove duplicate tokens for the same position increment -->
		<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
	</analyzer>
	<analyzer type="query">
		<!-- michaelkubina: tokenize at whitespace -->
		<tokenizer class="solr.WhitespaceTokenizerFactory"/>
		<!-- michaelkubina: allow synonym-aggregation at query-time (first filter after tokenization) -->
		<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
		<!-- michaelkubina: transliterate according to ICU Transformation to latin script -->
		<filter class="solr.ICUTransformFilterFactory" id="Any-Latin"/>
		<!-- michaelkubina: apply ICU Folding on latin script (basically like ascii folding) -->
		<filter name="icuFolding"/>
		<!-- michaelkubina: lowercase tokens as soon as possible -->
		<filter class="solr.LowerCaseFilterFactory"/>
		<!-- michaelkubina: catenate hyphenated words or combinations of alphanumericals; camelcase splitting won't happen due to the lowercase filter before it; removes all non-alphanumericals as well -->
		<filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" preserveOriginal="0" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
		<!-- michaelkubina: keep keywords as duplicate tokens and prevent them from getting stemmed -->
		<filter class="solr.KeywordRepeatFilterFactory"/>
		<!-- michaelkubina: do the stemming -->
		<filter class="solr.SnowballPorterFilterFactory" language="German" protected="protwords.txt"/>
		<!-- michaelkubina: remove duplicate tokens for the same position increment -->
		<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
	</analyzer>
</fieldType>

The text_ocr field type is very similar, but has some additional filters:

The optional synonym filter is still in place, in case you want to apply synonym enrichment to the query. It is not required at index time, because that would unnecessarily and harmfully bloat your index.
The hyphenated-words-filter is pretty clever, because it can catenate two word parts with a position increment of 1 into one word when the first token ends with a hyphen. If the OCR has not correctly detected a hyphenated word and represented it accordingly in the ALTO, this filter can catenate it. The tokens "Wider-" "stand" will be indexed as "Widerstand" as well.
The reversed-wildcard-filter is useful (it speeds up retrieval) when a search with a leading wildcard is triggered. For example, *brot will find Schwarzbrot and Steinofenbrot faster than if the filter were not in place.

<fieldType name="text_ocr" class="solr.TextField" storeOffsetsWithPositions="true" termVectors="true">
	<analyzer type="index">
		<!-- michaelkubina: account for some ocr-engines escaping html characters -->
		<charFilter class="solr.HTMLStripCharFilterFactory"/>
		<!-- michaelkubina: tokenize at whitespace -->
		<tokenizer class="solr.WhitespaceTokenizerFactory"/>
		<!-- michaelkubina: transliterate according to ICU Transformation to latin script -->
		<filter class="solr.ICUTransformFilterFactory" id="Any-Latin"/>
		<!-- michaelkubina: apply ICU Folding on latin script (basically like ascii folding) -->
		<filter name="icuFolding"/>
		<!-- michaelkubina: lowercase tokens as soon as possible -->
		<filter class="solr.LowerCaseFilterFactory"/>
		<!-- michaelkubina: compound tokens if hyphen at the end of one token suggests it being part of a compound word with the then following token -->
		<filter class="solr.HyphenatedWordsFilterFactory"/>
		<!-- michaelkubina: catenate hyphenated words or combinations of alphanumericals; camelcase splitting won't happen due to the lowercase filter before it; removes all non-alphanumericals as well -->
		<filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" preserveOriginal="0" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
		<!-- michaelkubina: flatten word graph -->
		<filter class="solr.FlattenGraphFilterFactory"/>
		<!-- michaelkubina: keep keywords as duplicate tokens and prevent them from getting stemmed -->
		<filter class="solr.KeywordRepeatFilterFactory"/>
		<!-- michaelkubina: remove any trailing or leading whitespaces from tokens, if it happened for any reason -->
		<filter class="solr.TrimFilterFactory"/>
		<!-- michaelkubina: do the stemming -->
		<filter class="solr.SnowballPorterFilterFactory" language="German" protected="protwords.txt"/>
		<!-- michaelkubina: reverse all tokens, so that they can be found faster in a reverse wildcard search (only needed at index-time) -->
		<filter class="solr.ReversedWildcardFilterFactory" withOriginal="true" maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
		<!-- michaelkubina: remove duplicate tokens for the same position increment -->
		<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
	</analyzer>
	<analyzer type="query">
		<!-- michaelkubina: tokenize at whitespace -->
		<tokenizer class="solr.WhitespaceTokenizerFactory"/>
		<!-- michaelkubina: transliterate according to ICU Transformation to latin script -->
		<filter class="solr.ICUTransformFilterFactory" id="Any-Latin"/>
		<!-- michaelkubina: apply ICU Folding on latin script (basically like ascii folding) -->
		<filter name="icuFolding"/>
		<!-- michaelkubina: lowercase tokens as soon as possible -->
		<filter class="solr.LowerCaseFilterFactory"/>
		<!-- michaelkubina: allow synonym-aggregation at query-time -->
		<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
		<!-- michaelkubina: catenate hyphenated words or combinations of alphanumericals; camelcase splitting won't happen due to the lowercase filter before it; removes all non-alphanumericals as well -->
		<filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" preserveOriginal="0" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
		<!-- michaelkubina: keep keywords as duplicate tokens and prevent them from getting stemmed -->
		<filter class="solr.KeywordRepeatFilterFactory"/>
		<!-- michaelkubina: remove any trailing or leading whitespaces from tokens, if it happened for any reason -->
		<filter class="solr.TrimFilterFactory"/>
		<!-- michaelkubina: do the stemming -->
		<filter class="solr.SnowballPorterFilterFactory" language="German" protected="protwords.txt"/>
		<!-- michaelkubina: remove duplicate tokens for the same position increment -->
		<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
	</analyzer>
</fieldType>

Nov 10 '23 09:11 michaelkubina

Part 2:

The solrconfig.xml has some changes as well, though not as complex as those in the schema, so there is likely more room for optimization here. As you have already realized, the plugins are now in the modules folder, not contrib. And the Velocity browser has been removed, so there is no need to keep it in place.

We changed the luceneMatchVersion: <luceneMatchVersion>9.3.0</luceneMatchVersion>
We need to add a specific module path for the ICU filters to work, and of course for the newest OCR highlighting module from https://github.com/dbmdz/solr-ocrhighlighting:

	<!-- michaelkubina: required for the integration of ICU Transformation and ICU Folding -->
	<lib dir="${solr.install.dir:../../../..}/modules/analysis-extras/lib/" regex=".*\.jar"/>
	<!-- michaelkubina: required for proper ocrHighlighting in kitodo.presentation -->
	<lib dir="${solr.install.dir:../../../..}/modules/ocrhighlighting/lib/" regex=".*\.jar"/>

Partial document updates require the update log to be in place; otherwise the _version_ field won't work:

		<updateLog>
			<str name="dir">${solr.ulog.dir:}</str>
			<int name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}</int>
		</updateLog>

I believe you mentioned that you have fixed this warning as well: the LRUCache is deprecated and solr.CaffeineCache should be used instead.
We also had trouble with the TSTLookup (ternary search tree) in the suggest configuration, which resulted in a flood of errors after some time. We use the FSTLookup (finite state) instead, with which we haven't encountered any trouble since: <str name="lookupImpl">org.apache.solr.spelling.suggest.fst.FSTLookup</str>
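
For the cache warning mentioned above, a minimal sketch of what the replacement could look like inside the existing <query> section of solrconfig.xml. The three cache names are the ones used in Solr's default configuration; the sizes are illustrative placeholders, not tuned or recommended values, and the actual Kitodo.Presentation configuration may define different or additional caches.

```xml
<!-- Sketch only: goes inside the existing <query> element of solrconfig.xml.
     Sizes are placeholders for illustration, not recommendations. -->
<filterCache class="solr.CaffeineCache" size="512" initialSize="512" autowarmCount="0"/>
<queryResultCache class="solr.CaffeineCache" size="512" initialSize="512" autowarmCount="0"/>
<documentCache class="solr.CaffeineCache" size="512" initialSize="512" autowarmCount="0"/>
```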

If you have further questions, don't hesitate to ask.

Nov 10 '23 09:11 michaelkubina