Support future libxapian 2.0
Xapian 2.0 is almost finished and brings a few improvements of great value for libzim and Kiwix.
Even if we have to keep backward compatibility for a while, to give GNU/Linux distributions time to move smoothly to the latest version, it would be great if we could compile against libxapian 2.0 and distribute binaries based on it.
@ojwb Thank you again for your last PR. During your effort have you already identified a few things we should/could do?
The intention is that the Xapian API should be compatible except for features marked as deprecated since 1.4.0 or earlier, so supporting both 1.4.x and 2.0.x should be easy. (A rare exception is that we also removed LMWeight as the formulae were incorrectly implemented so it's not actually useful to use in 1.4.x; 2.0.x replaces it with 4 separate language-model weighting scheme classes, but it looks like you only use BM25Weight so not relevant here.)
The two changes you've now merged seem to be enough to get libzim working with Xapian 2.0.x. There are some test failures, but they seem to all be due to missing .zim files which I'm guessing are large testdata that isn't in the git repo.
There are a number of new stemmers, so if you have a list of the supported stemmers hard-coded anywhere, it should be updated (a quick grep didn't find one though).
The glass database format is compatible across 1.4.x and 2.0.x, but some of the stemmers have updates which means some words can stem differently. In most cases only a small number of words are affected and building a database with 1.4.x and searching with 2.0.x (or vice versa) will work well (but using matched versions will work a bit better). An exception though is that the Dutch stemmer has switched to a completely different algorithm. In Dutch vowels may double in the singular form (e.g. manen = moons, maan = moon), and the old algorithm undoubles vowels (so stem is man) while the new one doubles them (so stem is maan). 1.4.x's Dutch stemmer is available in 2.0.x as dutch_porter which could be used for existing Dutch databases, or to postpone a switch. I wouldn't recommend hard wiring this as the Dutch stemmer for everything forever though - the new default Dutch stemmer is enough better it seemed worth the pain of switching.
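To make the Dutch case concrete, here is a minimal sketch of how a caller might choose a stemmer name at search time. `pick_stemmer_name` is a hypothetical helper, not part of Xapian or libzim, and it assumes the application records which Xapian major version built each database (nothing in this thread says libzim does that today). The returned name would be what gets passed to the `Xapian::Stem` constructor.

```cpp
#include <string>

// Hypothetical helper (not a Xapian or libzim API): choose the stemmer name
// to use when searching, given the language and the Xapian major versions
// used for indexing and for searching.
//
// Assumption: the indexing version is known to the caller, e.g. because it
// was stored alongside the database when it was built.
std::string pick_stemmer_name(const std::string& language,
                              int index_major_version,
                              int runtime_major_version)
{
    const bool is_dutch = (language == "dutch" || language == "nl");
    const bool built_with_1x = index_major_version < 2;
    const bool running_2x = runtime_major_version >= 2;

    // A Dutch database built with 1.4.x was stemmed with the old algorithm,
    // which 2.0.x still provides under the name "dutch_porter".
    if (is_dutch && built_with_1x && running_2x) {
        return "dutch_porter";
    }
    // Everything else: use the default stemmer for the language. Small
    // differences for other languages are tolerated, as described above.
    return language;
}
```

With this, `pick_stemmer_name("dutch", 1, 2)` gives `dutch_porter`, while a Dutch database built and searched with 2.0.x keeps the new default stemmer. Keying the choice on the database rather than hard-wiring it matches the advice not to use `dutch_porter` for everything forever.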
As for new features you could make use of, these come to mind as perhaps interesting here:
- There's now support for arbitrary glob-like wildcards (single or multiple `*` and/or `?` anywhere in a term rather than just `*` at the end of a word). Not on by default.
- A new fuzzy matching operator so e.g. `foo~` expands to words within edit distance 2, `foo~3` specifies the edit distance, or `foo~0.2` as a fraction of the term length. Also not on by default.
- New support for using ICU's word-boundary finding for CJK and a few other languages (which is now an optional build-time dependency). Works in TermGenerator, QueryParser and snippet generation.
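For readers unfamiliar with the matching semantics behind the first two items, the following is a small standalone sketch, not Xapian's implementation. `glob_match` and `edit_distance` are illustrative names. The sketch works on bytes rather than Unicode characters, and it uses plain Levenshtein distance; whether Xapian's fuzzy operator treats a transposition as a single edit is not stated here, so that is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Illustrative glob matcher: '*' matches any run of bytes (including none),
// '?' matches exactly one byte, and both may appear anywhere in the pattern.
bool glob_match(const std::string& pattern, const std::string& term)
{
    std::size_t p = 0, t = 0;
    std::size_t star = std::string::npos;  // position of the last '*' seen
    std::size_t mark = 0;                  // term position when it was seen
    while (t < term.size()) {
        if (p < pattern.size() && pattern[p] == '*') {
            star = p++;   // remember the star, try matching zero bytes first
            mark = t;
        } else if (p < pattern.size() &&
                   (pattern[p] == '?' || pattern[p] == term[t])) {
            ++p;
            ++t;
        } else if (star != std::string::npos) {
            p = star + 1; // backtrack: let the last '*' absorb one more byte
            t = ++mark;
        } else {
            return false;
        }
    }
    while (p < pattern.size() && pattern[p] == '*') ++p;  // trailing stars
    return p == pattern.size();
}

// Plain Levenshtein distance (insert, delete, substitute each cost 1),
// computed with two rolling rows of the dynamic-programming table.
std::size_t edit_distance(const std::string& a, const std::string& b)
{
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({subst, prev[j] + 1, cur[j - 1] + 1});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}
```

Under these definitions the pattern `wi?i` matches both `wiki` and `wifi`, and `fool` is within edit distance 1 of `foo`, so it would fall inside the default distance of 2 described for the fuzzy operator.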
I also gave an overview of 2.0 at the start of the Xapian BoF session at Debconf earlier this year which covers some of the highlights (video | lower bitrate version).
One thing that's changed since then is we decided to postpone clustering and diversification until a 2.0.x point release (there's a bug in the diversification implementation and it seemed better to release without it than to release with a buggy version or to delay the release; diversification uses clustering behind the scenes and working on that raised some doubts about the current clustering API).