Remove Legacy Range Index
- Closes https://github.com/eXist-db/exist/issues/581
@joewiz In the past eXist would automatically index ID or IDREF node types identified by the XML parser (Xerces). eXist used the legacy range index for this. This meant that indexes were always used when calling fn:id or fn:idref.
To remove the legacy range index I have had to move index support for fn:id and fn:idref to the new range index. As the new range index is optional, if you want to index ID and IDREF type nodes then you need to now define a config for it... just like any other index. I added some new syntax for this (which you have spotted), for example:
<collection xmlns="http://exist-db.org/collection-config/1.0">
<index>
<!-- Range index for fn:id and fn:idref -->
<range>
<ID/>
<IDREF/>
</range>
</index>
</collection>
Got it - makes sense! Somehow I never paid attention to fn:idref() - probably saw it but never understood its possible applications. In TEI, we often point to elements by their @xml:id, but using a hash # prefix (to indicate the target is inside the same document): See <ref target="#ch1">Chapter 1</ref> To look up references to chapter 1 knowing its @xml:id, then, I'd just do: $doc//fn:idref('#ch1') - and if I've indexed this as IDREF, this lookup would presumably as fast or faster than: $doc//tei:ref[@target = '#ch1']? I could then switch @target from a plain xs:string range index to an IDREF. Is that the idea?
Also, will @xml:id automatically be recognized as an ID for use with fn:id() as before, or should people now use both an ID on @xml:id (for fn:id() lookups) and a plain range index (for value comparisons like //tei:div[matches(@xml:id, 'ch\d+')]?
@joewiz Without testing I can't be certain but I would not expect much difference between $doc//fn:idref('#ch1') and $doc//tei:ref[@target = '#ch1'] after all they are both range indexes.
To answer your second question, yes the value of an @xml:id should be recognized by the XML Parser as an ID. Matches on ID lookups are absolute, that is to say eq, so if you want to do value comparisons, then you will also need to define an index for that.
Sounds great! Thanks so much.
Shouldn't we pull this in? For me, the change is too complex to review it in a short time, sorry.
I tested upgrading various apps to the new range index and it is not yet as smooth as it should be. In some cases I received different results after upgrading. This is probably due to bugs and not this pull request, but before we pull, we have to test it out and fix the main roadblocks, so users can upgrade safely. I'm working on it, but it takes some time because those are complex issues.
@wolfgangmm Understood. Can you give me an idea of a time frame though, this one is blocking me from working on other things.
Ok thnx for the feedback...
I hope to have the main issues fixed until the end of the week.
@wolfgangmm Do you have test cases? If I assisted would that help speed things up or are you better left to it?
small optimization, line https://github.com/adamretter/exist/blob/feature/remove-legacy-range-index/src/org/exist/dom/persistent/ElementImpl.java#L1472 can be deleted
how critical to have broker.getIndexController().flush() and broker.flush() after?
@wolfgangmm How has the testing progressed?
@adamretter I'm about to send a pull request to fix the issue with != being processed like =. Just running the test suite right now.
I have also been able to extract a test case which demonstrates that under certain circumstances, the new range index is 10 times or more slower than the legacy range index. I'm not sure if this needs to be fixed before merging your PR, but it would be good if we could sort it out as it may hit quite a lot of users.
Should we pull this one still in? what is the status? or for 3.1 ?
@dizzzz It passes all tests and is complete. I am just waiting on @wolfgangmm's last comment, perhaps he could update us?
@wolfgangmm ? to move to 3.1 ?
@dizzzz there are 2 or 3 issues with the new range index on this bug tracker. Until those are fixed, people still need to be able to fall back to the old index.
At the moment I'm trying to close some long-standing, serious bugs, which unfortunately have already cost me several days of trial and error.
@wolfgangmm Above you mentioned:
@adamretter I'm about to send a pull request to fix the issue with != being processed like =. Just running the test suite right now.
Did that ever get sent and merged?
@adamretter the commit fixing != was this one: https://github.com/eXist-db/exist/commit/43da6f99c72315331b5b9a7f0085af2a33d4022c
In the 2018-05-07 call, @duncdrum noted:
Disable Legacy Range Index: might still be too soon, sad but true. We should wait for functional updates to indexing to get back on this, when there is a need.
@duncdrum Could you please add a note here about what "functional updates to indexing" are missing that are preventing us from merging this in? Thanks!
@joewiz rebasing this and making sure it works as expected is a lot of work. @adamretter mentioned further tweaks to indexing he has in the pipeline. Instead of working to push this out now, we agreed to revisit it when other changes to indexing necessitate the removal of the old index
@joewiz @duncdrum What I meant was that, there is no point me rebasing this unless it is going to be merged. I was under the impression that the legacy range index still does stuff that can't be done with the new range index. If that is not the case, and this could be merged, I am happy to rebase.
@adamretter You asked about the status of this issue in Slack.
In https://github.com/eXist-db/exist/issues/581#issuecomment-91004571, @wolfgangmm wrote:
There are still some functions which could not be ported to the new range index yet, including
ft:matches,ft:idandft:idref. Moving the id functions into Lucene shouldn't be difficult;fn:matchesmay be trickiest due to the limited support for regular expressions in Lucene.
(Wolfgang, you meant fn:, not ft:, right?)
I believe the fn:matches issue is still a problem. Specifically, Lucene regex patterns are anchored by default, i.e., if you provide regex pattern FOO, Lucene interprets this as ^FOO$. The old range index could handle fn:matches filters (i.e., the queries were backed by the index), but the new Lucene-based range indexes are not compatible with https://www.w3.org/TR/xmlschema-2/#regexs, so they aren't index-backed. (The most readable info I've been able to find on regex in Lucene is actually for elasticsearch; they do specify when the info is Lucene-specific, but note they're talking about Lucene 8. See https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html.)
Could one workaround be to allow the use of fn:matches to be served by the Lucene-based range index in cases when $pattern is explicitly anchored (i.e., starts with ^ and ends with $)? The documentation could note that if the $pattern isn't explicitly anchored, eXist will fall back to a search of the DOM, and thus receive no benefit from the range index?
I believe the fn:matches issue is still a problem. Specifically, Lucene regex patterns are anchored by default, i.e., if you provide regex pattern FOO, Lucene interprets this as ^FOO$
I wonder if we could not just translate the XQuery regex for fn:matches into a suitable Lucene 4 RegExp?
Could we not just automatically rewrite the regular expression from e.g. fn:matches("BARFOOBAR", "FOO") to .+FOO.+ and send that to Lucene?
@wolfgangmm reports that fn:id and fn:id-ref also use the Native Value Index, and would need to be migrated.
@wolfgangmm In last week's community call you said you would check to make sure that xml:id attributes were being automatically indexed; once we're sure this is so, if @adamretter can be assured that we do want to remove the legacy range index, he is willing to update the PR and remove it.
@wolfgangmm By chance, have you been able to check whether @xml:id attributes are being automatically indexed now in eXist 5?
The last time we discussed removing the legacy range index, that question was your key concern.
Besides that question, were there any other concerns you had about removing the legacy range index?
I'm just checking in on the status of this because it's a move that would require a new major version—so perhaps we could consider it as part of eXist 6?