Fix lucene field query issues
This PR addresses 3 problems when using Lucene fields for sorting or iterating large node sets (> 100,000 elements):
1. eXist-db quickly runs out of memory: this happens because `ft:query` retrieves the fields for the entire result set at once and keeps them attached to each item in the set, whether or not the field contents are actually accessed.
2. Because fields are retrieved at query time, users have to know in advance which fields will be required later and list them in the `ft:query` options.
3. Fixing 1 and 2 will further slow down field-based sort operations on large node sets. In general, those operations were already slow, but performance will degrade even more if we do not preload the fields.
The PR effectively fixes 1 by not preloading any fields. Instead, loading is deferred until `ft:field` is called. As a consequence, users no longer need to specify the fields to retrieve at query time, so 2 is solved as well (this doesn't change the API: you can still define fields in the `ft:query` options, though the setting is unnecessary and will be ignored).
To address 3, an additional configuration attribute is introduced for field definitions in collection.xconf: adding `binary="yes"` causes the field to be stored as a binary doc value (`BinaryDocValues`), which guarantees the fastest possible access in Lucene. This follows the example of other Lucene-based implementations (e.g. Elasticsearch), which provide a similar option. Binary fields can only be retrieved, not queried like other fields.
~~For strings and simple numeric types,~~ performance should be outstanding. ~~A bottleneck remains for complex data types though, in particular xs:date and xs:dateTime. Here the overhead of casting the stored binary field into a proper XQuery xs:date is considerable and cannot be avoided. We're still trying to find a way to address this.~~
Connected documentation PR: https://github.com/eXist-db/documentation/pull/760
The two Windows failures resemble those in #4264 (suggesting the issue isn't with these PRs per se, but that there's some existing issue with the develop branch in its interaction with the Windows environment), but the failure unique to this PR is Codacy's code smell check:
https://sonarcloud.io/project/issues?id=eXist-db_exist&pullRequest=4253&resolved=false&types=CODE_SMELL
There are 8 issues: 4 labeled critical, 2 major, and 2 minor.
I am going to need some time to review this. I have some initial concerns about the change in performance profile, though until I test it, I have no idea what to expect.
@wolfgangmm Do you have any benchmarks that you can share with us please?
Just take a large collection containing an xs:dateTime value by which you can sort entries, and define an index field on the dates. Without this PR, you'll soon hit memory limits, and sorting 200,000 entries may take > 20s. With the PR applied, memory consumption stays low and - if you define the field as "binary" - performance becomes acceptable.
Test query against all of https://github.com/HistoryAtState/frus/tree/master/volumes:
```xquery
for $div in collection('/db/frus')//tei:div[ft:query(., 'hsg-fulltext:washington')]
let $date := ft:binary-field($div, 'hsg-date-min', 'xs:dateTime')
order by $date ascending
return
    $date
```
with an index defined as follows:
```xml
<text qname="tei:div">
    <field name="hsg-date-min" if="@type = ('section', 'document')" expression="@frus:doc-dateTime-min" type="xs:dateTime" binary="true"/>
</text>
```
should return in less than a second on average instead of crashing the database, which it did before.
@wolfgangmm Thanks - so the test that you mentioned is for the improvement. I will try it out... Can you also suggest a test for the decrease in performance in other areas that you mentioned, please?
You can test the pre-PR state with the same data. Note though that the number of fields defined on the same elements has an influence (see below).
```xquery
for $div in collection('/db/frus')//tei:div[ft:query(., 'hsg-fulltext:washington', map { 'fields': 'hsg-date-min' })]
let $date := ft:field($div, 'hsg-date-min', 'xs:dateTime')
order by $date ascending
return
    $date
```
While the performance of this is a problem, the main showstopper is memory consumption. It was bearable if you used an xs:string index, but it's still likely you'll run out of memory when evaluating several queries in parallel. This is fixed by not pre-caching the fields, at the cost of a slight delay when accessing `ft:field`. The difference is very minor though; performance was bad in any case, with results clocking in after more than 20s.
So if accessing a particular field of every indexed node in a sequence with > 100,000 entries is crucial, switching to Lucene's "binary" column-stride store solves the problem. I certainly tried half a dozen other approaches and studied Lucene's source code to understand how fields are stored. The main bottleneck seems to be that Lucene normally stores fields in a sequential list, so it has to walk the list to find a particular field. If your field happens to be at the end, you're basically screwed. Access to the column-stride store does not suffer from this limitation.
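The difference between the two access patterns can be sketched with a toy Java example (this is illustrative only, not eXist or Lucene code; class, method, and field names are made up): a sequential stored-field layout must scan a per-document list of name/value pairs, while a column-stride layout holds one value per document for a single field and is a direct array access.

```java
import java.util.Arrays;
import java.util.List;

public class FieldAccessDemo {

    // Sequential stored-field layout: each document keeps its fields in an
    // ordered list, so fetching one field means scanning until the name matches.
    static String sequentialLookup(List<String[]> docFields, String name) {
        for (String[] field : docFields) {
            if (field[0].equals(name)) {
                return field[1];
            }
        }
        return null;
    }

    // Column-stride layout: one value per document for a single field,
    // so lookup by document id is a direct array access, independent of
    // how many other fields the document has.
    static String columnLookup(String[] column, int docId) {
        return column[docId];
    }

    public static void main(String[] args) {
        List<String[]> doc = Arrays.asList(
                new String[] { "title", "Volume 1" },
                new String[] { "hsg-fulltext", "..." },
                // worst case for the sequential scan: the wanted field is last
                new String[] { "hsg-date-min", "1969-01-20T00:00:00" });
        System.out.println(sequentialLookup(doc, "hsg-date-min"));

        String[] hsgDateMinColumn = { "1969-01-20T00:00:00", "1970-06-15T00:00:00" };
        System.out.println(columnLookup(hsgDateMinColumn, 1));
    }
}
```

With many fields per element, the sequential scan cost multiplies over every node in the result set, which is why the field's position in the list matters.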
As discussed in the last community call, I addressed the code smells in this PR with my recent commits. The one remaining code smell identified as critical is not semantically useful to fix, and the score is A regardless. I consider my work on this PR done.
Thanks, @wolfgangmm and @line-o for the additional info and commits.
Here are my steps, based on Wolfgang's but filling in some more detail, that demonstrate some of the performance gains.
- Download or clone the https://github.com/HistoryAtState/frus repository.
- Build the package with `ant`.
- Using develop HEAD, install the package.
- Replace `/db/system/config/db/frus/volumes/volumes.xconf` with the following:
```xml
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:tei="http://www.tei-c.org/ns/1.0" xmlns:frus="http://history.state.gov/frus/ns/1.0" xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <!-- Lucene index configuration -->
        <lucene>
            <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer">
                <!-- Specify stop words - or remove them entirely -->
                <param name="stopwords" type="org.apache.lucene.analysis.util.CharArraySet">
                    <!--<value>the</value>-->
                </param>
            </analyzer>
            <text qname="tei:div">
                <field name="hsg-fulltext" if="@type = ('section', 'document')" store="no"/>
                <field name="hsg-date-min" if="@type = ('section', 'document')" expression="@frus:doc-dateTime-min" type="xs:string"/>
            </text>
        </lucene>
    </index>
</collection>
```
- Run the following query to reindex the /db/apps/frus/volumes collection:
```xquery
xquery version "3.1";
let $start := util:system-time()
return
    (
        $start,
        xmldb:reindex("/db/apps/frus/volumes"),
        util:system-time(),
        util:system-time() - $start
    )
```
- Run the following query to count the number of results and obtain timing:
```xquery
xquery version "3.1";
declare namespace tei="http://www.tei-c.org/ns/1.0";
let $start := util:system-time()
let $results :=
    for $div in collection('/db/apps/frus/volumes')//tei:div[ft:query(., 'hsg-fulltext:washington', map { 'fields': 'hsg-date-min' })]
    let $date := ft:field($div, 'hsg-date-min', 'xs:string')
    order by $date ascending
    return
        $date
return
    (
        $start,
        util:system-time(),
        util:system-time() - $start,
        count($results)
    )
```
The duration on my system (item 3) is xs:dayTimeDuration("PT1.436S"), and the number of results is 178,926.
- After rebuilding eXist from this PR branch (I even rebased against develop for good measure), replace volumes.xconf with the following xconf, which changes the `field/@type` from xs:string to xs:dateTime and adds `field/@binary="true"`:
```xml
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:tei="http://www.tei-c.org/ns/1.0" xmlns:frus="http://history.state.gov/frus/ns/1.0" xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <!-- Lucene index configuration -->
        <lucene>
            <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer">
                <!-- Specify stop words - or remove them entirely -->
                <param name="stopwords" type="org.apache.lucene.analysis.util.CharArraySet">
                    <!--<value>the</value>-->
                </param>
            </analyzer>
            <text qname="tei:div">
                <field name="hsg-fulltext" if="@type = ('section', 'document')" store="no"/>
                <field name="hsg-date-min" if="@type = ('section', 'document')" expression="@frus:doc-dateTime-min" type="xs:dateTime" binary="true"/>
            </text>
        </lucene>
    </index>
</collection>
```
- Reindex using the script in step 6.
- Run the following script, which is a variant of the one from step 7 and switches from `ft:field` to `ft:binary-field`:
```xquery
xquery version "3.1";
declare namespace tei="http://www.tei-c.org/ns/1.0";
let $start := util:system-time()
let $results :=
    for $div in collection('/db/apps/frus/volumes')//tei:div[ft:query(., 'hsg-fulltext:washington')]
    let $date := ft:binary-field($div, 'hsg-date-min', 'xs:dateTime')
    order by $date ascending
    return
        $date
return
    (
        $start,
        util:system-time(),
        util:system-time() - $start,
        count($results)
    )
```
This returns the same number of results as in step 7, but the results come back in 0.445s compared to 1.436s, i.e. 0.99s faster.
I have not tested the query under load, but based on Wolfgang's statement above, this PR branch also performs better under load.
@wolfgangmm Alongside some minor optimisations and code cleanup, I also found some issues with missing data types and truncated type conversion. I can push these to this branch if you wish. In the first instance, I attach a series of 13 patches for your review.
If you are happy with these, I can also send some more patches to reduce the storage size of the fields...
To test these, you can apply them using `git am -3`. Cheers
@adamretter Well, this does a lot of refactoring along the way, whereas I tried to limit the PR to its narrow scope. But the changes should be good and without risk as far as I could see going through each of them, so I'm fine with having them applied.
Likewise, I did not think about using a bit encoding, as I would expect the performance win to be barely noticeable: the number of date and time fields in a collection will usually be rather small, and decoding also costs time. We can try it though, but we would then need to test again as described by Joe - just to be sure the original performance issues remain solved.
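For concreteness, a bit encoding of the kind being discussed could look something like the following Java sketch. This is purely hypothetical: the field widths, layout, and names are made up here and do not describe what any of the patches implement; it only shows the pack/unpack round trip and the decoding cost that has to be weighed against the space saved.

```java
public class DatePackingDemo {

    // Hypothetical compact encoding: pack the date-time components of a
    // positive year into a single long (year in the high bits, then
    // month: 4 bits, day: 5, hour: 5, minute: 6, second: 6).
    static long pack(int year, int month, int day, int hour, int minute, int second) {
        return ((long) year << 26)
                | ((long) month << 22)
                | ((long) day << 17)
                | ((long) hour << 12)
                | ((long) minute << 6)
                | second;
    }

    // Decoding reverses the shifts and masks each component back out.
    static int[] unpack(long packed) {
        return new int[] {
                (int) (packed >> 26),          // year
                (int) ((packed >> 22) & 0xF),  // month
                (int) ((packed >> 17) & 0x1F), // day
                (int) ((packed >> 12) & 0x1F), // hour
                (int) ((packed >> 6) & 0x3F),  // minute
                (int) (packed & 0x3F)          // second
        };
    }

    public static void main(String[] args) {
        long p = pack(1969, 1, 20, 12, 30, 45);
        int[] u = unpack(p);
        System.out.println(u[0] + "-" + u[1] + "-" + u[2]); // prints 1969-1-20
    }
}
```

A real encoding would also need to handle negative years, timezone offsets, and fractional seconds, which is part of why the decoding overhead and testing effort may outweigh the byte savings.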
If I recall correctly, it's only the last patch that reorganises some functions. The other patches are applied before it and address genuine issues, so you could skip the reorganisation and just apply the bug fixes if you prefer.
I haven't included the bit encoding in this patch set. If you are happy with these, I can send that as an additional add-on patch set. Regarding size, I had assumed that when there are more than a few hundred thousand docs in a collection, the few bytes' difference quickly compounds into a lot of bytes we don't need.
OK, I'd say just commit the current set of patches as they are (should all be fine), and then the bit encoding could go into a follow-up PR, so @joewiz can retest it separately. Might be easiest.
@wolfgangmm Okay I pushed the additional commits.
I would rather get the storage format fixed in the initial PR; otherwise the on-disk binary format will change later, making the data stored in the Lucene index files non-portable between this and a newer version. I hope to have some time next week to also push commits for the smaller storage improvements.
In addition, there is still an outstanding issue with xs:integer and its storage in this PR. I note that this issue also appears in the NativeStructuralIndex. As we know, the W3C XQuery spec says that xs:integer is unbounded, but the current code truncates it to 64 bits. I will see if I can't fix that also...
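The truncation concern can be illustrated with a small, self-contained Java sketch (illustrative only, not eXist code): `BigInteger.longValue()` performs a narrowing conversion that keeps only the low-order 64 bits, so an unbounded xs:integer value silently wraps when forced into a long.

```java
import java.math.BigInteger;

public class IntegerTruncationDemo {
    public static void main(String[] args) {
        // xs:integer is unbounded, but packing it into a 64-bit long keeps
        // only the low-order 64 bits: 2^64 + 1 silently becomes 1.
        BigInteger big = BigInteger.valueOf(2).pow(64).add(BigInteger.ONE);
        long truncated = big.longValue(); // narrowing conversion, no error
        System.out.println(truncated);    // prints 1

        // longValueExact() instead throws ArithmeticException on overflow,
        // which surfaces the problem rather than silently corrupting data.
        try {
            big.longValueExact();
        } catch (ArithmeticException e) {
            System.out.println("overflow detected");
        }
    }
}
```

Any fix would need either a variable-length storage format or an explicit overflow check along these lines.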
@dizzzz Yes sorry, not quite finished yet...
Re-running my tests to compare the performance of the current develop branch vs. this PR (rebased against the current develop branch)—but also adding a run with a wildcard to return all results (not just those containing the word "washington") and averaging the result of 5 tests each—I get the following results:
| branch | query | result count | time |
|---|---|---|---|
| develop | "washington" | 178,926 | 0.8s |
| PR | "washington" | 178,926 | 0.4s |
| develop | "*" | 312,832 | 1.9s |
| PR | "*" | 312,832 | 1.2s |
The PR still shows the same ~40-50% improvement—and of course, the same number of results.
What is the status of this PR? After resolving the conflicts and applying it to the current develop branch, I was surprised to find that the Lucene test suite no longer passes:
```
[ERROR] Tests run: 348, Failures: 234, Errors: 3, Skipped: 32
```
May I ask that this PR be given some priority again, either by finalizing the started modifications and fixing the test failures, or by restoring the original state of the PR, which worked fine? Further improvements can be applied in another iteration if needed, but the broken state makes it impossible to even do more testing.
Could we merge as of https://github.com/eXist-db/exist/pull/4253/commits/8ae49e28149cb6d7d7727ab6dc431c3ff528ba64 – which seems to be a working version with tests passing – and move the refactoring part to a new PR?
This may enable us to get the important part of the change into eXist 6.1.0 – the added stability from avoiding OOMs will likely be very helpful to users – while doing the refactoring and possible improvements and optimizations later on.
@adamretter @line-o @wolfgangmm Would that be a workable solution?
Looking at the commits I would like to see https://github.com/eXist-db/exist/pull/4253/commits/737228765b4f8f31551a6e8f86b1f9db9d5b703e merged
Later commits add additional features, such as xs:short support, which can go into a separate PR.
@dariok suggested to merge https://github.com/eXist-db/exist/pull/4253/commits/101f8ce40cc544529576ba747e5e432e435354af (the new hash for the same commit)
Please open a new PR with additional changes to typed Lucene fields. Closing this PR, now that #4541 was merged.