
Fix lucene field query issues

Open wolfgangmm opened this issue 3 years ago • 19 comments

This PR addresses three problems when using Lucene fields for sorting or iterating over large node sets (> 100,000 elements):

  1. eXist-db quickly runs out of memory: ft:query retrieves the fields for the entire result set at once and keeps them attached to each item in the set, whether or not the field's content is ever accessed
  2. Because fields are retrieved at query time, users have to know in advance which fields will be required later and list them in the ft:query options
  3. Fixing 1 and 2 will further slow down field-based sort operations on large node sets: those operations were already slow, and performance degrades even more once fields are no longer preloaded.

The PR effectively fixes 1 by not preloading any fields; instead, loading is deferred until ft:field is called. As a consequence, users no longer need to specify the fields to retrieve at query time, so 2 is solved as well. (This doesn't change the API: you can still define fields in the ft:query options, though the setting is unnecessary and will be ignored.)
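A minimal sketch of the change in usage (the collection, element, and field names here are hypothetical):

(: before this PR: fields to retrieve had to be listed at query time :)
for $hit in collection('/db/data')//doc[ft:query(., 'title:lucene', map { 'fields': 'title' })]
return ft:field($hit, 'title')

(: with this PR: the option can be omitted; the field is loaded lazily when ft:field is called :)
for $hit in collection('/db/data')//doc[ft:query(., 'title:lucene')]
return ft:field($hit, 'title')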

To address 3, an additional configuration attribute is introduced for field definitions in collection.xconf: adding binary="yes" results in the field being stored as Lucene BinaryDocValues, which guarantees the fastest possible access in Lucene. This follows the example of other Lucene-based systems (e.g. Elasticsearch), which provide a similar option. Binary fields can only be retrieved (via ft:binary-field, see the examples below), not queried like other fields.

~~For strings and simple numeric types,~~ performance should be outstanding. ~~A bottleneck remains for complex data types though, in particular xs:date and xs:dateTime. Here the overhead to cast the stored binary field into a proper XQuery xs:date is always considerable, though not to be avoided. We're still trying to find a way to address this.~~

Connected documentation PR: https://github.com/eXist-db/documentation/pull/760

wolfgangmm avatar Feb 24 '22 13:02 wolfgangmm

The 2 Windows failures resemble those in #4264 (suggesting the issue isn't with these PRs per se, but that there's some existing issue with the develop branch in its interaction with the Windows environment), but the failure unique to this PR is Codacy's code smell check:

https://sonarcloud.io/project/issues?id=eXist-db_exist&pullRequest=4253&resolved=false&types=CODE_SMELL

There are 8 issues, 4 labeled critical, 2 major, and 2 minor.

joewiz avatar Mar 11 '22 21:03 joewiz

I am going to need some time to review this, as I have some initial concerns about the performance profile change, although until I test it I have no idea what to expect.

@wolfgangmm Do you have any benchmarks that you can share with us please?

adamretter avatar Mar 12 '22 08:03 adamretter

Just take a large collection containing an xs:dateTime value by which you can sort entries, and define an index field on the dates. Without this PR, you'll soon hit memory limits, and sorting 200,000 entries may take > 20s. With the PR applied, memory consumption stays low and, if you define the field as "binary", performance becomes acceptable.

Test query against all of https://github.com/HistoryAtState/frus/tree/master/volumes:

declare namespace tei="http://www.tei-c.org/ns/1.0";

for $div in collection('/db/frus')//tei:div[ft:query(., 'hsg-fulltext:washington')]
let $date := ft:binary-field($div, 'hsg-date-min', 'xs:dateTime')
order by $date ascending
return
    $date

with an index defined as follows:

<text qname="tei:div">
    <field name="hsg-date-min" if="@type = ('section', 'document')" expression="@frus:doc-dateTime-min" type="xs:dateTime" binary="true"/>
</text>

should return in less than a second on average instead of crashing the database, which it did before.

wolfgangmm avatar Mar 14 '22 16:03 wolfgangmm

@wolfgangmm Thanks - so the test that you mentioned is for the improvement. I will try it out... Can you also suggest a test for the decrease in performance in other areas that you mentioned please?

adamretter avatar Mar 14 '22 17:03 adamretter

You can test the pre-PR state with the same data. Note though that the number of fields defined on the same elements has an influence (see below).

declare namespace tei="http://www.tei-c.org/ns/1.0";

for $div in collection('/db/frus')//tei:div[ft:query(., 'hsg-fulltext:washington', map { 'fields': 'hsg-date-min' })]
let $date := ft:field($div, 'hsg-date-min', 'xs:dateTime')
order by $date ascending
return
    $date

While the performance of this is a problem, the main showstopper is memory consumption. It was bearable if you used an xs:string index, but you were still likely to run out of memory when evaluating several queries in parallel. This is fixed by not pre-caching the fields, at the cost of a slight delay when accessing ft:field; the difference is very minor, though. Performance was bad in any case, with results clocking in after more than 20s.

So if accessing a particular field of every indexed node in a sequence with > 100,000 entries is crucial, switching to Lucene's "binary" column-stride store solves the problem. I tried out half a dozen other approaches and studied Lucene's source code to understand how fields are stored. The main bottleneck seems to be that Lucene normally stores fields in a sequential list, so it has to walk the list to find a particular field; if your field happens to be at the end, you're basically screwed. Access to the column-stride store does not suffer from this limitation.
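To see the difference on your own data, a rough timing harness along the following lines can be used. This is only a sketch: it assumes two hypothetical field definitions on tei:div, a regular one named hsg-date-min and a binary="true" one named hsg-date-min-bin (the latter name is made up for illustration).

xquery version "3.1";

declare namespace tei="http://www.tei-c.org/ns/1.0";

let $hits := collection('/db/frus')//tei:div[ft:query(., 'hsg-fulltext:washington')]
let $t1 := util:system-time()
(: sequential field store: Lucene walks each document's field list :)
let $regular := for $div in $hits return ft:field($div, 'hsg-date-min', 'xs:dateTime')
let $t2 := util:system-time()
(: column-stride store: direct per-document access :)
let $binary := for $div in $hits return ft:binary-field($div, 'hsg-date-min-bin', 'xs:dateTime')
let $t3 := util:system-time()
return (
    'ft:field: ' || ($t2 - $t1),
    'ft:binary-field: ' || ($t3 - $t2)
)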

wolfgangmm avatar Mar 15 '22 08:03 wolfgangmm

As discussed in the last community call, I addressed the code smells in this PR with my recent commits. The one remaining code smell flagged as critical is not semantically useful to fix, and the score is A regardless. I consider my work on this done.

line-o avatar Mar 16 '22 14:03 line-o

Thanks, @wolfgangmm and @line-o for the additional info and commits.

Here are my steps, based on Wolfgang's but filling in some more detail, that demonstrate some of the performance gains.

  1. Download or clone the https://github.com/HistoryAtState/frus repository.
  2. Build the package with ant.
  3. Using develop HEAD, install the package.
  4. Replace /db/system/config/db/frus/volumes/volumes.xconf with the following
    <collection xmlns="http://exist-db.org/collection-config/1.0">
        <index xmlns:tei="http://www.tei-c.org/ns/1.0" xmlns:frus="http://history.state.gov/frus/ns/1.0" xmlns:xs="http://www.w3.org/2001/XMLSchema">
    
            <!-- Lucene index configuration -->
            <lucene>
                <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer">
                    <!-- Specify stop words - or remove them entirely -->
                    <param name="stopwords" type="org.apache.lucene.analysis.util.CharArraySet">
                        <!--<value>the</value>-->
                    </param>
                </analyzer>
                <text qname="tei:div">
                    <field name="hsg-fulltext" if="@type = ('section', 'document')" store="no"/>
                    <field name="hsg-date-min" if="@type = ('section', 'document')" expression="@frus:doc-dateTime-min" type="xs:string"/>
                </text>
            </lucene>
    
        </index>
    </collection>
    
  5. Run the following query to reindex the /db/apps/frus/volumes collection:
    xquery version "3.1";
    
    let $start := util:system-time()
    return
        (
            $start,
            xmldb:reindex("/db/apps/frus/volumes"),
            util:system-time(),
            util:system-time() - $start
        )
    
  6. Run the following query to count the # of results and obtain timing:
    xquery version "3.1";
    
    declare namespace tei="http://www.tei-c.org/ns/1.0";
    
    let $start := util:system-time()
    let $results := 
        for $div in collection('/db/apps/frus/volumes')//tei:div[ft:query(., 'hsg-fulltext:washington', map { 'fields': 'hsg-date-min' })]
        let $date := ft:field($div, 'hsg-date-min', 'xs:string')
        order by $date ascending
        return
            $date
    return
        (
            $start,
            util:system-time(),
            util:system-time() - $start,
            count($results)
        )
    
    The duration on my system (using the develop HEAD build from step 3) is xs:dayTimeDuration("PT1.436S"), and the # of results is 178,926.
  7. After rebuilding eXist from this PR branch (I even rebased it against develop for good measure), replace volumes.xconf with the following xconf, which changes the field/@type from xs:string to xs:dateTime and adds field/@binary="true":
    <collection xmlns="http://exist-db.org/collection-config/1.0">
        <index xmlns:tei="http://www.tei-c.org/ns/1.0" xmlns:frus="http://history.state.gov/frus/ns/1.0" xmlns:xs="http://www.w3.org/2001/XMLSchema">
    
            <!-- Lucene index configuration -->
            <lucene>
                <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer">
                    <!-- Specify stop words - or remove them entirely -->
                    <param name="stopwords" type="org.apache.lucene.analysis.util.CharArraySet">
                        <!--<value>the</value>-->
                    </param>
                </analyzer>
                <text qname="tei:div">
                    <field name="hsg-fulltext" if="@type = ('section', 'document')" store="no"/>
                    <field name="hsg-date-min" if="@type = ('section', 'document')" expression="@frus:doc-dateTime-min" type="xs:dateTime" binary="true"/>
                </text>
            </lucene>
    
        </index>
    </collection>
    
  8. Reindex using the script in step 5.
  9. Run the following script, a variant of the one from step 6 that switches from ft:field to ft:binary-field and drops the now-unneeded 'fields' option:
    xquery version "3.1";
    
    declare namespace tei="http://www.tei-c.org/ns/1.0";
    
    let $start := util:system-time()
    let $results := 
        for $div in collection('/db/apps/frus/volumes')//tei:div[ft:query(., 'hsg-fulltext:washington')]
        let $date := ft:binary-field($div, 'hsg-date-min', 'xs:dateTime')
        order by $date ascending
        return
            $date
    return
        (
            $start,
            util:system-time(),
            util:system-time() - $start,
            count($results)
        )
    
    This returns the same number of results as in step 6, but they come back in 0.445s, compared to 1.436s: 0.99s faster.

I have not tested the query under load, but based on Wolfgang's statement above, this PR branch also performs better under load.

joewiz avatar Mar 18 '22 17:03 joewiz

@wolfgangmm Amongst some minor optimisations and code cleanup, I also found some issues with missing data types and truncated type conversion. I can push these to this branch if you wish? In the first instance I attach a series of 13 patches for your review.

If you are happy with these, I can also send some more patches to reduce the storage size of the fields...

To test these, you can apply them using git am -3. Cheers

patches.zip

adamretter avatar Apr 04 '22 23:04 adamretter

@adamretter Well, this does a lot of refactoring along the way, whereas I tried to limit the PR to its narrow scope. But the changes should be good and without risk as far as I could see from going through each of them, so I'm fine with having them applied.

Likewise, I did not think about using a bit encoding, as I would expect the performance win to be barely noticeable: the number of date and time fields in a collection will usually be rather small, and decoding may also cost time. We can try it, though, but we would then need to test again as described by Joe, just to be sure the original performance issues remain solved.

wolfgangmm avatar Apr 11 '22 08:04 wolfgangmm

If I recall correctly, it's only the last patch that reorganised some functions. The other patches are applied before that and address genuine issues, so you could skip the reorganisation and just apply the bug fixes if you prefer.

I haven't included the bit encoding in this patch set; if you are happy with these, I can send that as an additional add-on patch set. Regarding size, I had assumed that when there are more than a few hundred thousand docs in a collection, the few bytes of difference per field quickly compound into a lot of bytes we don't need.

adamretter avatar Apr 11 '22 08:04 adamretter

OK, I'd say just commit the current set of patches as they are (they should all be fine), and then the bit encoding could perhaps go into a follow-up PR so @joewiz can retest it separately. That might be easiest.

wolfgangmm avatar Apr 14 '22 07:04 wolfgangmm

@wolfgangmm Okay I pushed the additional commits.

I would rather get the storage format fixed in the initial PR; otherwise the binary format on disk will change, making the data stored in the Lucene index files non-portable between this and a newer version. I hope to have some time next week to also push commits for the smaller storage improvements.

In addition, there is still an outstanding issue in this PR with xs:integer and its storage. I note that this issue also appears in the NativeStructuralIndex. As we know, the W3C XQuery spec says that the size of xs:integer is unbounded, but the current code truncates it at 2^64. I will see if I can't fix that as well...

adamretter avatar Apr 15 '22 08:04 adamretter

@dizzzz Yes sorry, not quite finished yet...

adamretter avatar Apr 25 '22 15:04 adamretter

Re-running my tests to compare the performance of the current develop branch against this PR (rebased against the current develop branch), this time also adding a run with a wildcard to return all results (not just those containing the word "washington") and averaging the results of 5 runs each, I get the following:

| branch  | query        | result count | time |
| ------- | ------------ | ------------ | ---- |
| develop | "washington" | 178,926      | 0.8s |
| PR      | "washington" | 178,926      | 0.4s |
| develop | "*"          | 312,832      | 1.9s |
| PR      | "*"          | 312,832      | 1.2s |

The PR still shows the same ~40-50% improvement and, of course, returns the same number of results.
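The averaging harness itself isn't shown in the thread; one way to do it, sketched here by reusing the query from step 9 above (the 1-to-5 loop and the seconds conversion are my additions):

xquery version "3.1";

declare namespace tei="http://www.tei-c.org/ns/1.0";

let $times :=
    for $run in 1 to 5
    let $start := util:system-time()
    let $results :=
        for $div in collection('/db/apps/frus/volumes')//tei:div[ft:query(., 'hsg-fulltext:washington')]
        let $date := ft:binary-field($div, 'hsg-date-min', 'xs:dateTime')
        order by $date ascending
        return $date
    (: convert the run's duration to a plain number of seconds;
       a lazier engine may need $results consumed, e.g. via count() :)
    return (util:system-time() - $start) div xs:dayTimeDuration('PT1S')
return 'average: ' || avg($times) || 's over ' || count($times) || ' runs'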

joewiz avatar Apr 25 '22 16:04 joewiz

What is the status of this PR? After resolving the conflicts and applying it to the current develop branch, I was surprised to find that the Lucene test suite no longer passes:

[ERROR] Tests run: 348, Failures: 234, Errors: 3, Skipped: 32

line-o avatar Jul 08 '22 10:07 line-o

May I ask that this PR be given some priority again, by either finalizing the started modifications and fixing the test failures, or by restoring the original state of the PR, which worked fine? Further improvements can be applied in another iteration if needed, but the broken state makes it impossible to even do more testing.

wolfgangmm avatar Jul 12 '22 07:07 wolfgangmm

Could we merge as of https://github.com/eXist-db/exist/pull/4253/commits/8ae49e28149cb6d7d7727ab6dc431c3ff528ba64 (which seems to be a working version with tests passing) and move the refactoring part to a new PR?

This may enable us to get the important part of the change into eXist 6.1.0 (the added stability from avoiding OOM will likely be very helpful to users), while doing the refactoring and possible improvements and optimizations later on.

@adamretter @line-o @wolfgangmm Would that be a workable solution?

dariok avatar Aug 01 '22 18:08 dariok

Looking at the commits, I would like to see https://github.com/eXist-db/exist/pull/4253/commits/737228765b4f8f31551a6e8f86b1f9db9d5b703e merged. Later commits add additional features, like xs:short support, which can go into a separate PR.

line-o avatar Sep 05 '22 18:09 line-o

@dariok suggested merging https://github.com/eXist-db/exist/pull/4253/commits/101f8ce40cc544529576ba747e5e432e435354af (the new hash for the same commit).

line-o avatar Sep 05 '22 18:09 line-o

Please open a new PR with additional changes to typed Lucene fields. Closing this PR now that #4541 has been merged.

line-o avatar Sep 22 '22 09:09 line-o