
[BUG] xmldb:reindex($collection-uri, $doc-uri) always adds new entries in Lucene index

Open MarkusWb opened this issue 3 years ago • 2 comments

Is related to issue #957

Describe the bug

Reindexing a document by calling the XQuery function

xmldb:reindex($collection-uri, $doc-uri)

always adds a new document entry in a Lucene index.

I stepped through the eXist code with a debugger and found that:

  • when a collection is reindexed, the old Lucene index entries are always erased before they are rebuilt (IndexController.removeCollection);
  • when nodes are updated, the same is achieved by setting the flag IndexController.reindexing and removing the affected nodes before rewriting them in LuceneIndexWorker.write;
  • neither is done when a single document is reindexed.

Regarding our use case:
We use the Lucene index to track references to a document from other documents:

<lucene>
    <module uri="http://awb.saw-leipzig.de/xquery/facet-utils" prefix="fu" at="xmldb:exist:////db/projects/awb/scripts/facet-utils.xq"/>
    <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
    <text qname="task">
        <field name="numberAutoInstances" expression="fu:countInstancesPointingToTaskWithId(./@id)"/>
        <facet dimension="hasInstances" expression="fu:countInstancesPointingToTaskWithId(./@id) &gt; 0"/>
    </text>
</lucene>

We trigger the reindex of a referenced file in a trigger method trigger:before-update-document($uri as xs:anyURI) on the referencing document.
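Such a trigger could look roughly like the following sketch. The collection path and document name of the referenced file are hypothetical placeholders, not our real values; only the trigger function name and the xmldb:reindex#2 call are taken from the description above.

```xquery
xquery version "3.1";

module namespace trigger="http://exist-db.org/xquery/trigger";

(: Hypothetical location of the referenced document; adjust to the real data layout. :)
declare variable $trigger:REFERENCED-COLLECTION := "/db/projects/awb/tasks";
declare variable $trigger:REFERENCED-DOC := "tasks.xml";

(: Called by eXist before the referencing document at $uri is updated.
   Reindexes the referenced document so that its computed fields and facets
   are refreshed. :)
declare function trigger:before-update-document($uri as xs:anyURI) {
    xmldb:reindex($trigger:REFERENCED-COLLECTION, $trigger:REFERENCED-DOC)
};
```

With the behaviour described in this issue, every invocation of this trigger adds another entry for the referenced document to the Lucene index.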

As can be seen in the screenshot of the Luke Index Browser, after removing and adding the reference in the referencing document several times, each time a new document entry with the same docNodeId was added. But a simple call of xmldb:reindex($collection-uri, $doc-uri) also leads to new entries.

[Screenshot: lucene_index_facets]

Expected behavior

A reindex should update existing entries in the Lucene index.

Proposed fix

Call IndexController.setReindexing(true) in the call chain of the reindexDocument(...) methods, or use IndexWriter.updateDocument instead of IndexWriter.addDocument in the method LuceneIndexWorker.write.

To Reproduce

  1. Create a collection /db/projects/test with two documents:
  • test1.xml
<root id="1">LuceneTest1</root>
  • test2.xml
<root id="2"><child/>LuceneTest2</root>
  2. Create a configuration file /db/system/config/db/projects/test/collection.xconf for this collection:
<collection xmlns="http://exist-db.org/collection-config/1.0" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xmldb="http://exist-db.org/xquery/xmldb">
    <index>
        <lucene>
            <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
            <text qname="root">
                <facet dimension="testFacet" expression="empty(./*)"/>
            </text>
        </lucene>
    </index>
</collection>
  3. Call this XQuery:
xquery version "3.1";

let $ri0 := xmldb:reindex('/db/projects/test')
let $ri1_1 := xmldb:reindex('/db/projects/test', 'test1.xml')
let $ri2_1 := xmldb:reindex('/db/projects/test', 'test1.xml')
let $ri1_2 := xmldb:reindex('/db/projects/test', 'test2.xml')
let $ri2_2 := xmldb:reindex('/db/projects/test', 'test2.xml')

let $options := map {
  'facets': map {
    'testFacet': ()
  }
}

let $results := collection('/db/projects/test')//root[ft:query(., (), $options)]

let $testFacet := ft:facets($results, 'testFacet', ())
return <result>{
    element facets {
      attribute dimension {'testFacet'},
      map:for-each($testFacet, function($label, $count) {
        element facet {
          attribute v {$label},
          attribute n {$count}
        }
      })
    }
}</result>
  4. The result counts 3 instances for each document instead of 1:
<result>
    <facets dimension="testFacet">
        <facet v="true" n="3"/>
        <facet v="false" n="3"/>
    </facets>
</result>

Context:

  • Win10
  • eXist-db version: 5.3.0
  • Java Version: OpenJDK 15.0.2

Additional context

  • eXist-db installed from ZIP

MarkusWb, Jul 17 '21 22:07

Here is an XQSuite test that further simplifies the test supplied above. It demonstrates that each time we reindex using xmldb:reindex#2, the number of facet hits is incremented. However, when we use xmldb:reindex#1, the correct results are returned. Similarly, the expected number of hits is returned on a plain call to ft:query, regardless of which reindex function is called.

Thus, there is an issue with xmldb:reindex#2 and the count returned by ft:facets.

xquery version "3.1";

module namespace t="http://exist-db.org/xquery/test";

declare namespace test="http://exist-db.org/xquery/xqsuite";

declare variable $t:XML := document {
    <root>foo</root>
};

declare variable $t:xconf := <collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <lucene>
            <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
            <text qname="root">
                <facet dimension="test-facet" expression="'bar'"/>
            </text>
        </lucene>
    </index>
</collection>;

declare
    %test:setUp
function t:setup() {
    let $testCol := xmldb:create-collection("/db", "test")
    let $indexCol := xmldb:create-collection("/db/system/config/db", "test")
    return
        (
            xmldb:store("/db/test", "test.xml", $t:XML),
            xmldb:store("/db/system/config/db/test", "collection.xconf", $t:xconf),
            xmldb:reindex("/db/test")
        )
};

declare
    %test:tearDown
function t:tearDown() {
    xmldb:remove("/db/test"),
    xmldb:remove("/db/system/config/db/test")
};

declare
    %test:assertEquals("1", "1", "1", "1", "1")
function t:facets-after-reindex-arity-2() {
    let $reindex := xmldb:reindex("/db/test")
    for $i in (1 to 5)
    let $hits := collection("/db/test")//root[ft:query(., ())]
    let $facets := ft:facets($hits, "test-facet")
    let $reindex-doc := xmldb:reindex("/db/test", "test.xml")
    return 
        $facets?bar
};

declare
    %test:assertEquals("1", "1", "1", "1", "1")
function t:facets-after-reindex-arity-1() {
    let $reindex := xmldb:reindex("/db/test")
    for $i in (1 to 5)
    let $hits := collection("/db/test")//root[ft:query(., ())]
    let $facets := ft:facets($hits, "test-facet")
    let $reindex-col := xmldb:reindex("/db/test")
    return 
        $facets?bar
};

declare
    %test:assertEquals("1", "1", "1", "1", "1")
function t:hits-after-reindex-arity-2() {
    let $reindex := xmldb:reindex("/db/test")
    for $i in (1 to 5)
    let $hits := collection("/db/test")//root[ft:query(., ())]
    let $reindex-doc := xmldb:reindex("/db/test", "test.xml")
    return 
        count($hits)
};

declare
    %test:assertEquals("1", "1", "1", "1", "1")
function t:hits-after-reindex-arity-1() {
    let $reindex := xmldb:reindex("/db/test")
    for $i in (1 to 5)
    let $hits := collection("/db/test")//root[ft:query(., ())]
    let $reindex-col := xmldb:reindex("/db/test")
    return 
        count($hits)
};

This test suite returns the following results:

<testsuite package="http://exist-db.org/xquery/test" timestamp="2021-07-25T16:19:46.77-04:00"
    tests="4" failures="1" errors="0" pending="0" time="PT0.207S">
    <testcase name="facets-after-reindex-arity-1" class="t:facets-after-reindex-arity-1"/>
    <testcase name="facets-after-reindex-arity-2" class="t:facets-after-reindex-arity-2">
        <failure message="assertEquals failed." type="failure-error-code-1">1 1 1 1 1</failure>
        <output>1 2 3 4 5</output>
    </testcase>
    <testcase name="hits-after-reindex-arity-1" class="t:hits-after-reindex-arity-1"/>
    <testcase name="hits-after-reindex-arity-2" class="t:hits-after-reindex-arity-2"/>
</testsuite>

I used eXist 5.3.0 1934cd7cd0c0ff3decac0b770969cab435409e52 20210626123843.
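Since xmldb:reindex#1 returns the expected facet counts in the tests above, one possible stopgap is to reindex the whole collection instead of a single document. The following is only a sketch under that assumption: local:reindex-via-collection is a hypothetical helper name, and reindexing a full collection can be far more expensive than reindexing one document.

```xquery
xquery version "3.1";

(: Hypothetical helper: avoids xmldb:reindex#2 by reindexing the whole
   collection with xmldb:reindex#1, which erases old Lucene entries first. :)
declare function local:reindex-via-collection($collection-uri as xs:string) as xs:boolean {
    xmldb:reindex($collection-uri)
};

(: Instead of xmldb:reindex("/db/test", "test.xml") :)
local:reindex-via-collection("/db/test")
```

This does not fix the underlying problem; it only sidesteps the code path that appends entries without removing the old ones.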

joewiz, Jul 25 '21 20:07

"Thus, there is an issue with xmldb:reindex#2 and the count returned by ft:facets."

I don't think there is an issue with ft:facets. It works as expected in the sense that it counts all indexed objects. The problem is that there are more objects in the index than there should be, because xmldb:reindex#2 just adds new objects to the index.

The test could also be implemented (or extended) using ft:field, like this:

  1. Add test.xml with <root><foo>bar</foo></root>
  2. Add collection.xconf defining field <field name="foo" expression=".//foo"/>
  3. Reindex with xmldb:reindex#1
  4. Search //root[ft:query(., 'foo:bar')] → get 1 result
  5. Read field with ft:field($result, 'foo') → bar
  6. Search //root[ft:query(., 'foo:foo')] → get 0 results
  7. Change content in <foo> from 'bar' to 'foo'
  8. Reindex with xmldb:reindex#2
  9. Search //root[ft:query(., 'foo:bar')] → get 1 result (Expected: 0)
  10. Read field with ft:field($result, 'foo') → bar
  11. Search //root[ft:query(., 'foo:foo')] → get 1 result
  12. Read field with ft:field($result, 'foo') → foo

This is because both 'states' were indexed and saved as separate index objects: the new one was added without deleting the existing record, so both still coexist in the index database. The result set built by ft:query is apparently deduplicated, because a search for objects containing 'foo' or 'bar' in the field foo also returns only one record.

ukretschmer, Jul 26 '21 16:07