
[BUG] year portion of xs:date stored in Lucene Fields

Open line-o opened this issue 3 years ago • 2 comments

Describe the bug

When a Lucene field is configured with type xs:date, it is stored internally as a Lucene LongField. In this encoding the year portion is limited to the range 0 to 131071. The XQuery specification, however, allows xs:date to have arbitrarily large years.
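The reported upper bound equals 2^17 - 1, which would be consistent with a 17-bit year component, although that alone does not prove how the value is packed. A minimal sketch that derives the bound and checks a few of the sample values against it. The variable names and `local:year-in-range` are hypothetical helpers for illustration and not part of eXist-db:

```xquery
xquery version "3.1";

(: Assumption: the year occupies 17 unsigned bits, inferred only from the reported limit. :)
declare variable $local:year-bits := 17;

(: 2^17 - 1 = 131071, computed by repeated doubling to stay in xs:integer. :)
declare variable $local:max-year :=
    fold-left(1 to $local:year-bits, 1, function ($acc, $i) { $acc * 2 }) - 1;

(: Hypothetical helper: true if the year of $date lies within 0..$local:max-year. :)
declare function local:year-in-range($date as xs:date) as xs:boolean {
    let $year := year-from-date($date)
    return $year ge 0 and $year le $local:max-year
};

for $value in ("-2022-02-22", "2022-08-07", "131071-08-07", "131072-08-07")
return
    map {
        "date": $value,
        "in-range": local:year-in-range(xs:date($value))
    }
```

Running a check like this over the source data before indexing would show which values fall outside the range described in this report.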

As the output of the code below demonstrates, eXist-db supports this as well.

(
    "-1444153-08-07",
    "-2022-08-07",
    "2022-08-07",
    "1444153-08-07"
)
! xs:date(.)
=> sort()

Attempting to achieve the same with field values stored as xs:date does not work.

Example data:

<items>
<!-- negative values cannot be retrieved from index with correct values -->
    <item date="-2022-02-22"/>
    <item date="-1000-01-01"/>
    <item date="-1-01-01"/>

<!-- these work -->
    <item date="2022-08-07"/>
    <item date="131071-08-07"/>

<!-- leads to error attempting to cast read value to date -->
    <item date="131072-08-07"/>

<!-- seems to be dropped from index -->
    <item date="2147483647-08-07"/>
    <item date="0-01-01"/>
</items>

Example collection.xconf

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <lucene>
            <text qname="item">
                <field name="date" expression="@date" type="xs:date"/>
            </text>
        </lucene>
    </index>
</collection>

While investigating this issue, it also turned out that the current implementation does not handle negative values for the year portion. The value is likely encoded as an unsigned integer; the observed upper limit of 131071 (2^17 - 1) points to 17 bits rather than 16.

This was uncovered by the unit tests for Lucene fields added in #4253.

Expected behavior

With the current implementation:

  • the limitation should be documented
  • an error should be raised when the index writer encounters an xs:date that cannot be stored losslessly
  • negative years should be supported

Alternatively, we should research a way to store unbounded values for the year portion of dates.
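Until the index writer reports such values itself, a guard on the application side could approximate the second point. The following is only a sketch: the `lfe` namespace, the error code and `local:assert-storable` are hypothetical names chosen for this example, and the range is the one reported above.

```xquery
xquery version "3.1";

(: Hypothetical error namespace, chosen only for this sketch. :)
declare namespace lfe = "http://example.com/lucene-field-error";

(: Assumption: storable year range as reported in this issue. :)
declare variable $local:max-year := 131071;

(: Hypothetical guard: returns $date unchanged if its year is in range, raises an error otherwise. :)
declare function local:assert-storable($date as xs:date) as xs:date {
    let $year := year-from-date($date)
    return
        if ($year ge 0 and $year le $local:max-year)
        then $date
        else error(
            xs:QName("lfe:year-out-of-range"),
            "Year " || $year || " cannot be stored losslessly in an xs:date field",
            $date
        )
};

(: The catch clause turns the error into a readable message instead of aborting the query. :)
try {
    local:assert-storable(xs:date("131072-08-07"))
} catch lfe:year-out-of-range {
    $err:description
}
```

Calling the guard on each value before the document is stored makes out-of-range dates fail loudly instead of being dropped or corrupted in the index.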

To Reproduce

xquery version "3.1";

module namespace lfdyt="http://exist-db.org/xquery/lucene-field-date-year-test";

declare namespace test="http://exist-db.org/xquery/xqsuite";

declare variable $lfdyt:data := document {
<items>
<!-- negative values cannot be retrieved from index with correct values -->
    <item date="-2022-02-22"/>
    <item date="-1000-01-01"/>
    <item date="-0001-01-01"/>

<!-- these work -->
    <item date="2022-08-07"/>
    <item date="131071-08-07"/>

<!-- leads to error attempting to cast read value to date -->
    <item date="131072-08-07"/>

<!-- seems to be dropped from index -->
    <item date="2147483647-08-07"/>
</items>
};

declare variable $lfdyt:xconf :=
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <lucene>
            <text qname="item">
                <field name="date" expression="@date" type="xs:date"/>
            </text>
        </lucene>
    </index>
</collection>;

declare variable $lfdyt:collection := "lucene-field-date-timezone-test";

declare
    %test:setUp
function lfdyt:setup() {
    let $testCol := xmldb:create-collection("/db", $lfdyt:collection)
    let $indexCol := xmldb:create-collection("/db/system/config/db", $lfdyt:collection)
    return (
        xmldb:store("/db/" || $lfdyt:collection, "test.xml", $lfdyt:data),
        xmldb:store("/db/system/config/db/" || $lfdyt:collection, "collection.xconf", $lfdyt:xconf)
    )
};

declare
(:    %test:tearDown:)
function lfdyt:tearDown() {
    xmldb:remove("/db/" || $lfdyt:collection),
    xmldb:remove("/db/system/config/db/" || $lfdyt:collection),
    xmldb:reindex("/db/" || $lfdyt:collection)
};

(: all date values can be stored and queried :)
declare
    %test:assertEquals("-2022-02-22", "-1000-01-01", "-0001-01-01", "2022-08-07", "131071-08-07", "131072-08-07", "2147483647-08-07")
function lfdyt:all-field-values-are-indexed() {
    collection("/db/" || $lfdyt:collection)
        //item[
            ft:query(., "date:*", map{ 
                "leading-wildcard": "yes",
                "fields": "date"
            })]
        ! ft:field(., "date", "xs:date")
        => sort()
};

Tested on:

  • OS: macOS 12.5.1
  • eXist-db version: 6.1.0-SNAPSHOT
  • Java version: Java 8u342

Additional context

  • How is eXist-db installed? built from source: commit f3424d7c44d2cc7d8eb722e0701709de6e497a34 (develop HEAD)
  • Any custom changes in e.g. conf.xml? none

line-o, Sep 02 '22 12:09

A general remark: we should not forget that the Lucene index is not a general purpose index but has a very specific scope, which is to search through text! Fields of other types are to be used to quickly filter or sort the results, but they should not be seen as an alternative to a range index, which supports the full range of XQuery data types. Typically one would create one or two fields in addition to a full text index on an element to allow for quick sorting or display of pre-computed strings. Using fields in places where a range index would be more appropriate simply means abusing the feature, and this should be clearly documented. We should not try to twist Lucene too much into something for which it was not designed.

Lucene fields are based on Lucene's capabilities and scope, which - as is reasonable for a text retrieval engine - are narrower than those of XQuery. This is completely fine as we access those fields via extension functions and not the default XQuery operators or functions. From a user perspective, it is to be expected that an extension may convert data types into a more limited range. This just needs to be documented.

wolfgangmm, Sep 04 '22 09:09

Yes, very good points, @wolfgang! In my frenzy checking typed Lucene field limitations I might have gone one step too far. I also think the best way forward is to clearly communicate the use cases and limitations, and where a developer is better off using alternatives. I totally agree.

line-o, Sep 04 '22 09:09