arcadedb SQL: Index can return too many entries when used with `ORDER BY`

ArcadeDB Version:

ArcadeDB Server v24.11.1-SNAPSHOT (build 4207821d055e37cbd74b76683e45073da759c2de/1729083296142/console)

OS and JDK Version:

Running on Mac OS X 12.7.6 - OpenJDK 64-Bit Server VM 17.0.12 (Homebrew)

Querying a database can return the wrong number of records, in particular too many records. This problem is not easy to reproduce so I attach a small database (75 records) generated from public data. This behavior occurs when from certain databases, such as the one provided below, is queried via SQL, ie SELECT, where an (not-unique) indexed property is used in the projection and also as ordering quantity (in ORDER BY). The schema is given here: https://github.com/ulbmuenster/dataasee/blob/main/database/schema.sql

Expected behavior

Correct number of records returned.

Actual behavior

Too many records are returned.

Steps to reproduce

Restore this backup: metadatalake-backup-20240829-100753793.zip (337KB)

SELECT count(*) FROM metadata; -- There are 75 records in the database

SELECT name, publicationYear FROM metadata; -- Returns 75 records correctly

SELECT name FROM metadata ORDER BY publicationYear; -- Returns 75 records correctly

SELECT name, publicationYear FROM metadata ORDER BY publicationYear; -- Returns 117 records INCORRECTLY

Notes

End of August (2024-08-28 -- 2024-08-31) @lvca (and I) looked into this and potential sources of this behavior could be:

FetchFromIndexStep
The UPSERT entering the data.

Here is a sample UPSERT statement that generated the provided database:

UPDATE metadata MERGE {"creators":[{"@type":"pair","name":"Neumann, Georg"},{"@type":"pair","name":"Boivin, Odette"},{"@type":"pair","name":"Kleber, Kristin"},{"@type":"pair","name":"Neumann, Georg"},{"@type":"pair","name":"Boivin, Odette"},{"@type":"pair","name":"Kleber, Kristin"}],"dataLocation":"https://data-management.uni-muenster.de/direct-access/wwurdm/07988641231","dataSteward":"https://datastore.uni-muenster.de","description":"The dataset comprises the RTI-data of the cuneiform tablet VAT 17966 (N5:27).\\n\\nRTI = Reflectance Transformation Imaging (RTI) is a computational photographic method that captures a subject’s surface shape and color and enables the interactive re-lighting of the subject from any direction in a software viewer, revealing details not visible with the naked eye (cf. https://culturalheritageimaging.org/).","identifiers":[{"@type":"pair","data":"10.17879/07988641225","name":"DE-6"},{"@type":"pair","data":"10.17879/07988641225","name":"null"}],"keywords":"Babylon,Cuneiform,Neo-Babylonian","language":"#4:0","metadataQuality":"Incomplete","name":"VAT 17966 - RTI dataset","publicationYear":2024,"publisher":"Universität Münster","rawChecksum":"TcrJlnTQYfugD2rROs2Y3g==","rawType":"marc21","recordId":"MTA0MzQ4NDE4NDMwODAyNDg3NA","resourceType":null,"rights":"CC BY-NC-SA 4.0","source":"https://datastore.uni-muenster.de/oai","synonyms":[{"@type":"pair","data":"GoviB","name":"Alternative Title"}]} UPSERT WHERE recordId == 'MTA0MzQ4NDE4NDMwODAyNDg3NA';

As a workaround the projection can be renamed, ie:

SELECT name, publicationYear AS pubYear FROM metadata ORDER BY publicationYear;

but this also excludes the use of the index!

Oct 16 '24 13:10 gramian

Using either UPDATE ... MERGE ... UPSERT or UPDATE ... CONTENT ... UPSERT produces too many index results.

Oct 17 '24 08:10 gramian

Here is some more testing which may help locate the source:

Both sorting directions ASC and DESC produce these duplicate results but different numbers: 146 for ASC, and 117 for DESC for a database with 81 records.
This index behavior occurs also for INSERT INTO ... CONTENT ...
Using DISTINCT removes the duplicates but this is no workaround as the in paged results the duplicates may be spread.

Nov 29 '24 12:11 gramian

More testing reveals:

Using multiple order keys like ORDER BY publicationYear, name returns the correct amount of results. @lvca This does also not use the publicationYear index, right?
The data type seems not to make a difference (tested with SHORT, INTEGER, and LONG).

Dec 11 '24 10:12 gramian

@robfrank Here is another test case using 25.1.1. It is the smallest I could produce yet with 39 records: metadatalake-backup-20250130-122441030.zip

The query: SELECT name FROM metadata ORDER BY publicationYear produces correctly 39 results, while the query SELECT name, publicationYear FROM metadata ORDER BY publicationYear returns 50 results.

(Reminder: these duplicates appear when a projected indexed property is also order key, here publicationYear)

And another observation:

Using REBUILD INDEX * causes the query SELECT name, publicationYear FROM metadata ORDER BY publicationYear to return 40 results. So the rebuild still produces the duplicate results, yet lesser.

I hope this can help.

Jan 30 '25 11:01 gramian

Here is some more info that may help:

The records are UPSERTed (see example above) via the HTTP command endpoint using the SQLscript language.
A Unique index is running on the property recordId which is of type string with the attributes (mandatory true, notnull true, readonly true, max 31)
Wrapping the UPSERTs in an extra transaction does not make a difference.

Feb 12 '25 08:02 gramian

Using a subquery as a workaround, like:

SELECT FROM (SELECT name, publicationYear FROM metadata) ORDER BY publicationYear

returns the correct number of results and still uses the index as it seems.

Apr 09 '25 12:04 gramian

Here is a simple case reproducing the duplication in the index:

CREATE DOCUMENT TYPE doc;
CREATE PROPERTY doc.num LONG;
CREATE INDEX ON doc (num) NOTUNIQUE;

then:

INSERT INTO doc SET num = 1
INSERT INTO doc SET num = 2
INSERT INTO doc SET num = 2
INSERT INTO doc SET num = 2

shows via:

SELECT FROM index:`doc[num]`

key    rid
[1]    #1:0
[2]    #1:3
[2]    #1:2
[2]    #1:1
[2]    #1:3
[2]    #1:2
[2]    #1:3

Now:

select num from doc where num > 0

returns all entries in the index, whereas:

select num from doc

returns only the actual records.

Sep 25 '25 13:09 gramian

Wrapping the inserts in a transaction:

BEGIN;
INSERT INTO doc SET num = 1;
INSERT INTO doc SET num = 2;
INSERT INTO doc SET num = 2;
INSERT INTO doc SET num = 2;
COMMIT;

prevents these duplicate entries in the example above.

Sep 28 '25 20:09 gramian

Another test to reproduce the issue by using the console (by @gramian):

 @Test
  public void testDuplicateEntries() throws IOException {
    assertThat(console.parse("create database duptest")).isTrue();

    assertThat(console.parse("CREATE DOCUMENT TYPE doc")).isTrue();
    assertThat(console.parse("CREATE PROPERTY doc.num LONG")).isTrue();
    assertThat(console.parse("CREATE INDEX ON doc (num) NOTUNIQUE")).isTrue();

    assertThat(console.parse("INSERT INTO doc SET num = 1")).isTrue();
    assertThat(console.parse("INSERT INTO doc SET num = 2")).isTrue();
    assertThat(console.parse("INSERT INTO doc SET num = 2")).isTrue();
    assertThat(console.parse("INSERT INTO doc SET num = 2")).isTrue();

    assertThat(console.getDatabase().query("sql", "SELECT count(*) AS count FROM index:`doc[num]`").next().<Long>getProperty("count")).isEqualTo(4);

  }

Sep 29 '25 20:09 lvca

Works! Very Cool. Thank you very much @lvca

Sep 30 '25 06:09 gramian