arcadedb icon indicating copy to clipboard operation
arcadedb copied to clipboard

SQL: Index can return too many entries when used with `ORDER BY`

Open gramian opened this issue 1 year ago • 6 comments

ArcadeDB Version:

ArcadeDB Server v24.11.1-SNAPSHOT (build 4207821d055e37cbd74b76683e45073da759c2de/1729083296142/console)

OS and JDK Version:

Running on Mac OS X 12.7.6 - OpenJDK 64-Bit Server VM 17.0.12 (Homebrew)

Querying a database can return the wrong number of records, in particular too many records. This problem is not easy to reproduce so I attach a small database (75 records) generated from public data. This behavior occurs when from certain databases, such as the one provided below, is queried via SQL, ie SELECT, where an (not-unique) indexed property is used in the projection and also as ordering quantity (in ORDER BY). The schema is given here: https://github.com/ulbmuenster/dataasee/blob/main/database/schema.sql

Expected behavior

Correct number of records returned.

Actual behavior

Too many records are returned.

Steps to reproduce

Restore this backup: metadatalake-backup-20240829-100753793.zip (337KB)

SELECT count(*) FROM metadata; -- There are 75 records in the database

SELECT name, publicationYear FROM metadata; -- Returns 75 records correctly

SELECT name FROM metadata ORDER BY publicationYear; -- Returns 75 records correctly

SELECT name, publicationYear FROM metadata ORDER BY publicationYear; -- Returns 117 records INCORRECTLY

Notes

End of August (2024-08-28 -- 2024-08-31) @lvca (and I) looked into this and potential sources of this behavior could be:

  • FetchFromIndexStep
  • The UPSERT entering the data.

Here is a sample UPSERT statement that generated the provided database:

UPDATE metadata MERGE {"creators":[{"@type":"pair","name":"Neumann, Georg"},{"@type":"pair","name":"Boivin, Odette"},{"@type":"pair","name":"Kleber, Kristin"},{"@type":"pair","name":"Neumann, Georg"},{"@type":"pair","name":"Boivin, Odette"},{"@type":"pair","name":"Kleber, Kristin"}],"dataLocation":"https://data-management.uni-muenster.de/direct-access/wwurdm/07988641231","dataSteward":"https://datastore.uni-muenster.de","description":"The dataset comprises the RTI-data of the cuneiform tablet VAT 17966 (N5:27).\\n\\nRTI = Reflectance Transformation Imaging (RTI) is a computational photographic method that captures a subject’s surface shape and color and enables the interactive re-lighting of the subject from any direction in a software viewer, revealing details not visible with the naked eye (cf. https://culturalheritageimaging.org/).","identifiers":[{"@type":"pair","data":"10.17879/07988641225","name":"DE-6"},{"@type":"pair","data":"10.17879/07988641225","name":"null"}],"keywords":"Babylon,Cuneiform,Neo-Babylonian","language":"#4:0","metadataQuality":"Incomplete","name":"VAT 17966 - RTI dataset","publicationYear":2024,"publisher":"Universität Münster","rawChecksum":"TcrJlnTQYfugD2rROs2Y3g==","rawType":"marc21","recordId":"MTA0MzQ4NDE4NDMwODAyNDg3NA","resourceType":null,"rights":"CC BY-NC-SA 4.0","source":"https://datastore.uni-muenster.de/oai","synonyms":[{"@type":"pair","data":"GoviB","name":"Alternative Title"}]} UPSERT WHERE recordId == 'MTA0MzQ4NDE4NDMwODAyNDg3NA';

As a workaround the projection can be renamed, ie:

SELECT name, publicationYear AS pubYear FROM metadata ORDER BY publicationYear;

but this also excludes the use of the index!

gramian avatar Oct 16 '24 13:10 gramian

Using either UPDATE ... MERGE ... UPSERT or UPDATE ... CONTENT ... UPSERT produces too many index results.

gramian avatar Oct 17 '24 08:10 gramian

Here is some more testing which may help locate the source:

  • Both sorting directions ASC and DESC produce these duplicate results but different numbers: 146 for ASC, and 117 for DESC for a database with 81 records.
  • This index behavior occurs also for INSERT INTO ... CONTENT ...
  • Using DISTINCT removes the duplicates but this is no workaround as the in paged results the duplicates may be spread.

gramian avatar Nov 29 '24 12:11 gramian

More testing reveals:

  • Using multiple order keys like ORDER BY publicationYear, name returns the correct amount of results. @lvca This does also not use the publicationYear index, right?
  • The data type seems not to make a difference (tested with SHORT, INTEGER, and LONG).

gramian avatar Dec 11 '24 10:12 gramian

@robfrank Here is another test case using 25.1.1. It is the smallest I could produce yet with 39 records: metadatalake-backup-20250130-122441030.zip

The query: SELECT name FROM metadata ORDER BY publicationYear produces correctly 39 results, while the query SELECT name, publicationYear FROM metadata ORDER BY publicationYear returns 50 results.

(Reminder: these duplicates appear when a projected indexed property is also order key, here publicationYear)

And another observation:

Using REBUILD INDEX * causes the query SELECT name, publicationYear FROM metadata ORDER BY publicationYear to return 40 results. So the rebuild still produces the duplicate results, yet lesser.

I hope this can help.

gramian avatar Jan 30 '25 11:01 gramian

Here is some more info that may help:

  • The records are UPSERTed (see example above) via the HTTP command endpoint using the SQLscript language.
  • A Unique index is running on the property recordId which is of type string with the attributes (mandatory true, notnull true, readonly true, max 31)
  • Wrapping the UPSERTs in an extra transaction does not make a difference.

gramian avatar Feb 12 '25 08:02 gramian

Using a subquery as a workaround, like:

SELECT FROM (SELECT name, publicationYear FROM metadata) ORDER BY publicationYear

returns the correct number of results and still uses the index as it seems.

gramian avatar Apr 09 '25 12:04 gramian

Here is a simple case reproducing the duplication in the index:

CREATE DOCUMENT TYPE doc;
CREATE PROPERTY doc.num LONG;
CREATE INDEX ON doc (num) NOTUNIQUE;

then:

INSERT INTO doc SET num = 1
INSERT INTO doc SET num = 2
INSERT INTO doc SET num = 2
INSERT INTO doc SET num = 2

shows via:

SELECT FROM index:`doc[num]`
key    rid
[1]    #1:0
[2]    #1:3
[2]    #1:2
[2]    #1:1
[2]    #1:3
[2]    #1:2
[2]    #1:3

Now:

select num from doc where num > 0

returns all entries in the index, whereas:

select num from doc

returns only the actual records.

gramian avatar Sep 25 '25 13:09 gramian

Wrapping the inserts in a transaction:

BEGIN;
INSERT INTO doc SET num = 1;
INSERT INTO doc SET num = 2;
INSERT INTO doc SET num = 2;
INSERT INTO doc SET num = 2;
COMMIT;

prevents these duplicate entries in the example above.

gramian avatar Sep 28 '25 20:09 gramian

Another test to reproduce the issue by using the console (by @gramian):

 @Test
  public void testDuplicateEntries() throws IOException {
    assertThat(console.parse("create database duptest")).isTrue();

    assertThat(console.parse("CREATE DOCUMENT TYPE doc")).isTrue();
    assertThat(console.parse("CREATE PROPERTY doc.num LONG")).isTrue();
    assertThat(console.parse("CREATE INDEX ON doc (num) NOTUNIQUE")).isTrue();

    assertThat(console.parse("INSERT INTO doc SET num = 1")).isTrue();
    assertThat(console.parse("INSERT INTO doc SET num = 2")).isTrue();
    assertThat(console.parse("INSERT INTO doc SET num = 2")).isTrue();
    assertThat(console.parse("INSERT INTO doc SET num = 2")).isTrue();

    assertThat(console.getDatabase().query("sql", "SELECT count(*) AS count FROM index:`doc[num]`").next().<Long>getProperty("count")).isEqualTo(4);

  }

lvca avatar Sep 29 '25 20:09 lvca

Works! Very Cool. Thank you very much @lvca

gramian avatar Sep 30 '25 06:09 gramian