SQL: Index can return too many entries when used with `ORDER BY`
ArcadeDB Version:
ArcadeDB Server v24.11.1-SNAPSHOT (build 4207821d055e37cbd74b76683e45073da759c2de/1729083296142/console)
OS and JDK Version:
Running on Mac OS X 12.7.6 - OpenJDK 64-Bit Server VM 17.0.12 (Homebrew)
Querying a database can return the wrong number of records, in particular too many records. This problem is not easy to reproduce so I attach a small database (75 records) generated from public data. This behavior occurs when from certain databases, such as the one provided below, is queried via SQL, ie SELECT, where an (not-unique) indexed property is used in the projection and also as ordering quantity (in ORDER BY). The schema is given here: https://github.com/ulbmuenster/dataasee/blob/main/database/schema.sql
Expected behavior
Correct number of records returned.
Actual behavior
Too many records are returned.
Steps to reproduce
Restore this backup: metadatalake-backup-20240829-100753793.zip (337KB)
SELECT count(*) FROM metadata; -- There are 75 records in the database
SELECT name, publicationYear FROM metadata; -- Returns 75 records correctly
SELECT name FROM metadata ORDER BY publicationYear; -- Returns 75 records correctly
SELECT name, publicationYear FROM metadata ORDER BY publicationYear; -- Returns 117 records INCORRECTLY
Notes
End of August (2024-08-28 -- 2024-08-31) @lvca (and I) looked into this and potential sources of this behavior could be:
FetchFromIndexStep- The
UPSERTentering the data.
Here is a sample UPSERT statement that generated the provided database:
UPDATE metadata MERGE {"creators":[{"@type":"pair","name":"Neumann, Georg"},{"@type":"pair","name":"Boivin, Odette"},{"@type":"pair","name":"Kleber, Kristin"},{"@type":"pair","name":"Neumann, Georg"},{"@type":"pair","name":"Boivin, Odette"},{"@type":"pair","name":"Kleber, Kristin"}],"dataLocation":"https://data-management.uni-muenster.de/direct-access/wwurdm/07988641231","dataSteward":"https://datastore.uni-muenster.de","description":"The dataset comprises the RTI-data of the cuneiform tablet VAT 17966 (N5:27).\\n\\nRTI = Reflectance Transformation Imaging (RTI) is a computational photographic method that captures a subject’s surface shape and color and enables the interactive re-lighting of the subject from any direction in a software viewer, revealing details not visible with the naked eye (cf. https://culturalheritageimaging.org/).","identifiers":[{"@type":"pair","data":"10.17879/07988641225","name":"DE-6"},{"@type":"pair","data":"10.17879/07988641225","name":"null"}],"keywords":"Babylon,Cuneiform,Neo-Babylonian","language":"#4:0","metadataQuality":"Incomplete","name":"VAT 17966 - RTI dataset","publicationYear":2024,"publisher":"Universität Münster","rawChecksum":"TcrJlnTQYfugD2rROs2Y3g==","rawType":"marc21","recordId":"MTA0MzQ4NDE4NDMwODAyNDg3NA","resourceType":null,"rights":"CC BY-NC-SA 4.0","source":"https://datastore.uni-muenster.de/oai","synonyms":[{"@type":"pair","data":"GoviB","name":"Alternative Title"}]} UPSERT WHERE recordId == 'MTA0MzQ4NDE4NDMwODAyNDg3NA';
As a workaround the projection can be renamed, ie:
SELECT name, publicationYear AS pubYear FROM metadata ORDER BY publicationYear;
but this also excludes the use of the index!
Using either UPDATE ... MERGE ... UPSERT or UPDATE ... CONTENT ... UPSERT produces too many index results.
Here is some more testing which may help locate the source:
- Both sorting directions
ASCandDESCproduce these duplicate results but different numbers: 146 forASC, and 117 forDESCfor a database with 81 records. - This index behavior occurs also for
INSERT INTO ... CONTENT ... - Using
DISTINCTremoves the duplicates but this is no workaround as the in paged results the duplicates may be spread.
More testing reveals:
- Using multiple order keys like
ORDER BY publicationYear, namereturns the correct amount of results. @lvca This does also not use thepublicationYearindex, right? - The data type seems not to make a difference (tested with
SHORT,INTEGER, andLONG).
@robfrank Here is another test case using 25.1.1. It is the smallest I could produce yet with 39 records: metadatalake-backup-20250130-122441030.zip
The query: SELECT name FROM metadata ORDER BY publicationYear produces correctly 39 results, while the query SELECT name, publicationYear FROM metadata ORDER BY publicationYear returns 50 results.
(Reminder: these duplicates appear when a projected indexed property is also order key, here publicationYear)
And another observation:
Using REBUILD INDEX * causes the query SELECT name, publicationYear FROM metadata ORDER BY publicationYear to return 40 results. So the rebuild still produces the duplicate results, yet lesser.
I hope this can help.
Here is some more info that may help:
- The records are UPSERTed (see example above) via the HTTP
commandendpoint using theSQLscriptlanguage. - A
Uniqueindex is running on the propertyrecordIdwhich is of typestringwith the attributes(mandatory true, notnull true, readonly true, max 31) - Wrapping the UPSERTs in an extra transaction does not make a difference.
Using a subquery as a workaround, like:
SELECT FROM (SELECT name, publicationYear FROM metadata) ORDER BY publicationYear
returns the correct number of results and still uses the index as it seems.
Here is a simple case reproducing the duplication in the index:
CREATE DOCUMENT TYPE doc;
CREATE PROPERTY doc.num LONG;
CREATE INDEX ON doc (num) NOTUNIQUE;
then:
INSERT INTO doc SET num = 1
INSERT INTO doc SET num = 2
INSERT INTO doc SET num = 2
INSERT INTO doc SET num = 2
shows via:
SELECT FROM index:`doc[num]`
key rid
[1] #1:0
[2] #1:3
[2] #1:2
[2] #1:1
[2] #1:3
[2] #1:2
[2] #1:3
Now:
select num from doc where num > 0
returns all entries in the index, whereas:
select num from doc
returns only the actual records.
Wrapping the inserts in a transaction:
BEGIN;
INSERT INTO doc SET num = 1;
INSERT INTO doc SET num = 2;
INSERT INTO doc SET num = 2;
INSERT INTO doc SET num = 2;
COMMIT;
prevents these duplicate entries in the example above.
Another test to reproduce the issue by using the console (by @gramian):
@Test
public void testDuplicateEntries() throws IOException {
assertThat(console.parse("create database duptest")).isTrue();
assertThat(console.parse("CREATE DOCUMENT TYPE doc")).isTrue();
assertThat(console.parse("CREATE PROPERTY doc.num LONG")).isTrue();
assertThat(console.parse("CREATE INDEX ON doc (num) NOTUNIQUE")).isTrue();
assertThat(console.parse("INSERT INTO doc SET num = 1")).isTrue();
assertThat(console.parse("INSERT INTO doc SET num = 2")).isTrue();
assertThat(console.parse("INSERT INTO doc SET num = 2")).isTrue();
assertThat(console.parse("INSERT INTO doc SET num = 2")).isTrue();
assertThat(console.getDatabase().query("sql", "SELECT count(*) AS count FROM index:`doc[num]`").next().<Long>getProperty("count")).isEqualTo(4);
}
Works! Very Cool. Thank you very much @lvca