dolt icon indicating copy to clipboard operation
dolt copied to clipboard

`IN (SUBQUERY)` expressions ignore collation

Open nicktobey opened this issue 8 months ago • 0 comments

Steps to reproduce:

-- create
CREATE TABLE test (
  pk INTEGER PRIMARY KEY,
  name TEXT NOT NULL
) COLLATE=utf8mb4_0900_ai_ci;

-- insert
INSERT INTO test VALUES (0001, 'Clark');
INSERT INTO test VALUES (0002, 'Dave');
INSERT INTO test VALUES (0003, 'Ava');

-- fetch 
SELECT 'clark' IN (SELECT name FROM test);

Expected output: true Actual output: false

Because the table uses a case-insensitive collation, the values 'Clark' and 'clark' should compare equal, and the expression should return true.

Here's the plan for the query:

+---------------------------------+
| plan                            |
+---------------------------------+
| Project                         |
|  ├─ columns: [InSubquery        |
|  │   ├─ left: 'clark'           |
|  │   └─ right: Subquery         |
|  │       ├─ cacheable: true     |
|  │       └─ Table               |
|  │           ├─ name: test      |
|  │           └─ columns: [name] |
|  │  ]                           |
|  └─ Project                     |
|      ├─ columns: [dual.]        |
|      └─ Table                   |
|          └─ name:               |
+---------------------------------+

The implementation for the InSubquery expression uses a hash table to check for membership, but this requires normalizing the keys based on the table collation. Since we don't appear to do that normalization, we get different hashes for the two strings.

nicktobey avatar Mar 28 '25 21:03 nicktobey