graphql-engine
MSSQL: VARCHAR columns compared against NVARCHAR strings.
Version Information
Server Version: 2.8.4
Environment
CLOUD / OSS
What is the current behaviour?
When comparing a string against a VARCHAR column, the string is formatted as an NVARCHAR string.
WHERE column = N'value'
What is the expected behaviour?
Strings should be compared with strings of the same type.
WHERE column = 'value'
How to reproduce the issue?
See full repro
Thanks Benoit, and for the linked detail. We'll triage this in the next week or so.
Update: we've confirmed that this issue does result in different execution plans.
SQL we are comparing:
Version 1: compare varchar column to varchar string
SELECT * FROM [test_varchar]
WHERE [test_varchar].[id] = 'varchar'
Version 2: compare varchar column to nvarchar string
SELECT * FROM [test_varchar]
WHERE [test_varchar].[id] = N'varchar'
The resulting plans indicate that the second variant includes an implicit conversion step and uses a "Clustered Index Scan" rather than a "Clustered Index Seek".
Version 1
StmtText
----------------------------------------------------------------------
SELECT * FROM [test_varchar]
WHERE [test_varchar].[id] = 'value'
(1 rows affected)
StmtText
-------------------------------------------------------------------------------------------------------------------------------------------------------
|--Clustered Index Seek(OBJECT:([mssql1].[dbo].[test_varchar].[test_varchar_index]), SEEK:([mssql1].[dbo].[test_varchar].[id]=[@1]) ORDERED FORWARD)
(1 rows affected)
Version 2
StmtText
------------------------------------------------------------------------
SELECT * FROM [test_varchar]
WHERE [test_varchar].[id] = N'value'
(1 rows affected)
StmtText
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|--Clustered Index Scan(OBJECT:([mssql1].[dbo].[test_varchar].[test_varchar_index]), WHERE:(CONVERT_IMPLICIT(nvarchar(32),[mssql1].[dbo].[test_varchar].[id],0)=[@1]))
(1 rows affected)
This behaviour comes from the ODBC library. NVARCHAR is 2 bytes per character (so it supports Unicode), but VARCHAR is 1 byte per character.
Currently the library encodes all text values as NVARCHAR:
https://github.com/fpco/odbc/blob/38e04349fe28a91f189e44bd7783220956d18aae/src/Database/ODBC/SQLServer.hs#L542
I suppose the fix would be to change the library to add another constructor such as SimpleTextValue, along with a newtype wrapper for Text that forces us to use it. Then, when emitting literals, we can ensure they match the type of the column they are being compared with.
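To make that concrete, here is a minimal, self-contained sketch of the idea. The names SimpleText, SimpleTextValue, NationalTextValue, and renderLiteral are hypothetical illustrations of the shape of the change, not the odbc library's actual API:

{-# LANGUAGE OverloadedStrings #-}

-- Hypothetical sketch of the proposed change: a newtype that marks Text
-- which should be rendered as a single-byte '...' (VARCHAR) literal,
-- alongside ordinary Text which keeps the N'...' (NVARCHAR) form.
module SimpleTextSketch where

import Data.Text (Text)
import qualified Data.Text as T

-- Wrapper that callers use to opt in to the single-byte literal form.
newtype SimpleText = SimpleText Text
  deriving (Eq, Show)

-- Two literal constructors mirroring the two SQL literal syntaxes.
data TextLiteral
  = NationalTextValue Text -- rendered as N'...' (what the library does today)
  | SimpleTextValue Text   -- rendered as '...'  (the proposed addition)
  deriving (Eq, Show)

-- T-SQL escapes embedded single quotes by doubling them.
escape :: Text -> Text
escape = T.replace "'" "''"

renderLiteral :: TextLiteral -> Text
renderLiteral (NationalTextValue t) = "N'" <> escape t <> "'"
renderLiteral (SimpleTextValue t)   = "'" <> escape t <> "'"

-- The newtype decides which constructor is used, so a literal built from
-- a SimpleText always matches a VARCHAR column's type.
toLiteral :: Either SimpleText Text -> TextLiteral
toLiteral (Left (SimpleText t)) = SimpleTextValue t
toLiteral (Right t)             = NationalTextValue t

With something along these lines, graphql-engine could wrap values in the newtype wherever it knows the target column is VARCHAR, and the rendered predicate would be [column] = 'value' rather than [column] = N'value'.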
Thanks all. Suggesting we timebox the investigation to 1 day.
I think we need to make sure that the column has the type varchar, and that the string literal only contains ASCII characters, before we convert the string literal we get from the query (which AFAIK is encoded in UTF-8) to varchar.
Another approach that @plcplc proposed was to cast the nvarchar literal to varchar. If MSSQL threw an error when such a cast is lossy, I think that would be a good option, but I'm worried MSSQL might silently convert UTF-8 to ASCII even when it can't do so faithfully, such that we could get wrong results. I think this option needs a little bit of research!
I think the reason we're seeing that the index isn't used is that there is an implicit conversion "VARCHAR -> NVARCHAR", which means that it considers the expression [test_varchar].[id] = N'varchar' as though it were CAST([test_varchar].[id] AS NVARCHAR) = N'varchar', which cannot exploit an index on the id column.
For some background, depending on the collation used, a varchar is able to encode different kinds of text. As a consequence, any given varchar might be an ASCII-encoded string, a UTF-8-encoded string, or even e.g. Japanese text in a legacy encoding.
So it's unfortunately too simplistic to equate varchar with ASCII, although I would think that those non-ASCII, non-Unicode collations are quite rare in practice.
Now in order to justify my own suggestion I made some small experiments in a terminal to check my understanding, and the results are disheartening:
Experiment:
create table foo (test varchar(10));
insert into foo values (N'?'), (N'∡'), (N'p'), (N'π'), (N'8'), (N'∞')
(6 rows affected)
select * from foo;

test
----------
?
?
p
p
8
8
select (N'?'), (CAST(N'∡' AS VARCHAR)), (N'p'), (CAST(N'π' AS VARCHAR)), (N'8'), (CAST(N'∞' AS VARCHAR))
? ? p p 8 8
This seems to suggest that the casting to varchar can indeed produce very strange results.
All in all, I think I'm in favor of the approach of not putting N'' quotes on varchar literals iff the provided input is all ASCII.
A fix has now been merged that will use a VARCHAR or CHAR literal instead of NVARCHAR / NCHAR where the input is all ASCII characters. If the input contains non-ASCII characters, then we use the N'' quotes and let MSSQL do the casting as before.
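For reference, the decision described here comes down to something like the following sketch. This is illustrative only, under the assumption stated above (all-ASCII input gets a plain literal); renderTextLiteral is not the name of anything in the merged code:

{-# LANGUAGE OverloadedStrings #-}

-- Illustrative sketch: emit a plain '...' literal only when every character
-- in the input is ASCII; otherwise keep the N'...' form and let MSSQL handle
-- the conversion as before.
module AsciiLiteralSketch where

import Data.Char (isAscii)
import Data.Text (Text)
import qualified Data.Text as T

-- Double embedded single quotes, as T-SQL requires.
escape :: Text -> Text
escape = T.replace "'" "''"

renderTextLiteral :: Text -> Text
renderTextLiteral t
  | T.all isAscii t = "'" <> escape t <> "'"    -- safe to compare against VARCHAR/CHAR
  | otherwise       = "N'" <> escape t <> "'"   -- needs NVARCHAR/NCHAR

-- ghci> renderTextLiteral "varchar"
-- "'varchar'"
-- ghci> renderTextLiteral "π"
-- "N'π'"

This keeps index seeks on VARCHAR columns for the common all-ASCII case while preserving the previous behaviour for genuinely Unicode input.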