query performance degraded when using "where <int> in (<string>)"
What type of bug is this?
Performance issue
What subsystems are affected?
Standalone mode
Minimal reproduce step
Seen in some private deployment, the query time can be drastically different:
if select * from foo where id in ('1') and ts > '2024-03-09 00:00:00' and ts < '2024-03-09 16:00:00', it costs 2.7 seconds; if select * from foo where id in (1) and ts > '2024-03-09 00:00:00' and ts < '2024-03-09 16:00:00', it costs 58 millis. Something unexpected must happen, we should digging deeper.
What did you expect to see?
Same query time, or return some error saying int can't "in" string.
What did you see instead?
Very different time costs.
What operating system did you use?
macos
What version of GreptimeDB did you use?
main
Relevant log output and stack trace
No response
I test similar cases in MySQL. It truncates it by default and yields a warning, which is a little confusing. I think we could convert the string literal in the planner phase and return an error if the string can't be converted to the target type.
mysql> select * from test_check where a > 'abc';
+---+
| a |
+---+
| 1 |
+---+
1 row in set, 1 warning (0.00 sec)
mysql> show warnings;
+---------+------+-----------------------------------------+
| Level | Code | Message |
+---------+------+-----------------------------------------+
| Warning | 1292 | Truncated incorrect DOUBLE value: 'abc' |
+---------+------+-----------------------------------------+
1 row in set (0.00 sec)
mysql> explain analyze select * from test_check where a > 'abc';
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Filter: (test_check.a > 0) (cost=0.35 rows=1) (actual time=0.0323..0.0366 rows=1 loops=1)
-> Covering index scan on test_check using PRIMARY (cost=0.35 rows=1) (actual time=0.03..0.0341 rows=1 loops=1)
|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set, 1 warning (0.00 sec)
PostgreSQL returns an error if it can't convert the string literal to an integer. We can follow this behavior.
postgres=# select * from test_int where a > 'abc';
ERROR: invalid input syntax for type integer: "abc"
LINE 1: select * from test_int where a > 'abc';
^
postgres=# select * from test_int where a > '1';
a
---
3
5
(2 rows)
postgres=# select * from test_int where a > '1ab';
ERROR: invalid input syntax for type integer: "1ab"
LINE 1: select * from test_int where a > '1ab';
^
postgres=# explain select * from test_int where a > '1';
QUERY PLAN
--------------------------------------------------------------------------------------
Index Only Scan using test_int_pkey on test_int (cost=0.15..27.13 rows=850 width=4)
Index Cond: (a > 1)
(2 rows)
postgres=# explain select * from test_int where a > '1ab';
ERROR: invalid input syntax for type integer: "1ab"
LINE 1: explain select * from test_int where a > '1ab';
@waynexia @evenyag
Are you working on this issue? If not, I'm interested in this issue.
PostgreSQL returns an error if it can't convert the string literal to an integer. We can follow this behavior.
My plan is to add the logic for checking whether the string literal can be converted to the column type in TypeConverter::convert_type.
https://github.com/GreptimeTeam/greptimedb/blob/69d9a2845f6328ca2c2d4396c11a79243a956506/src/query/src/optimizer/type_conversion.rs#L168-L203
@NiwakaDev No, I'm not working on this. Free free to have a try!