datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

coercion rule about `eq` and InList between string type and numeric type

Open liukun4515 opened this issue 3 years ago • 8 comments

I have some concerns about the rule between string and number. I check some situation in the spark:

spark-sql> desc t3;
c1                      int

spark-sql> explain extended select * from t3 where c1 = cast(123.123 as string);
== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('c1 = cast(123.123 as string))
   +- 'UnresolvedRelation [t3], [], false

== Analyzed Logical Plan ==
c1: int
Project [c1#186]
+- Filter (c1#186 = cast(cast(123.123 as string) as int))
   +- SubqueryAlias spark_catalog.default.t3
      +- HiveTableRelation [`default`.`t3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#186], Partition Cols: []]

== Optimized Logical Plan ==
Filter (isnotnull(c1#186) AND (c1#186 = 123))
+- HiveTableRelation [`default`.`t3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#186], Partition Cols: []]

== Physical Plan ==
*(1) Filter (isnotnull(c1#186) AND (c1#186 = 123))
+- Scan hive default.t3 [c1#186], HiveTableRelation [`default`.`t3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#186], Partition Cols: []]

In the previous case, the result of coercion is Int. I think we need to create an issue to track this. @viirya @alamb

Originally posted by @liukun4515 in https://github.com/apache/arrow-datafusion/pull/2794#discussion_r907011527

liukun4515 avatar Jun 27 '22 06:06 liukun4515

cc @viirya

liukun4515 avatar Jun 27 '22 06:06 liukun4515

I'm going to implement casting between UTF8 and decimal, we can also consider these cases together.

liukun4515 avatar Jun 27 '22 07:06 liukun4515

I wonder if the question here is "should we automatically try and coerce numbers to strings (which is more general but slower) or coerce strings to numbers (which is less general but faster)"?

alamb avatar Jun 27 '22 17:06 alamb

I wonder if the question here is "should we automatically try and coerce numbers to strings (which is more general but slower) or coerce strings to numbers (which is less general but faster)"?

yes, we should try some other database system and discuss the best behavior for these case

liukun4515 avatar Jun 28 '22 03:06 liukun4515

Here is what postgres does (appears to me to cast based on the type of the first element in the IN list):

alamb=# select 5 in (1, 2, 'fff');
ERROR:  invalid input syntax for type integer: "fff"
LINE 1: select 5 in (1, 2, 'fff');
                           ^
alamb=# select 'foo' in (1, 2, 'fff');
ERROR:  invalid input syntax for type integer: "fff"
LINE 1: select 'foo' in (1, 2, 'fff');
                               ^
alamb=# select 'foo' in ('1', '2', 'fff');
 ?column? 
----------
 f
(1 row)

alamb avatar Jun 28 '22 18:06 alamb

Here is what postgres does (appears to me to cast based on the type of the first element in the IN list)

This sounds reasonable.

viirya avatar Jun 28 '22 19:06 viirya

Here is what postgres does (appears to me to cast based on the type of the first element in the IN list):

alamb=# select 5 in (1, 2, 'fff');
ERROR:  invalid input syntax for type integer: "fff"
LINE 1: select 5 in (1, 2, 'fff');
                           ^
alamb=# select 'foo' in (1, 2, 'fff');
ERROR:  invalid input syntax for type integer: "fff"
LINE 1: select 'foo' in (1, 2, 'fff');
                               ^
alamb=# select 'foo' in ('1', '2', 'fff');
 ?column? 
----------
 f
(1 row)

Do you try this?

select 1 in (12,1,3,'123');

liukun4515 avatar Jul 04 '22 13:07 liukun4515

Do you try this?

psql (14.3)
Type "help" for help.

alamb=# select 1 in (12,1,3,'123');
 ?column? 
----------
 t
(1 row)

alamb avatar Jul 05 '22 16:07 alamb