datafusion
datafusion copied to clipboard
Bug detecting datatype in VALUES tuples
Describe the bug
When executing an INSERT INTO table VALUES (value1), (value2); statement, the auto type detection on the plan created for the values tuples breaks valid inputs due to how the data type detection works.
I'm pretty sure the bug is here:
https://github.com/apache/datafusion/blob/b2ac83ff821b8434cfcc9b3391d7039120c7b259/datafusion/expr/src/logical_plan/builder.rs#L187-L219
This logic coerces all value tuples to cast to the type of the value in the first row. In the case of an insert into a UInt64 column, this can end up needlessly attempting to cast a valid UInt64 to a Int64 before being casted back to a UInt64 during plan execution.
To Reproduce
Can be verified easily via cli:
> CREATE TABLE test(a BIGINT UNSIGNED);
0 row(s) fetched.
Elapsed 0.026 seconds.
> INSERT INTO test(a) VALUES (7378031881458185744), (12775823699315690233);
Arrow error: Cast error: Can't cast value 12775823699315690233 to type Int64
Expected behavior
I would have expected two rows to be inserted rather than receiving an error.
Additional context
No response
This logic coerces all value tuples to cast to the type of the value in the first row.
it would be better to coerce all values to their common super type (order insensitive, nor biased to the first row)
In the case of an insert into a UInt64 column, this can end up needlessly attempting to cast a valid UInt64 to a Int64 before being casted back to a UInt64 during plan execution.
That's a good point. In INSERT's case one could want each value to independently get coerced to the target type.
However, this would be INSERT-specific special behavior which could lead to hard-to-understand inconsistencies later.
INSERT INTO <table> VALUES ... is just a case of INSERT INTO <table <select-query>.
INSERT INTO <table> VALUES ... should behave exactly same as INSERT INTO <table> SELECT * FROM (VALUES ...) or INSERT INTO <table> SELECT * FROM (VALUES ...) WHERE true. In those more elaborate cases it's obvious that VALUES need to coerce to their common type.
I agree with @findepi that a more general solution to this problem is to extend the VALUES coercion to pick the widest super type that works for all values in a column
While using the available types for insert does seem a reasonable idea, I think following the pattern of other coercions would be better