datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Bug detecting datatype in VALUES tuples

Open davisp opened this issue 1 year ago • 2 comments

Describe the bug

When executing an INSERT INTO table VALUES (value1), (value2); statement, the auto type detection on the plan created for the values tuples breaks valid inputs due to how the data type detection works.

I'm pretty sure the bug is here:

https://github.com/apache/datafusion/blob/b2ac83ff821b8434cfcc9b3391d7039120c7b259/datafusion/expr/src/logical_plan/builder.rs#L187-L219

This logic coerces all value tuples to cast to the type of the value in the first row. In the case of an insert into a UInt64 column, this can end up needlessly attempting to cast a valid UInt64 to a Int64 before being casted back to a UInt64 during plan execution.

To Reproduce

Can be verified easily via cli:

> CREATE TABLE test(a BIGINT UNSIGNED);
0 row(s) fetched.
Elapsed 0.026 seconds.

> INSERT INTO test(a) VALUES (7378031881458185744), (12775823699315690233);
Arrow error: Cast error: Can't cast value 12775823699315690233 to type Int64

Expected behavior

I would have expected two rows to be inserted rather than receiving an error.

Additional context

No response

davisp avatar Aug 21 '24 21:08 davisp

This logic coerces all value tuples to cast to the type of the value in the first row.

it would be better to coerce all values to their common super type (order insensitive, nor biased to the first row)

In the case of an insert into a UInt64 column, this can end up needlessly attempting to cast a valid UInt64 to a Int64 before being casted back to a UInt64 during plan execution.

That's a good point. In INSERT's case one could want each value to independently get coerced to the target type. However, this would be INSERT-specific special behavior which could lead to hard-to-understand inconsistencies later. INSERT INTO <table> VALUES ... is just a case of INSERT INTO <table <select-query>. INSERT INTO <table> VALUES ... should behave exactly same as INSERT INTO <table> SELECT * FROM (VALUES ...) or INSERT INTO <table> SELECT * FROM (VALUES ...) WHERE true. In those more elaborate cases it's obvious that VALUES need to coerce to their common type.

findepi avatar Aug 23 '24 20:08 findepi

I agree with @findepi that a more general solution to this problem is to extend the VALUES coercion to pick the widest super type that works for all values in a column

While using the available types for insert does seem a reasonable idea, I think following the pattern of other coercions would be better

alamb avatar Aug 27 '24 13:08 alamb