Check the udf output size which should be equal to the input size

Open doki23 opened this issue 2 years ago • 6 comments

Hmmm... or should we check the result size of the UDF? I'm not sure whether it's proper for the sizes of the input and the result to differ. cc @alamb @mingmwang @tustvold

Originally posted by @doki23 in https://github.com/apache/arrow-datafusion/issues/5635#issuecomment-1475092781

doki23 avatar Mar 25 '23 12:03 doki23

@alamb @doki23 it seems that this issue is fixed?

zhenglin-charlie-li avatar Feb 15 '24 22:02 zhenglin-charlie-li

> @alamb @doki23 it seems that this issue is fixed?

I'm not sure :(

doki23 avatar Feb 19 '24 03:02 doki23

I think the idea of this ticket was to add some basic checks / asserts to ensure that the output of UDFs has the correct size.

As I understand it, this would mean adding an assert (or checking whether one already exists) that the number of output rows from accumulators is correct.

Maybe somewhere in

https://github.com/apache/arrow-datafusion/blob/main/datafusion/physical-plan/src/aggregates/row_hash.rs

alamb avatar Feb 19 '24 06:02 alamb

Hi, I would like to take this issue.

duongcongtoai avatar Apr 13 '24 14:04 duongcongtoai

So the idea here is that we add a check after invoking a ScalarUDF that the number of rows that came out is the same as the number that went in. If this is not the case, DataFusion should raise an internal error with a clear error message.

alamb avatar Apr 13 '24 20:04 alamb
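
For readers following the thread, the check described in the comment above can be sketched roughly as follows. This is a minimal, standalone illustration and not DataFusion's actual code: the `Value` enum, the `invoke_checked` helper, and the error text are all hypothetical stand-ins, and skipping the check for scalar results (on the assumption that a scalar logically applies to every row) is also an assumption.

```rust
/// Hypothetical stand-in for a columnar result; NOT DataFusion's real type.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    /// One output value per input row.
    Array(Vec<i64>),
    /// A single value assumed to apply to every input row.
    Scalar(i64),
}

/// Invoke `udf` on `args`, then verify that an array result has exactly
/// as many rows as the input. Hypothetical helper for illustration only.
fn invoke_checked<F>(name: &str, udf: F, args: &[i64]) -> Result<Value, String>
where
    F: Fn(&[i64]) -> Value,
{
    let expected = args.len();
    let out = udf(args);

    // Only array results carry a row count that can disagree with the input.
    if let Value::Array(rows) = &out {
        if rows.len() != expected {
            return Err(format!(
                "Internal error: UDF '{}' returned {} rows, expected {}",
                name,
                rows.len(),
                expected
            ));
        }
    }
    Ok(out)
}
```

With this sketch, a well-behaved function such as `|xs: &[i64]| Value::Array(xs.iter().map(|x| x * 2).collect())` passes through unchanged, while a function that drops a row produces an error naming the function together with the actual and expected row counts.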

This was implemented in this PR (and we fixed 2 existing UDFs that violated this constraint).

duongcongtoai avatar Apr 27 '24 05:04 duongcongtoai