ksql icon indicating copy to clipboard operation
ksql copied to clipboard

Clearer behavior when using UDAFs with *

Open reneesoika opened this issue 2 years ago • 3 comments

Currently, using * in a UDAF defaults to the ROWTIME column. For example, if you run:

CREATE STREAM test (ID STRING) WITH (KAFKA_TOPIC='test', PARTITIONS=1, VALUE_FORMAT='JSON');
CREATE TABLE test_result AS SELECT ID, MAX(*) FROM test GROUP BY ID;

INSERT INTO test VALUES ('hello');
INSERT INTO test VALUES ('hello');
INSERT INTO test VALUES ('hello');
INSERT INTO test VALUES ('test');

Sample result:

SELECT * FROM TEST_RESULT;
+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+
|ID                                                                                  |KSQL_COL_0                                                                          |
+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+
|hello                                                                               |1658336055224                                                                       |
|test                                                                                |1658336060177                                                                       |

If you try to use a UDAF that does not support longs, ksql fails to resolve the function. This error message might be confusing to users if they don't realize that it's using ROWTIME automatically. I deleted the avg function for longs for the sake of testing:

CREATE TABLE test_result AS SELECT ID, AVG(*) FROM test GROUP BY ID;
Could not determine output schema for query due to error: Function 'avg' does not accept parameters (BIGINT).
  • Is this behavior guaranteed to users, or is this undefined behavior?
  • Should this behavior be documented somewhere?
  • Should * be disallowed for most functions? This is what Postgres and MySQL do.

cc: @jzaralim

reneesoika avatar Jul 20 '22 17:07 reneesoika

Piggybacking off of this issue - calling a UDAF with empty args does the same thing. AVG() takes the average of ROWTIME.

jzaralim avatar Jul 21 '22 20:07 jzaralim

@JimGalasyn , can we update the documentation with this explicit behavior for now ?

suhas-satish avatar Jul 26 '22 16:07 suhas-satish

I think we should at least figure out how we want * to behave in the future. Can we agree on the following fixes in the longer run?

  1. * should be disallowed for most functions.
  2. Empty args should be disallowed in all functions. None of our existing UDAFs are documented to support empty args. Functions like COUNT must be called with a *.
  3. * should refer to the entire record (very long run)

jzaralim avatar Jul 26 '22 17:07 jzaralim