horaedb Deprecate custom UDF time_bucket in favor of date

Describe This Problem

In early version of CeresDB, there is a custom UDF time_bucket to align time of values, but now datafusion support it out of box, so I suggest deprecate our time_bucket, and use data_bin instead. Ex:

SELECT
      DATE_BIN(INTERVAL '15' minute, time, TIMESTAMP '2001-01-01T00:00:00Z') AS time,
      val
    FROM (
      VALUES
        (TIMESTAMP '2021-06-10 17:05:00Z', 0.5),
        (TIMESTAMP '2021-06-10 17:19:10Z', 0.3)
      ) as t (time, val)


        "+---------------------+-----+",
        "| time                | val |",
        "+---------------------+-----+",
        "| 2021-06-10 17:00:00 | 0.5 |",
        "| 2021-06-10 17:15:00 | 0.3 |",
        "+---------------------+-----+",

Proposal

Completely remove it before 1.0 is out.
If cannot remove it, add a deprecation warning.

Additional Context

https://github.com/apache/arrow-datafusion/issues/3015

Jan 11 '23 02:01 jiacai2050

I wonder whether time_bucket can be implemented by data_bin. And I prefer to keep the time_bucket UDF if it can, because time_bucket has been provided for use in some scenarios.

Jan 11 '23 02:01 ShiKaiWi

I think yes, those two functions are almost the same, only with difference in how to set time duration

date_bin: INTERVAL '15' minute
time_bucket: PT15M

Jan 11 '23 02:01 jiacai2050

Anyone interested in this issue can try reimplement time_bucket based on date_bin

Jan 11 '23 04:01 jiacai2050

Found one bug related with time_bucket: https://github.com/CeresDB/ceresdb/issues/563

Jan 12 '23 07:01 jiacai2050

assigne me.

Mar 14 '23 13:03 dust1

date_bin seems to have less functionality than time_bucket, and there is a big difference between the two. Do I just need to keep date_bin's date_bin(stride, source, origin) syntax?

                ScalarFunction::DateBin => Ok(date_bin(
                    parse_expr(&args[0], registry)?,
                    parse_expr(&args[1], registry)?,
                    parse_expr(&args[2], registry)?,
                )),

Mar 20 '23 13:03 dust1

Yes, in fact only first two args is used in time_bucket.

fn new_udf() -> ScalarUdf {
    // args:
    // - timestamp column.
    // - period.
    // - input timestamp format in PARTITION BY (unsed now).
    // - input timezone (ignored now).
    // - timestamp output format (ignored now).
    let func = |args: &[ColumnarValue]| {
        let bucket = TimeBucket::parse_args(args)
            .box_err()
            .context(InvalidArguments)?;

        let result_column = bucket.call().box_err().context(CallFunction)?;

        Ok(ColumnarValue::Array(result_column))
    };

Mar 20 '23 14:03 jiacai2050

Deprecate custom UDF time_bucket in favor of date_bin

Describe This Problem

Proposal

Additional Context