horaedb icon indicating copy to clipboard operation
horaedb copied to clipboard

Deprecate custom UDF time_bucket in favor of date_bin

Open jiacai2050 opened this issue 3 years ago • 7 comments

Describe This Problem

In early version of CeresDB, there is a custom UDF time_bucket to align time of values, but now datafusion support it out of box, so I suggest deprecate our time_bucket, and use data_bin instead. Ex:

SELECT
      DATE_BIN(INTERVAL '15' minute, time, TIMESTAMP '2001-01-01T00:00:00Z') AS time,
      val
    FROM (
      VALUES
        (TIMESTAMP '2021-06-10 17:05:00Z', 0.5),
        (TIMESTAMP '2021-06-10 17:19:10Z', 0.3)
      ) as t (time, val)


        "+---------------------+-----+",
        "| time                | val |",
        "+---------------------+-----+",
        "| 2021-06-10 17:00:00 | 0.5 |",
        "| 2021-06-10 17:15:00 | 0.3 |",
        "+---------------------+-----+",

Proposal

  1. Completely remove it before 1.0 is out.
  2. If cannot remove it, add a deprecation warning.

Additional Context

https://github.com/apache/arrow-datafusion/issues/3015

jiacai2050 avatar Jan 11 '23 02:01 jiacai2050

I wonder whether time_bucket can be implemented by data_bin. And I prefer to keep the time_bucket UDF if it can, because time_bucket has been provided for use in some scenarios.

ShiKaiWi avatar Jan 11 '23 02:01 ShiKaiWi

I think yes, those two functions are almost the same, only with difference in how to set time duration

  • date_bin: INTERVAL '15' minute
  • time_bucket: PT15M

jiacai2050 avatar Jan 11 '23 02:01 jiacai2050

Anyone interested in this issue can try reimplement time_bucket based on date_bin

jiacai2050 avatar Jan 11 '23 04:01 jiacai2050

Found one bug related with time_bucket: https://github.com/CeresDB/ceresdb/issues/563

jiacai2050 avatar Jan 12 '23 07:01 jiacai2050

assigne me.

dust1 avatar Mar 14 '23 13:03 dust1

date_bin seems to have less functionality than time_bucket, and there is a big difference between the two. Do I just need to keep date_bin's date_bin(stride, source, origin) syntax?

                ScalarFunction::DateBin => Ok(date_bin(
                    parse_expr(&args[0], registry)?,
                    parse_expr(&args[1], registry)?,
                    parse_expr(&args[2], registry)?,
                )),

dust1 avatar Mar 20 '23 13:03 dust1

Yes, in fact only first two args is used in time_bucket.

fn new_udf() -> ScalarUdf {
    // args:
    // - timestamp column.
    // - period.
    // - input timestamp format in PARTITION BY (unsed now).
    // - input timezone (ignored now).
    // - timestamp output format (ignored now).
    let func = |args: &[ColumnarValue]| {
        let bucket = TimeBucket::parse_args(args)
            .box_err()
            .context(InvalidArguments)?;

        let result_column = bucket.call().box_err().context(CallFunction)?;

        Ok(ColumnarValue::Array(result_column))
    };

jiacai2050 avatar Mar 20 '23 14:03 jiacai2050