datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

`stride` is not optional for new `array_slice` UDF

Open Michael-J-Ward opened this issue 1 year ago • 6 comments

Describe the bug

The array_slice UDF takes 4 parameters.

https://github.com/apache/datafusion/blob/96487ea0cbb7901a1e4aa18fdf6deb8961319fea/datafusion/functions-array/src/extract.rs#L55-L61

Which means that args.len() is always 4 in array_slice_inner, even when called with stride = Expr::Null or stride = Expr::Int64(None)

https://github.com/apache/datafusion/blob/96487ea0cbb7901a1e4aa18fdf6deb8961319fea/datafusion/functions-array/src/extract.rs#L289-L293

To Reproduce

I encountered this while creating the python wrapper for datafusion-python and had to set stride=1 else the function would panic.

https://github.com/apache/datafusion-python/blob/7ad526caf140f7eb76ec541c903027d49693b58f/src/functions.rs#L104-L111

Expected behavior

The stride behavior in the UDF should match what's available in the CLI

CREATE TEMPORARY VIEW data3 AS VALUES ([1.0, 2.0, 3.0, 3.0]), ([4.0, 5.0, 3.0]), ([6.0]);
❯ select * from data3;
+----------------------+
| column1              |
+----------------------+
| [1.0, 2.0, 3.0, 3.0] |
| [4.0, 5.0, 3.0]      |
| [6.0]                |
+----------------------+

> select * from array_slice(data3.column1, -1, 2)
+-----------------------------------------------+
| array_slice(data3.column1,Int64(-1),Int64(2)) |
+-----------------------------------------------+
| []                                            |
| []                                            |
| [6.0]                                         |
+-----------------------------------------------+

Additional context

No response

Michael-J-Ward avatar May 08 '24 16:05 Michael-J-Ward

Thanks @Michael-J-Ward -- what is your ideal outcome? That the UDF function can take three arguments (and default the fourth value to a constant 1)?

alamb avatar May 08 '24 17:05 alamb

Looks like these were last modified in https://github.com/apache/datafusion/pull/9788 / https://github.com/apache/datafusion/pull/9615 from @jayzhan211 and @erenavsarogullari

https://github.com/apache/datafusion/blame/0d283a49f52d91a58c4e88e3e5e76fe0d59c1260/datafusion/functions-array/src/extract.rs#L314

alamb avatar May 08 '24 17:05 alamb

what is your ideal outcome? That the UDF function can take three arguments (and default the fourth value to a constant 1)?

I would expect that there exists some value for stride = Expr (likely Expr::Null or Expr::Int64(None)) such that the behavior of calling the UDF w/ that value matches the behavior of calling the SQL api without it

Roughly:

# rust
expr_fn::array_slice(array, begin, end, stride)
   
   ==

# sql   
`select * from array_slice(array, begin, end)`.

That way datafusion-python can map stride to that value when a user calls f.array_slice(col, literal(2), literal(4), stride=None)

EDIT: We would think stride = 1 should be that.

The frustration encountered with #10425 is that we were unable to trigger the code-path that the SQL API uses from the UDF.

Michael-J-Ward avatar May 08 '24 17:05 Michael-J-Ward

I agree it is helpful if we support expr_fn::array_slice(array, begin, end). We can't only provide three args for it now

But I can't think of any good solution to this

What you are expecting is something not supported in rust -- variadic function

fn array_slice(array:Expr, begin:Expr, end:Expr) {
    // with stride 1
}

fn array_slice(array:Expr, begin:Expr, end:Expr, stride:Expr) {
    // with stride
}

I think the issue here should be handled in datafusion-python

I encountered this while creating the python wrapper for datafusion-python and had to set stride=1 else the function would panic.

@Michael-J-Ward Did you successfully create expression without stride=1 before 38? It would be surprising if it works previously.

jayzhan211 avatar May 09 '24 02:05 jayzhan211

Is it possible to make the array_slice function accept a vector as its parameter?

fn array_slice(args: Vec<Expr>) {
}

When making its UDF function, we need to do

make_udf_function!(
    ArraySlice,
    array_slice,
    "returns a slice of the array.",
    array_slice_udf
);

jonahgao avatar May 09 '24 07:05 jonahgao

Another potential option is to add a second expr fn with a different signature

fn array_slice_with_stride(array:Expr, begin:Expr, end:Expr, stride:Expr) {
...
}

alamb avatar May 09 '24 10:05 alamb

Is it possible to make the array_slice function accept a vector as its parameter?

fn array_slice(args: Vec<Expr>) {
}

When making its UDF function, we need to do

make_udf_function!(
    ArraySlice,
    array_slice,
    "returns a slice of the array.",
    array_slice_udf
);

The downside of this is we loss the clear argument name and we need to check args length.

fn array_slice_with_stride(array:Expr, begin:Expr, end:Expr, stride:Expr) { ... }

I prefer this

jayzhan211 avatar May 10 '24 05:05 jayzhan211

First, I'd like to emphasize that if passing stride=1 to the UDF worked the same as not providing a stride in the SQL api, then this would be just porcelain / a nicety. Reposting the example from: https://github.com/apache/datafusion/issues/10425

CREATE TEMPORARY VIEW data3 AS VALUES ([1.0, 2.0, 3.0, 3.0]), ([4.0, 5.0, 3.0]), ([6.0]);
❯ select * from data3;
+----------------------+
| column1              |
+----------------------+
| [1.0, 2.0, 3.0, 3.0] |
| [4.0, 5.0, 3.0]      |
| [6.0]                |
+----------------------+
❯ select array_slice(column1, -1, 2) from data3;
+-----------------------------------------------+
| array_slice(data3.column1,Int64(-1),Int64(2)) |
+-----------------------------------------------+
| []                                            |
| []                                            |
| [6.0]                                         |
+-----------------------------------------------+
❯ select array_slice(column1, -1, 2, 1) from data3;

thread 'main' panicked at /Users/andy/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-data-51.0.0/src/transform/primitive.rs:31:43:
range end index 9 out of range for slice of length 8

Now to this issue, I agree that I like the explicit arguments versus passing a Vec, and I would hope the solution doesn't require keeping two separate UDFs.

I created a draft with what I would think is the natural rust solution in #10450. I don't know if this violates some datafusion constraint, but all the tests pass.

#[doc = "returns a slice of the array."]
pub fn array_slice(array: Expr, begin: Expr, end: Expr, stride: Option<Expr>) -> Expr {
    if let Some(stride) = stride {
        Expr::ScalarFunction(ScalarFunction::new_udf(
            array_slice_udf(),
            vec![array, begin, end, stride],
        ))
    } else {
        Expr::ScalarFunction(ScalarFunction::new_udf(
            array_slice_udf(),
            vec![array, begin, end],
        ))
    }
}

If that is not possible, then a second proposal would be to amend array_slice_inner w/ some null check.

    let stride = if args_len == 4 & is_not_null(&args[3]) {
        Some(as_int64_array(&args[3])?)
    } else {
        None
    };

But again, this is just a nice-to-have if passing stride=1 worked the same as not providing it in the SQL API.

Michael-J-Ward avatar May 10 '24 19:05 Michael-J-Ward

I believe we encountered the same issue again with regex_replace, so I suspect this issue isn't constrained to array_slice and instead applies to any UDF that previously had an optional argument.

Michael-J-Ward avatar May 10 '24 19:05 Michael-J-Ward

I recall hitting something like this with the substr/substring function. One would think they would be identical however they were not (since rust doesn't do variadic functions nor does it allow defaults for args ala scala or Kotlin). For the substr it was expanded to two functions that internally used a vec<> and handled the logic of using the optional 3rd argument there.

    #[doc = "substring from the `position` to the end"]
    pub fn substr(string: Expr, position: Expr) -> Expr {
        super::substr().call(vec![string, position])
    }

    #[doc = "substring from the `position` with `length` characters"]
    pub fn substring(string: Expr, position: Expr, length: Expr) -> Expr {
        super::substr().call(vec![string, position, length])
    }
    ```

Omega359 avatar May 10 '24 20:05 Omega359

I am +1 for this solution suggested by @Michael-J-Ward

#[doc = "returns a slice of the array."]
pub fn array_slice(array: Expr, begin: Expr, end: Expr, stride: Option<Expr>) -> Expr {
    if let Some(stride) = stride {
        Expr::ScalarFunction(ScalarFunction::new_udf(
            array_slice_udf(),
            vec![array, begin, end, stride],
        ))
    } else {
        Expr::ScalarFunction(ScalarFunction::new_udf(
            array_slice_udf(),
            vec![array, begin, end],
        ))
    }
}

andygrove avatar May 10 '24 20:05 andygrove

I think there are two issues here, one is the panic issue in datafusion-cli and another one is the API design for Expr.

  1. array_slice panic
query ?
select array_slice([1, 2, 3], 1, 2, 1);
----
[1, 2]

query ?
select array_slice([1, 2, 3], -1, 2, 1);
----
PANIC!

The reason for panic in the second command is due to the bug in array_slice

https://github.com/apache/datafusion/blob/933b430839022aaba2c8d8d11d49b3a4ed7af361/datafusion/functions-array/src/extract.rs#L420-L425

We should fix this.

Another issue is that given the current API constraint of Expr, we can not provide only 3 arguments. There are many different approaches like giving vec, or introducing another function or your proposal

#[doc = "returns a slice of the array."]
pub fn array_slice(array: Expr, begin: Expr, end: Expr, stride: Option<Expr>) -> Expr {
    if let Some(stride) = stride {
        Expr::ScalarFunction(ScalarFunction::new_udf(
            array_slice_udf(),
            vec![array, begin, end, stride],
        ))
    } else {
        Expr::ScalarFunction(ScalarFunction::new_udf(
            array_slice_udf(),
            vec![array, begin, end],
        ))
    }
}

I think this is more a user experience problem, how should we design it is discussable.

Given the design here, we still need to provide 4 args, where you provide None to stride if you expect stride with 1.

TLDR: The panic issue should be fixed not by redesigning the Expr API, but by fixing the internal array slice code.

jayzhan211 avatar May 11 '24 01:05 jayzhan211

Agreed @jayzhan211, these are two separate issues.

The panic issue was filed separately as #10425

Michael-J-Ward avatar May 11 '24 01:05 Michael-J-Ward

I think this is more a user experience problem, how should we design it is discussable.

It is more than a user experience issue. The current API is causing test failures in a downstream project (the DataFusion Python bindings) and this cannot be resolved without the API changes being discussed in this issue.

andygrove avatar May 11 '24 13:05 andygrove

Since the regexp_* UDFs have the same problem, I suspect that array_slice was just our first encounter w/ the underlying issue: any "inner" UDF implementation that behaves differently based on the number of arguments passed should expose that via its public API.

Solutions discussed:

  1. Expose multiple public apis for each variant, as @Omega359 pointed out that inner substr exposes public substr and substring.

  2. Change the public APIs to take Option<Expr>, as I demoed for array_slice in #10450.

  3. Having the public API just take a Vec<Expr>

Michael-J-Ward avatar May 11 '24 17:05 Michael-J-Ward

The current API is causing test failures in a downstream project (the DataFusion Python bindings) and this cannot be resolved without the API changes being discussed in this issue.

I believe the failure occurred because of the bug I mentioned. Can you provide another example where df-python still experiences a panic?

#[pyfunction]
#[pyo3(signature = (array, begin, end, stride = 1))]
fn array_slice(array: PyExpr, begin: PyExpr, end: PyExpr, stride: Option<i64>) -> PyExpr {
    let stride = ScalarValue::Int64(stride);
    let stride = Expr::Literal(stride);
    datafusion_functions_array::expr_fn::array_slice(array.into(), begin.into(), end.into(), stride)
        .into()
}

If it is because of the Expr design. Given these 3, I would prefer the second one.

jayzhan211 avatar May 11 '24 23:05 jayzhan211

I submitted a formal PR #10469 to address this.

jonahgao avatar May 12 '24 14:05 jonahgao