r-polars icon indicating copy to clipboard operation
r-polars copied to clipboard

`NA` should be automatically cast to the right type when creating a `DataFrame` with list-columns

Open etiennebacher opened this issue 1 year ago • 20 comments

Very annoying to have to specify NA_integer_ or the other variants:

library(polars)
options(polars.do_not_repeat_call = TRUE)

pl$DataFrame(x = list(1L, 2L, NA_integer_))
#> shape: (3, 1)
#> ┌───────────┐
#> │ x         │
#> │ ---       │
#> │ list[i32] │
#> ╞═══════════╡
#> │ [1]       │
#> │ [2]       │
#> │ [null]    │
#> └───────────┘

# NA should be automatically cast to the correct type
pl$DataFrame(x = list(1L, 2L, NA))
#> Error: Execution halted with the following contexts
#>    0: In R: in $DataFrame():
#>    1: When constructing polars literal from Robj
#>    2: Encountered the following error in Rust-Polars:
#>          expected Series

etiennebacher avatar Feb 08 '24 17:02 etiennebacher

Smaller reprex:

pl$lit(list(1L, 2L, NA))

etiennebacher avatar Feb 11 '24 19:02 etiennebacher

Perhaps a mechanism is needed to guess the type and cast it.

I don't know how type conversion is currently done between R and Polars. Related to #425 and #592

eitsupi avatar Feb 12 '24 03:02 eitsupi

There's already a mechanism actually, @sorhawell left this comment here:

https://github.com/pola-rs/r-polars/blob/4e9160a5a3a9f87fa756e7f92932a357340ec179/src/rust/src/conversion_r_to_s.rs#L288-L305

etiennebacher avatar Feb 12 '24 20:02 etiennebacher

I noticed that the arrow package does not support this either.

> arrow::arrow_table(x = list(1L, 2L, NA))
Error: Invalid: cannot convert

> arrow::arrow_table(x = list(1L, 2L, NA_integer_))
Table
3 rows x 1 columns
$x: list<item <int32>>

eitsupi avatar Feb 23 '24 05:02 eitsupi

I think using vctrs::list_of() instead of list() is simpler solution for now. I believe this function has been widely used in the tidyverse world for years.

eitsupi avatar Apr 11 '24 16:04 eitsupi

I was able to make this work in an implementation I am rewriting from scratch using savvy.

> as_polars_series(list(1L, 2L, NA))
shape: (3,)
Series: '' [list[i32]]
[
        [1]
        [2]
        [null]
]

I think the implementation is far cleaner now that I've made it completely branch on the S3 method.

To bring this into this repository would require a considerable rewrite of what is currently there, and I don't have the bandwidth to do it right now, but it should be possible.

as_polars_series.list <- function(x, name = NULL, ...) {
  series_list <- lapply(x, \(child) {
    if (is.null(child)) {
      NULL
    } else {
      as_polars_series(child)$`_s`
    }
  })

  PlRSeries$new_series_list(name %||% "", series_list) |>
    wrap()
}
    fn new_series_list(name: &str, values: ListSexp) -> savvy::Result<Self> {
        let series_vec: Vec<Option<Series>> = values
            .values_iter()
            .map(|value| match value.into_typed() {
                TypedSexp::Null(_) => None,
                TypedSexp::Environment(e) => {
                    let ptr = e.get(".ptr").unwrap().unwrap();
                    let r_series = <&PlRSeries>::try_from(ptr).unwrap();
                    Some(r_series.series.clone())
                }
                _ => panic!("Expected a list of Series"),
            })
            .collect();

        let dtype = series_vec
            .iter()
            .map(|s| {
                if let Some(s) = s {
                    s.dtype().clone()
                } else {
                    DataType::Null
                }
            })
            .reduce(|acc, b| try_get_supertype(&acc, &b).unwrap_or(DataType::String))
            .unwrap_or(DataType::Null);

        let casted_series_vec: Vec<Option<Series>> = series_vec
            .into_iter()
            .map(|s| {
                if let Some(s) = s {
                    Some(s.cast(&dtype).unwrap())
                } else {
                    None
                }
            })
            .collect();

        Ok(Series::new(name, casted_series_vec).into())
    }

eitsupi avatar Jun 14 '24 14:06 eitsupi

Related change in Python Polars 1.0.0 pola-rs/polars#16939 Perhaps two functions may be needed on the Rust side for list, one to autocast and one not to autocast.

eitsupi avatar Jun 22 '24 11:06 eitsupi

Perhaps two functions may be needed on the Rust side for list, one to autocast and one not to autocast.

Probably but I don't think it's related to this issue. The average R user doesn't know (and I think also doesn't need to know) all the variants of NA. Passing NA in a list should work automatically, just like it does when we pass a vector.

etiennebacher avatar Jun 22 '24 13:06 etiennebacher

Ultimately the problem here is that each element in R's list is not guaranteed to have the same type; Apache Arrow's list requires that all elements have the same type, so converting R's list to Arrow's list can either fail without cast or or cast.

Users who have been using dplyr for a few years would know that we had to use explicit type of NA because specifying NA in something like case_when() would fail due to a type mismatch. This has been fixed by vctrs (tidyverse/dplyr#6300), so we may be able to implement the same thing (cast NA) here by looking at the vctrs implementation.

eitsupi avatar Jun 22 '24 13:06 eitsupi

Users who have been using dplyr for a few years would know that we had to use explicit type of NA because specifying NA in something like case_when() would fail due to a type mismatch.

Yes but that was far from ideal and raised confusion, which is why they changed the behavior with vctrs. One way to handle that internally would probably be to get the supertype of all non-NA values and then cast NA to this supertype (don't know how to do that though).

etiennebacher avatar Jun 22 '24 13:06 etiennebacher

One way to handle that internally would probably be to get the supertype of all non-NA values and then cast NA to this supertype (don't know how to do that though).

If you want to do it on the R side, you probably need to ask for help from the vctrs package.

If you want to do it on the Rust side, the following approach I took with neo-r-polars (I found the original process for this by searching in the polars repository). https://github.com/eitsupi/neo-r-polars/blob/e4f953ff1f5ea830c777af0d5211e144adeceedf/src/rust/src/series/construction.rs#L76-L97

My concern is that this is extra processing being performed to find the supertype, and that if even one string is mixed in the list, the whole thing becomes string type. I imagine cases where implicit casts are not desired.

eitsupi avatar Jun 22 '24 13:06 eitsupi

I was thinking doing it on the Rust side.

if even one string is mixed in the list, the whole thing becomes string type. I imagine cases where implicit casts are not desired.

That's where the strict parameter is needed. To clarify, I do think this parameter is important, I just think it should ignore NA and therefore is unrelated to this issue. Here's the current behavior with the beta of py-polars 1.0:

>>> pl.Series([[1, 2, "a"]], strict=True) # error

>>> pl.Series([[1, 2, "a"]], strict=False)
shape: (1,)
Series: '' [list[str]]
[
        ["1", "2", "a"]
]

etiennebacher avatar Jun 22 '24 14:06 etiennebacher

I guess it depends on how far we extend the "ignore NA" thing. For example, is it only length 1, or is the length arbitrary and all elements are polars' null?

I can also create a vector of length 0 in R, which in Polars would be a Series of length 0 with an explicit type, how should this ideally work?

> polars::as_polars_series(list(character(), integer()))
Error: Execution halted with the following contexts
   0: In R: in as_polars_series():
   0: During function call [polars::as_polars_series(list(character(), integer()))]
   1: Encountered the following error in Rust-Polars:
        When building series from R list; some parsed sub-elements did not match: One element was str and another was i32

eitsupi avatar Jun 22 '24 14:06 eitsupi

In my opinion NULL should be used to represent missing values in R's list, and currently NULL works properly in Polars as well, being a Series of type null. is there any reason why you would want to use NA instead of NULL?

> polars::as_polars_series(list(1, 2, NULL, 3))
polars Series: shape: (4,)
Series: '' [list[f64]]
[
        [1.0]
        [2.0]
        []
        [3.0]
]

Edit: The implementation of neo-r-polars would look like this. Perhaps this is preferable?

> as_polars_series(list(1, 2, NULL, 3))
shape: (4,)
Series: '' [list[f64]]
[
        [1.0]
        [2.0]
        null
        [3.0]
]

eitsupi avatar Jun 22 '24 14:06 eitsupi

Now works on the next branch.

library(neopolars)

as_polars_series(list(1, NULL, 1L), strict = TRUE)
#> Error in `as_polars_series()`:
#> ! Evaluation failed.
#> Caused by error:
#> ! If `strict = TRUE`, all elements of the list except `NULL` must have the same datatype. expected: `f64`, got: `i32` at index: 3
as_polars_series(list(1, NULL, 1L))
#> shape: (3,)
#> Series: '' [list[f64]]
#> [
#>  [1.0]
#>  null
#>  [1.0]
#> ]

Created on 2024-09-06 with reprex v2.1.1

Probably the safest choice for most users is to use vctrs::list_of().

eitsupi avatar Sep 06 '24 08:09 eitsupi

Does it work with NA as in the example of the first post?

etiennebacher avatar Sep 06 '24 08:09 etiennebacher

No, only NULLs are treated specially; use strict=FALSE to cast NA to NA_integer_.

eitsupi avatar Sep 06 '24 08:09 eitsupi

I was initially going to treat only the polars' Null data type specially, but gave up when I realized that when passing a vector of Series to the Series::new() function, if the first element is Null, all subsequent Series will be cast to Null and the values will all be null. https://github.com/pola-rs/r-polars/blob/5d5be1f4e1c2cb96ed1450fb270a14a695fc5612/src/rust/src/series/construction.rs#L104-L113

Since even Series of length 0 and None are treated differently in this way, it is quite distorted to give special treatment to NAs that fall under Series of length 1.

eitsupi avatar Sep 06 '24 08:09 eitsupi

No, only NULLs are treated specially; use strict=FALSE to cast NA to NA_integer_.

In this case, could we have strict=FALSE by default? I don't think many average users know the different variants of NA and want to use NA whatever the datatype

etiennebacher avatar Sep 06 '24 10:09 etiennebacher

In this case, could we have strict=FALSE by default? I don't think many average users know the different variants of NA and want to use NA whatever the datatype

Of course it would be better that way. vctrs::list_of() also automatically casts the super type by default.

https://github.com/pola-rs/r-polars/blob/5d5be1f4e1c2cb96ed1450fb270a14a695fc5612/R/as_polars_series.R#L274

The situation is so different from Python that perhaps the default value should be FALSE even if this argument is to be exposed to pl$Series() (it is not currently exposed).

eitsupi avatar Sep 06 '24 11:09 eitsupi