r-polars
r-polars copied to clipboard
`NA` should be automatically cast to the right type when creating a `DataFrame` with list-columns
Very annoying to have to specify NA_integer_
or the other variants:
library(polars)
options(polars.do_not_repeat_call = TRUE)
pl$DataFrame(x = list(1L, 2L, NA_integer_))
#> shape: (3, 1)
#> ┌───────────┐
#> │ x │
#> │ --- │
#> │ list[i32] │
#> ╞═══════════╡
#> │ [1] │
#> │ [2] │
#> │ [null] │
#> └───────────┘
# NA should be automatically cast to the correct type
pl$DataFrame(x = list(1L, 2L, NA))
#> Error: Execution halted with the following contexts
#> 0: In R: in $DataFrame():
#> 1: When constructing polars literal from Robj
#> 2: Encountered the following error in Rust-Polars:
#> expected Series
Smaller reprex:
pl$lit(list(1L, 2L, NA))
Perhaps a mechanism is needed to guess the type and cast it.
I don't know how type conversion is currently done between R and Polars. Related to #425 and #592
There's already a mechanism actually, @sorhawell left this comment here:
https://github.com/pola-rs/r-polars/blob/4e9160a5a3a9f87fa756e7f92932a357340ec179/src/rust/src/conversion_r_to_s.rs#L288-L305
I noticed that the arrow package does not support this either.
> arrow::arrow_table(x = list(1L, 2L, NA))
Error: Invalid: cannot convert
> arrow::arrow_table(x = list(1L, 2L, NA_integer_))
Table
3 rows x 1 columns
$x: list<item <int32>>
I think using vctrs::list_of()
instead of list()
is simpler solution for now.
I believe this function has been widely used in the tidyverse world for years.
I was able to make this work in an implementation I am rewriting from scratch using savvy.
> as_polars_series(list(1L, 2L, NA))
shape: (3,)
Series: '' [list[i32]]
[
[1]
[2]
[null]
]
I think the implementation is far cleaner now that I've made it completely branch on the S3 method.
To bring this into this repository would require a considerable rewrite of what is currently there, and I don't have the bandwidth to do it right now, but it should be possible.
as_polars_series.list <- function(x, name = NULL, ...) {
series_list <- lapply(x, \(child) {
if (is.null(child)) {
NULL
} else {
as_polars_series(child)$`_s`
}
})
PlRSeries$new_series_list(name %||% "", series_list) |>
wrap()
}
fn new_series_list(name: &str, values: ListSexp) -> savvy::Result<Self> {
let series_vec: Vec<Option<Series>> = values
.values_iter()
.map(|value| match value.into_typed() {
TypedSexp::Null(_) => None,
TypedSexp::Environment(e) => {
let ptr = e.get(".ptr").unwrap().unwrap();
let r_series = <&PlRSeries>::try_from(ptr).unwrap();
Some(r_series.series.clone())
}
_ => panic!("Expected a list of Series"),
})
.collect();
let dtype = series_vec
.iter()
.map(|s| {
if let Some(s) = s {
s.dtype().clone()
} else {
DataType::Null
}
})
.reduce(|acc, b| try_get_supertype(&acc, &b).unwrap_or(DataType::String))
.unwrap_or(DataType::Null);
let casted_series_vec: Vec<Option<Series>> = series_vec
.into_iter()
.map(|s| {
if let Some(s) = s {
Some(s.cast(&dtype).unwrap())
} else {
None
}
})
.collect();
Ok(Series::new(name, casted_series_vec).into())
}
Related change in Python Polars 1.0.0 pola-rs/polars#16939 Perhaps two functions may be needed on the Rust side for list, one to autocast and one not to autocast.
Perhaps two functions may be needed on the Rust side for list, one to autocast and one not to autocast.
Probably but I don't think it's related to this issue. The average R user doesn't know (and I think also doesn't need to know) all the variants of NA
. Passing NA
in a list should work automatically, just like it does when we pass a vector.
Ultimately the problem here is that each element in R's list is not guaranteed to have the same type; Apache Arrow's list requires that all elements have the same type, so converting R's list to Arrow's list can either fail without cast or or cast.
Users who have been using dplyr for a few years would know that we had to use explicit type of NA because specifying NA
in something like case_when()
would fail due to a type mismatch.
This has been fixed by vctrs (tidyverse/dplyr#6300), so we may be able to implement the same thing (cast NA) here by looking at the vctrs implementation.
Users who have been using dplyr for a few years would know that we had to use explicit type of NA because specifying NA in something like case_when() would fail due to a type mismatch.
Yes but that was far from ideal and raised confusion, which is why they changed the behavior with vctrs
. One way to handle that internally would probably be to get the supertype of all non-NA values and then cast NA to this supertype (don't know how to do that though).
One way to handle that internally would probably be to get the supertype of all non-NA values and then cast NA to this supertype (don't know how to do that though).
If you want to do it on the R side, you probably need to ask for help from the vctrs package.
If you want to do it on the Rust side, the following approach I took with neo-r-polars (I found the original process for this by searching in the polars repository). https://github.com/eitsupi/neo-r-polars/blob/e4f953ff1f5ea830c777af0d5211e144adeceedf/src/rust/src/series/construction.rs#L76-L97
My concern is that this is extra processing being performed to find the supertype, and that if even one string is mixed in the list, the whole thing becomes string type. I imagine cases where implicit casts are not desired.
I was thinking doing it on the Rust side.
if even one string is mixed in the list, the whole thing becomes string type. I imagine cases where implicit casts are not desired.
That's where the strict
parameter is needed. To clarify, I do think this parameter is important, I just think it should ignore NA
and therefore is unrelated to this issue. Here's the current behavior with the beta of py-polars 1.0:
>>> pl.Series([[1, 2, "a"]], strict=True) # error
>>> pl.Series([[1, 2, "a"]], strict=False)
shape: (1,)
Series: '' [list[str]]
[
["1", "2", "a"]
]
I guess it depends on how far we extend the "ignore NA" thing. For example, is it only length 1, or is the length arbitrary and all elements are polars' null?
I can also create a vector of length 0 in R, which in Polars would be a Series of length 0 with an explicit type, how should this ideally work?
> polars::as_polars_series(list(character(), integer()))
Error: Execution halted with the following contexts
0: In R: in as_polars_series():
0: During function call [polars::as_polars_series(list(character(), integer()))]
1: Encountered the following error in Rust-Polars:
When building series from R list; some parsed sub-elements did not match: One element was str and another was i32
In my opinion NULL should be used to represent missing values in R's list, and currently NULL works properly in Polars as well, being a Series of type null. is there any reason why you would want to use NA instead of NULL?
> polars::as_polars_series(list(1, 2, NULL, 3))
polars Series: shape: (4,)
Series: '' [list[f64]]
[
[1.0]
[2.0]
[]
[3.0]
]
Edit: The implementation of neo-r-polars would look like this. Perhaps this is preferable?
> as_polars_series(list(1, 2, NULL, 3))
shape: (4,)
Series: '' [list[f64]]
[
[1.0]
[2.0]
null
[3.0]
]
Now works on the next
branch.
library(neopolars)
as_polars_series(list(1, NULL, 1L), strict = TRUE)
#> Error in `as_polars_series()`:
#> ! Evaluation failed.
#> Caused by error:
#> ! If `strict = TRUE`, all elements of the list except `NULL` must have the same datatype. expected: `f64`, got: `i32` at index: 3
as_polars_series(list(1, NULL, 1L))
#> shape: (3,)
#> Series: '' [list[f64]]
#> [
#> [1.0]
#> null
#> [1.0]
#> ]
Created on 2024-09-06 with reprex v2.1.1
Probably the safest choice for most users is to use vctrs::list_of()
.
Does it work with NA as in the example of the first post?
No, only NULL
s are treated specially; use strict=FALSE
to cast NA
to NA_integer_
.
I was initially going to treat only the polars' Null
data type specially, but gave up when I realized that when passing a vector of Series to the Series::new()
function, if the first element is Null
, all subsequent Series will be cast to Null
and the values will all be null
.
https://github.com/pola-rs/r-polars/blob/5d5be1f4e1c2cb96ed1450fb270a14a695fc5612/src/rust/src/series/construction.rs#L104-L113
Since even Series of length 0 and None
are treated differently in this way, it is quite distorted to give special treatment to NA
s that fall under Series of length 1.
No, only NULLs are treated specially; use strict=FALSE to cast NA to NA_integer_.
In this case, could we have strict=FALSE
by default? I don't think many average users know the different variants of NA
and want to use NA
whatever the datatype
In this case, could we have
strict=FALSE
by default? I don't think many average users know the different variants ofNA
and want to useNA
whatever the datatype
Of course it would be better that way.
vctrs::list_of()
also automatically casts the super type by default.
https://github.com/pola-rs/r-polars/blob/5d5be1f4e1c2cb96ed1450fb270a14a695fc5612/R/as_polars_series.R#L274
The situation is so different from Python that perhaps the default value should be FALSE
even if this argument is to be exposed to pl$Series()
(it is not currently exposed).