vctrs
vctrs copied to clipboard
Add coercion methods between `list()` and `list_of()`
Closes #1161
What is the richer type between list and list-of? Following the factor logic, the unconstrained type character
covers more values than factor
and is thus richer:
vec_c(factor("foo"), "bar")
#> [1] "foo" "bar"
However, factor
has been implemented with lax coercibility:
vec_c(factor("foo"), factor("bar"))
#> [1] foo bar
#> Levels: foo bar
This means you get consistent results when a character is thrown into the mix:
vec_c(factor("foo"), "bar", factor("baz"))
#> [1] "foo" "bar" "baz"
vec_c("foo", factor("bar"), factor("baz"))
#> [1] "foo" "bar" "baz"
vec_c(factor("bar"), factor("bar"), "baz")
#> [1] "bar" "bar" "baz"
list_of
is different, it has strict coercibility:
vec_c(list_of(1), list_of(""))
#> Error: Can't combine `..1` <double> and `..2` <character>.
so we can't consider list
to be the richer type. That would mean list-of could easily lose its restrictions and also combining incompatible list_of
values would no longer result in an error when a bare list is thrown into the mix:
# If we coerce list-of to list:
vec_c(list_of(1), list(TRUE), list_of(""))
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] TRUE
#>
#> [[3]]
#> [1] ""
For this reason, the richer type is list-of because its richer behaviour (strictness) is the important factor. By coercing lists to list-ofs, we get the expected error:
vec_c(list_of(1), list(TRUE), list_of(""))
#> Error: Can't combine `..1` <double> and `..3` <character>.
Note that if we generalise the base type fallback as planned in #1135, this means we'll lose this property of list-of with any foreign class that explicitly inherits from "list". This is my main concern with this fallback. Maybe we could introduce a generic vec_disable_fallback()
that could be optionally implemented to disable it.
My main concern is that we're leaning too strict here — is there some compelling reason that list_of()
should be fundamentally stricter than factor()
? I think your arguments are reasonable from a theoretical perspective, but I worry that pragmatically this is going to cause too much pain for R users, especially since the chances that they've heard of list_of
is slim. I wonder if our principle should be that you'll keep the "list of"-ness if you're careful, but if you don't care about it, you'll just get a list.
Maybe we need some option so that we can choose to be strict while working on packages (so list_of
is propagated everywhere we believe it should be), but when doing data analysis, you'll end up with simple lists if you don't take extra care?
I would be fine with lax lists, as long as they are consistently lax. For instance vec_c(list_of(1), list_of(""))
should then return a list. But what is the purpose for using a list_of()
e.g. in tidyr? Is it to keep information about the common type as long as the list is homogeneous?
I like the idea of having strict and lax variants. Similarly, a strict factor subclass could also make sense.
I think it's used in tidyr because it allows for some performance optimisations; so I think lax by default is probably the better bet.
One reason for list-of in tidyr is consistent chop/unchop roundtripping https://github.com/tidyverse/tidyr/blob/a170e55890cbebe257acf83103003e9f4f123dff/R/chop.R#L20-L25
Here is a dev tidyr example where the strictness currently comes into play. I think this current behavior has nice theoretical properties, but it might be too strict. I think I wouldn't mind if all 3 of these "worked" and gave the result we see from the first example (i.e. make list-of + list-of coercion lax).
library(tidyr)
library(vctrs)
# Mixed types while unnesting wider.
# This is fine because the lists aren't typed.
col1 <- list(
list(a = 1:2),
list(a = "x")
)
col2 <- list(
list(b = c("x", "y")),
list(b = 1)
)
df <- tibble(col1 = col1, col2 = col2)
df
#> # A tibble: 2 × 2
#> col1 col2
#> <list> <list>
#> 1 <named list [1]> <named list [1]>
#> 2 <named list [1]> <named list [1]>
unnest_wider(df, c(col1, col2))
#> # A tibble: 2 × 2
#> a b
#> <list> <list>
#> 1 <int [2]> <chr [2]>
#> 2 <chr [1]> <dbl [1]>
# Mixed list-of types.
# For `"a"`, can't combine list_of<integer> and list_of<character>
col1 <- list(
list_of(a = 1:2),
list_of(a = "x")
)
col2 <- list(
list_of(b = c("x", "y")),
list_of(b = 1)
)
df <- tibble(col1 = col1, col2 = col2)
df
#> # A tibble: 2 × 2
#> col1 col2
#> <list> <list>
#> 1 <list<int> [1]> <list<chr> [1]>
#> 2 <list<chr> [1]> <list<dbl> [1]>
unnest_wider(df, c(col1, col2))
#> Error: Can't combine <integer> and <character>.
# The above is essentially what happens here:
col <- list(
tibble(a = 1:2, b = c("x", "y")),
tibble(a = "x", b = 1)
)
df <- tibble(col = col)
df
#> # A tibble: 2 × 1
#> col
#> <list>
#> 1 <tibble [2 × 2]>
#> 2 <tibble [1 × 2]>
unnest_wider(df, col)
#> Error: Can't combine <integer> and <character>.
Created on 2021-11-07 by the reprex package (v2.0.1)
This also showed up in a revdep https://github.com/ffverse/ffscrapr/pull/344
The problem essentially boiled down to the fact that we can't currently combine list + list_of<integer>
even if the list
has compatible elements.
library(tidyr)
library(vctrs)
col <- list(
list(a = 1:2),
list_of(a = 1L)
)
df <- tibble(col = col)
df
#> # A tibble: 2 × 1
#> col
#> <list>
#> 1 <named list [1]>
#> 2 <list<int> [1]>
unnest_wider(df, col)
#> Error: Can't combine `..1$a` <list> and `..3$a` <list_of<integer>>.
This should "work", but, as Lionel mentioned, the direction depends on how strict the coercion ends up being:
- If strict
list_of<int> + list_of<chr> = error
coercion, thenlist + list_of<int> = list_of<int>
if possible, otherwise error - If lax
list_of<int> + list_of<chr> = list
coercion, thenlist + list_of<int> = list
If strict, list + list_of
= list_of if possible, otherwise error
As I've just realised with @mgirlich, this would mean that the common type is not the richer type, and then vec_cast()
might fail when vec_ptype2()
succeeds.
and then vec_cast() might fail when vec_ptype2() succeeds.
oooh I don't like that if we can avoid it
Looking at this again, I think I'd be fine with this kind of lax list:
list_of<int> + list_of<int> = list_of<int> # same inner type
list_of<int> + list_of<dbl> = list_of<dbl> # inner type coercion is possible
list_of<int> + list_of<chr> = list / possibly error # inner type coercion is not possible - i.e. using vec_is_coercible
list_of<int> + list = list # use richer type of list
I can't decide if I want list_of<int> + list_of<chr>
to be an error or not. There is something nice about saying that this coercion is just delegated to whether or not the inner types have a common type or not. But, if it is an error, then you get non-symmetric results with things like this (as you mentioned at the beginning of this PR):
vec_c(list_of<int>, list, list_of<chr>) = list
vec_c(list_of<int>, list_of<chr>, list) = error
And that's pretty bad, so I don't think we can do that
yup lax list-ofs don't have this problem because they coerce to the richer type. I'll finish this PR before release too.
Highly related to changing tidyr nest()
to return list-ofs. The current coercion rules are a little too strict for us to be able to do this, but it would be nice because it would allow roundtriping nest/unnest on empty tibbles https://github.com/tidyverse/tidyr/issues/1393
Sounds good I'll get to it this week.