purrr
purrr copied to clipboard
Implement `map_vec()`
This PR implements map_vec()
, a vctrs powered map()
that attempts to simplify the output automatically by computing the common type of the result elements. If a common type can be found, the result will be simplified to that type. If not, a list is returned.
This is a draft PR for now as I'm not currently proposing that it be merged. I'd like to get feedback and see what others think about it.
- [ ] Add tests
- [ ] Add examples
This function maintains the invariant that:
vec_size(.x) == vec_size(map_vec(.x))
After glancing at the issues, it seems this PR touches: #679, #633, #472, #376
For now, here are a number of examples that demonstrate the features of map_vec()
:
library(vctrs)
library(purrr)
library(tibble)
options(rlang_backtrace_on_error = "reminder")
# Auto simplified to an integer vector
map_vec(1:2, ~ .x)
#> [1] 1 2
# Prevent simplification with `.ptype = list()`
map_vec(1:2, ~ .x, .ptype = list())
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 2
# If a `.ptype` is specified, and casting to that type
# is not possible, an error is raised
map_vec(1:2, ~ .x, .ptype = factor())
#> Can't cast <integer> to <factor<>>
# If simplification is possible, all elements must have size 1,
# otherwise an error is raised
map_vec(1:2, ~ if (.x == 1L) 1:2 else 3)
#> Result 1 must be a single double, not an integer vector of length 2
# If you use `.ptype = list()`, this is pseudo-relaxed
# (The size of the output is still the same as the
# size of the input (2), but each inner element can vary in size)
map_vec(1:2, ~ if (.x == 1L) 1:2 else 3, .ptype = list())
#> [[1]]
#> [1] 1 2
#>
#> [[2]]
#> [1] 3
# The best thing about `map_vec()` is its flexibility with other
# non-atomic types, for example, simplifying to a date vector
map_vec(1:2, ~ Sys.Date() + .x)
#> [1] "2019-06-19" "2019-06-20"
# If a common type cannot be determined, a list is returned
map_vec(list(1, "x"), ~ .x)
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] "x"
# Note that just because a common type isn't found doesn't mean you
# can't still forcibly cast to a certain type.
map_vec(list(1, "x"), ~ .x, .ptype = character())
#> [1] "1" "x"
# The underlying `vec_c()` engine knows how to combine
# data frame output, but this isn't a `map_dfr()` replacement
map_vec(1:2, ~ tibble(x = .x))
#> # A tibble: 2 x 1
#> x
#> <int>
#> 1 1
#> 2 2
# You can enforce the structure of the data frame output with a ptype.
# This has the same result as before but coerces the integers to characters
ptype <- tibble(x = character())
map_vec(1:2, ~ tibble(x = .x), .ptype = ptype)
#> # A tibble: 2 x 1
#> x
#> <chr>
#> 1 1
#> 2 2
# And this errors
map_vec(1:2, ~ tibble(x = .x, y = 1), .ptype = ptype)
#> Lossy cast from <tbl_df<
#> x: integer
#> y: double
#> >> to <tbl_df<x:character>>.
#> Dropped variables: `y`
# Or you can enforce a partial structure with a partial_frame()
# This ensures that a `y` column exists
partial_ptype <- partial_frame(y = numeric())
map_vec(1:2, ~ tibble(x = .x), .ptype = partial_ptype)
#> # A tibble: 2 x 2
#> x y
#> <int> <dbl>
#> 1 1 NA
#> 2 2 NA
# Turning on strict mode forces the user to provide a .ptype
withr::with_options(
list(vctrs.no_guessing = TRUE),
map_vec(1:5, ~ .x)
)
#> Error: strict mode is activated; you must supply complete `.ptype`.
A few questions:
- Should a
.name_repair
argument be passed tovec_c()
? - If a
.ptype
is specified andvec_c()
fails because it cannot cast to the.ptype
, we currently get a vctrs cast error. Should that be more informative? Can it be (we don't have the index where the cast failure occured)?
Other things vctrs can help with:
- Using
vec_rbind()
inmap_dfr()
- Using
vec_cbind()
inmap_dfc()
- Possibly powering
map_dbl()
and friends by just doingmap_vec(.ptype = double())
? - Altering the current recycling rules in purrr. Current thinking suggests that this should produce an error
purrr::map2(1:2, integer(), ~ .x)
Last time we talked about this with Hadley, the idea was to have funs::s()
to simplify an output. Then our sapply would be s(map(...))
. This approach seems better than multiplying the number of functions in purrr / other packages as it is more composable.
Similarly, map_int(...)
could be int(map(...))
. However the typed variants are more efficient because they can omit the intermediate representation as a list. In general, performance is a reason to provide variants.
Does this suggest we should have map_vec()
in addition to s(map())
if it can help performance? The prototype should be computed in advance and then we'd build the vector and vec_slice into it. This would require exporting some of the C API from vctrs.
I see that mention of funs::s()
here. I guess it would be like s(x, ptype = NULL)
?
I am a bit worried about performance here. If purrr was being implemented from scratch today, I feel like a C version of map_vec()
would still exist just for its performance gains. I think even map_int()
and some of the more common variants might even still exist because it is so much easier to type than map_vec(x, fun, .ptype = int())
when you want to guarantee that the result is an integer + have the performance.
I do think generally funs::s()
will be a nice solution to not multiply the number of these kinds of variants in other packages, but purrr feels like an exception.
Thought: How can we compute the prototype of the output in advance if we need all elements of the output to determine the common type? Unless, .ptype
is supplied, we can't? Which means 2 allocations are needed, one for a list to contain the n
results, and then one for the common type container of size n
.
Superseded by #894