vctrs
vctrs copied to clipboard
`vec_locate_runs()` and `vec_partition()`
vec_locate_runs()
finds the locations of the last values of consecutive runs in a vector:
vec_detect_runs <- function(x) {
n <- vec_size(x)
if (!n) {
return(int())
}
diffs <- vec_slice(x, -1L) != vec_slice(x, -n)
c(diffs, TRUE)
}
vec_locate_runs <- function(x) {
which(vec_detect_runs(x))
}
x <- sample(1:2, 10, TRUE)
x
#> [1] 2 2 1 1 1 1 2 1 1 1
vec_locate_runs(x)
#> [1] 2 6 7 10
vec_locate_runs(rep(1, 10))
#> [1] 10
vec_locate_runs(c(0, rep(1, 9)))
#> [1] 1 10
vec_locate_runs(c(rep(1, 9), 0))
#> [1] 9 10
vec_locate_runs(1:2)
#> [1] 1 2
vec_partition()
partitions a vector into a list based on pivotal values:
seq0 <- function(from, to) {
if (length(from) != 1) {
abort("`from` must be length one")
}
if (length(to) != 1) {
abort("`to` must be length one")
}
if (from > to) {
abort("Can't create decreasing sequence")
}
seq.int(from, to)
}
vec_partition <- function(x, runs) {
runs <- vec_cast(runs, integer())
runs_size <- length(runs)
if (!runs_size) {
return(list(x))
}
x_size <- vec_size(x)
runs <- sort(runs)
last_run <- runs[[runs_size]]
if (last_run > x_size) {
abort("`runs` can't exceed the vector size.")
}
if (last_run != x_size) {
# Ensure a full partition of `x`
runs <- c(runs, x_size)
}
out <- vec_init(list(), runs_size)
first <- 1L
for (i in seq_along(runs)) {
last <- runs[[i]]
out[[i]] <- vec_slice(x, seq0(first, last))
first <- last + 1L
}
out
}
vec_partition(1:10, c(1, 10))
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 2 3 4 5 6 7 8 9 10
vec_partition(1:10, c(1, 9))
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 2 3 4 5 6 7 8 9
#>
#> [[3]]
#> [1] 10
vec_partition(1:10, c(2, 9))
#> [[1]]
#> [1] 1 2
#>
#> [[2]]
#> [1] 3 4 5 6 7 8 9
#>
#> [[3]]
#> [1] 10
It can be used with vec_locate_runs()
to partition the runs in a vector:
x <- c("foo", "foo", "quux", "baz", "baz", "foo", "bar")
vec_partition(x, vec_locate_runs(x))
#> [[1]]
#> [1] "foo" "foo"
#>
#> [[2]]
#> [1] "quux"
#>
#> [[3]]
#> [1] "baz" "baz"
#>
#> [[4]]
#> [1] "foo"
#>
#> [[5]]
#> [1] "bar"
Compare to vec_split()
:
vec_split(x, x)$val
#> [[1]]
#> [1] "foo" "foo" "foo"
#>
#> [[2]]
#> [1] "quux"
#>
#> [[3]]
#> [1] "baz" "baz"
#>
#> [[4]]
#> [1] "bar"
Related to vec_identify_runs()
in https://github.com/r-lib/vctrs/issues/1081
I could see vec_locate_runs(x, at = c("first", "last"))
where "first"
returns the first location in the current run, rather than the last. "last"
might be more practically useful though, as seen in vec_partition()
See also: https://github.com/r-lib/vctrs/pull/1243#discussion_r484970437
Which mentions:
-
vec_partition(x, by)
likevec_split(x, by)
-
vec_partition_at(x, loc)
rather thanvec_partition(x, runs)
here (whereloc
here might make more sense as starts?)
Would this make more sense as vec_split_runs()
?
Oh interesting. Yea, I've been gathering functions that split their input and return a 2-col data frame in the google sheet under vec_split_*()
, and I think this fits. So maybe:
-
vec_split_runs(x, by = x)
-
vec_partition(x, at / loc)
Partitioning by providing start points still seems more intuitive to me for some reason.
I have a use case for vec_locate_runs(x, start = TRUE/FALSE)
so I think we should export that and vec_detect_runs()
now (both already in vctrs through #1248, just need to be exported).
I think we should make the start
argument named the same way as the argument we plan to add to vec_unique()
that determines whether the first or last unique value is returned (https://github.com/r-lib/vctrs/issues/1442). In theory we could have this set of functions:
vec_unique(x, which = c("first", "last"))
vec_locate_unique(x, which = c("first", "last"))
vec_locate_runs(x, which = c("first", "last"))
vec_detect_unique(x, which = c("first", "last"))
vec_detect_runs(x, which = c("first", "last"))
vec_identify_unique(x) # no need for this really
vec_identify_runs(x)
vec_locate_unique_groups(x) # right now this is vec_group_loc()
vec_locate_run_groups(x)
Might as well do vec_partition()
while we are at it, since that seems useful and would be useful as a replacement for the idea from #1643
I've realized that it might be way easier to understand if vec_partition()
took sizes
rather than run starts
or ends
. That is way less ambiguous, and composes well with list_unchop()
when you want to: unchop, apply a function, rechop.
Hadley had a stringr use case that was kind of like
xs <- list("a", c("b", "c"), c("d", "e", "f"))
x <- list_unchop(xs)
x <- toupper(x)
x
sizes <- list_sizes(xs)
by <- vec_rep_each(seq_along(sizes), times = sizes)
vec_split(x, by)
# instead:
# vec_partition(x, sizes)
Concrete semantics proposal:
library(vctrs)
vec_partition <- function(x, sizes) {
size <- vec_size(x)
if (size != sum(sizes)) {
abort("`sizes` must add up to size of `x`.")
}
starts <- c(1L, sizes)
starts <- cumsum(starts)[-length(starts)]
starts <- starts - 1L # 0-based
vctrs:::vec_chop_seq(x, starts, sizes)
}
vec_count_runs <- function(x) {
vec_count(vec_identify_runs(x), sort = "location")$count
}
Giving us this for the original example:
x <- c("foo", "foo", "quux", "baz", "baz", "foo", "bar")
vec_count_runs(x)
#> [1] 2 1 2 1 1
vec_partition(x, vec_count_runs(x))
#> [[1]]
#> [1] "foo" "foo"
#>
#> [[2]]
#> [1] "quux"
#>
#> [[3]]
#> [1] "baz" "baz"
#>
#> [[4]]
#> [1] "foo"
#>
#> [[5]]
#> [1] "bar"
And this for Hadley's case:
xs <- list("a", c("b", "c"), c("d", "e", "f"))
x <- list_unchop(xs)
x <- toupper(x)
sizes <- list_sizes(xs)
vec_partition(x, sizes)
#> [[1]]
#> [1] "A"
#>
#> [[2]]
#> [1] "B" "C"
#>
#> [[3]]
#> [1] "D" "E" "F"
There is probably still something useful about vec_locate_runs()
, in the same way that there is something useful about vec_locate_unique()
. Being able to vec_slice()
the start/end of the run is likely very useful. But I think using counts for this chop operation is probably best.
If vec_partition()
is potentially going to be used as follow up to list_unchop()
, consider that it may be useful to call it something with "chop" in the name for symmetry. Like vec_chop_runs()
or vec_chop_sizes()
.