vctrs icon indicating copy to clipboard operation
vctrs copied to clipboard

`vec_locate_runs()` and `vec_partition()`

Open lionel- opened this issue 3 years ago • 7 comments

vec_locate_runs() finds the locations of the last values of consecutive runs in a vector:

vec_detect_runs <- function(x) {
  n <- vec_size(x)
  if (!n) {
    return(int())
  }

  diffs <- vec_slice(x, -1L) != vec_slice(x, -n)
  c(diffs, TRUE)
}
vec_locate_runs <- function(x) {
  which(vec_detect_runs(x))
}

x <- sample(1:2, 10, TRUE)
x
#>  [1] 2 2 1 1 1 1 2 1 1 1

vec_locate_runs(x)
#> [1]  2  6  7 10

vec_locate_runs(rep(1, 10))
#> [1] 10
vec_locate_runs(c(0, rep(1, 9)))
#> [1]  1 10
vec_locate_runs(c(rep(1, 9), 0))
#> [1]  9 10
vec_locate_runs(1:2)
#> [1] 1 2

vec_partition() partitions a vector into a list based on pivotal values:

seq0 <- function(from, to) {
  if (length(from) != 1) {
    abort("`from` must be length one")
  }
  if (length(to) != 1) {
    abort("`to` must be length one")
  }

  if (from > to) {
    abort("Can't create decreasing sequence")
  }

  seq.int(from, to)
}

vec_partition <- function(x, runs) {
  runs <- vec_cast(runs, integer())

  runs_size <- length(runs)
  if (!runs_size) {
    return(list(x))
  }

  x_size <- vec_size(x)
  runs <- sort(runs)

  last_run <- runs[[runs_size]]
  if (last_run > x_size) {
    abort("`runs` can't exceed the vector size.")
  }
  if (last_run != x_size) {
    # Ensure a full partition of `x`
    runs <- c(runs, x_size)
  }

  out <- vec_init(list(), runs_size)
  first <- 1L

  for (i in seq_along(runs)) {
    last <- runs[[i]]
    out[[i]] <- vec_slice(x, seq0(first, last))
    first <- last + 1L
  }

  out
}

vec_partition(1:10, c(1, 10))
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1]  2  3  4  5  6  7  8  9 10

vec_partition(1:10, c(1, 9))
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 2 3 4 5 6 7 8 9
#>
#> [[3]]
#> [1] 10

vec_partition(1:10, c(2, 9))
#> [[1]]
#> [1] 1 2
#>
#> [[2]]
#> [1] 3 4 5 6 7 8 9
#>
#> [[3]]
#> [1] 10

It can be used with vec_locate_runs() to partition the runs in a vector:

x <- c("foo", "foo", "quux", "baz", "baz", "foo", "bar")

vec_partition(x, vec_locate_runs(x))
#> [[1]]
#> [1] "foo" "foo"
#>
#> [[2]]
#> [1] "quux"
#>
#> [[3]]
#> [1] "baz" "baz"
#>
#> [[4]]
#> [1] "foo"
#>
#> [[5]]
#> [1] "bar"

Compare to vec_split():

vec_split(x, x)$val
#> [[1]]
#> [1] "foo" "foo" "foo"
#>
#> [[2]]
#> [1] "quux"
#>
#> [[3]]
#> [1] "baz" "baz"
#>
#> [[4]]
#> [1] "bar"

lionel- avatar Aug 06 '20 08:08 lionel-

Related to vec_identify_runs() in https://github.com/r-lib/vctrs/issues/1081

I could see vec_locate_runs(x, at = c("first", "last")) where "first" returns the first location in the current run, rather than the last. "last" might be more practically useful though, as seen in vec_partition()

DavisVaughan avatar Sep 02 '20 15:09 DavisVaughan

See also: https://github.com/r-lib/vctrs/pull/1243#discussion_r484970437

Which mentions:

  • vec_partition(x, by) like vec_split(x, by)

  • vec_partition_at(x, loc) rather than vec_partition(x, runs) here (where loc here might make more sense as starts?)

DavisVaughan avatar Sep 08 '20 14:09 DavisVaughan

Would this make more sense as vec_split_runs()?

lionel- avatar Sep 08 '20 14:09 lionel-

Oh interesting. Yea, I've been gathering functions that split their input and return a 2-col data frame in the google sheet under vec_split_*(), and I think this fits. So maybe:

  • vec_split_runs(x, by = x)

  • vec_partition(x, at / loc)

Partitioning by providing start points still seems more intuitive to me for some reason.

DavisVaughan avatar Sep 08 '20 15:09 DavisVaughan

I have a use case for vec_locate_runs(x, start = TRUE/FALSE) so I think we should export that and vec_detect_runs() now (both already in vctrs through #1248, just need to be exported).

I think we should make the start argument named the same way as the argument we plan to add to vec_unique() that determines whether the first or last unique value is returned (https://github.com/r-lib/vctrs/issues/1442). In theory we could have this set of functions:

vec_unique(x, which = c("first", "last"))

vec_locate_unique(x, which = c("first", "last"))
vec_locate_runs(x, which = c("first", "last"))

vec_detect_unique(x, which = c("first", "last"))
vec_detect_runs(x, which = c("first", "last"))

vec_identify_unique(x) # no need for this really
vec_identify_runs(x)

vec_locate_unique_groups(x) # right now this is vec_group_loc()
vec_locate_run_groups(x)

Might as well do vec_partition() while we are at it, since that seems useful and would be useful as a replacement for the idea from #1643

DavisVaughan avatar Sep 06 '22 16:09 DavisVaughan

I've realized that it might be way easier to understand if vec_partition() took sizes rather than run starts or ends. That is way less ambiguous, and composes well with list_unchop() when you want to: unchop, apply a function, rechop.

Hadley had a stringr use case that was kind of like

xs <- list("a", c("b", "c"), c("d", "e", "f"))

x <- list_unchop(xs)
x <- toupper(x)
x

sizes <- list_sizes(xs)
by <- vec_rep_each(seq_along(sizes), times = sizes)
vec_split(x, by)

# instead:
# vec_partition(x, sizes)

DavisVaughan avatar Oct 03 '22 22:10 DavisVaughan

Concrete semantics proposal:

library(vctrs)

vec_partition <- function(x, sizes) {
  size <- vec_size(x)
  
  if (size != sum(sizes)) {
    abort("`sizes` must add up to size of `x`.")
  }
  
  starts <- c(1L, sizes)
  starts <- cumsum(starts)[-length(starts)]
  starts <- starts - 1L # 0-based
  
  vctrs:::vec_chop_seq(x, starts, sizes)
}

vec_count_runs <- function(x) {
  vec_count(vec_identify_runs(x), sort = "location")$count
}

Giving us this for the original example:

x <- c("foo", "foo", "quux", "baz", "baz", "foo", "bar")

vec_count_runs(x)
#> [1] 2 1 2 1 1

vec_partition(x, vec_count_runs(x))
#> [[1]]
#> [1] "foo" "foo"
#> 
#> [[2]]
#> [1] "quux"
#> 
#> [[3]]
#> [1] "baz" "baz"
#> 
#> [[4]]
#> [1] "foo"
#> 
#> [[5]]
#> [1] "bar"

And this for Hadley's case:

xs <- list("a", c("b", "c"), c("d", "e", "f"))

x <- list_unchop(xs)
x <- toupper(x)

sizes <- list_sizes(xs)
vec_partition(x, sizes)
#> [[1]]
#> [1] "A"
#> 
#> [[2]]
#> [1] "B" "C"
#> 
#> [[3]]
#> [1] "D" "E" "F"

There is probably still something useful about vec_locate_runs(), in the same way that there is something useful about vec_locate_unique(). Being able to vec_slice() the start/end of the run is likely very useful. But I think using counts for this chop operation is probably best.

If vec_partition() is potentially going to be used as follow up to list_unchop(), consider that it may be useful to call it something with "chop" in the name for symmetry. Like vec_chop_runs() or vec_chop_sizes().

DavisVaughan avatar Oct 04 '22 13:10 DavisVaughan