
`nest_join()` upgrades

Open DavisVaughan opened this issue 3 years ago • 1 comments

In some recent exploration of nest_join(), I've decided that it is lacking some features that would make it more powerful and generic. They are:

  • nest = c("y", "x", "both"), an argument that decides which inputs are nested. Right now we can only get the behavior of nest = "y", but nest = "both" is particularly useful as a replacement for something like left_join(group_nest(x, g), group_nest(y, g), by = "g"). It would interact with the name = NULL argument as follows:

    • nest = "y" or nest = "x": name = NULL autonames based on the relevant input, otherwise can be a single string
    • nest = "both": name = NULL autonames based on both inputs, otherwise must be a character vector of length 2
  • type = c("left", "right", "inner", "full"), an argument that describes which type of join you'd like to do. Right now, nest_join() really only does the semantics of a left-join (keep all rows from x, drop unmatched rows from y). I think it would be pretty nice to be able to do all types of joins.
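
To make the two proposed arguments concrete, here is a minimal sketch that approximates nest = "both" combined with type using functions that exist today. nest_join_both() is a hypothetical helper, not part of dplyr, and it assumes by is a character vector of key columns shared by both inputs. It returns one row per key, which may differ in detail from whatever nest_join() would eventually do.

```r
library(dplyr)

# Hypothetical helper (not a dplyr function): approximates the proposed
# nest = "both" with tools that exist today. Assumes `by` is a character
# vector of key columns present in both inputs.
nest_join_both <- function(x, y, by,
                           type = c("left", "right", "inner", "full"),
                           name = c("x", "y")) {
  type <- match.arg(type)
  stopifnot(is.character(by), is.character(name), length(name) == 2)

  # One row per key, with the remaining columns packed into a list-column
  x_nested <- group_nest(x, across(all_of(by)), .key = name[1])
  y_nested <- group_nest(y, across(all_of(by)), .key = name[2])

  # Pick the mutating join that matches the requested type
  join <- switch(type,
    left = left_join, right = right_join,
    inner = inner_join, full = full_join
  )
  join(x_nested, y_nested, by = by)
}

# Toy inputs: key 1 exists only in x, key 4 only in y
x <- tibble(g = c(1, 1, 2, 3), a = 1:4)
y <- tibble(g = c(2, 3, 3, 4), b = 1:4)

left_both  <- nest_join_both(x, y, by = "g", type = "left")
inner_both <- nest_join_both(x, y, by = "g", type = "inner")
full_both  <- nest_join_both(x, y, by = "g", type = "full")
```

The type argument only changes which keys survive: the left variant keeps every key of x, the inner variant keeps keys present in both inputs, and the full variant keeps every key from either input.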


The context is ivs, where it is often useful to apply something like iv_set_intersect() between columns from multiple tables, key by key. You need nest = "both" to get two list-columns, and then you can apply rowwise() %>% summarise(iv = iv_set_intersect(x$iv, y$iv)) to the result to run the function once per key.

This is the general idea that I want to make easier:

library(dplyr)
library(ivs)
library(clock)

needles <- tibble(
  g = c(1, 1, 1, 1, 2, 2, 2),
  lower = date_build(2019, 1, c(1, 5, 10, 12, 2, 8, 9)),
  upper = date_build(2019, 1, c(3, 7, 11, 14, 3, 11, 10))
) %>%
  mutate(range = iv(lower, upper), .keep = "unused")

haystack <- tibble(
  g = c(1, 1, 1, 2, 2),
  lower = date_build(2019, 1, c(2, 4, 12, 6, 14)),
  upper = date_build(2019, 1, c(8, 11, 15, 9, 16))
) %>%
  mutate(range = iv(lower, upper), .keep = "unused")

needles
#> # A tibble: 7 × 2
#>       g                    range
#>   <dbl>               <iv<date>>
#> 1     1 [2019-01-01, 2019-01-03)
#> 2     1 [2019-01-05, 2019-01-07)
#> 3     1 [2019-01-10, 2019-01-11)
#> 4     1 [2019-01-12, 2019-01-14)
#> 5     2 [2019-01-02, 2019-01-03)
#> 6     2 [2019-01-08, 2019-01-11)
#> 7     2 [2019-01-09, 2019-01-10)
haystack
#> # A tibble: 5 × 2
#>       g                    range
#>   <dbl>               <iv<date>>
#> 1     1 [2019-01-02, 2019-01-08)
#> 2     1 [2019-01-04, 2019-01-11)
#> 3     1 [2019-01-12, 2019-01-15)
#> 4     2 [2019-01-06, 2019-01-09)
#> 5     2 [2019-01-14, 2019-01-16)

nested_needles <- needles %>%
  group_by(g) %>%
  group_nest(.key = "needles")

nested_haystack <- haystack %>%
  group_by(g) %>%
  group_nest(.key = "haystack")

nested_needles
#> # A tibble: 2 × 2
#>       g            needles
#>   <dbl> <list<tibble[,1]>>
#> 1     1            [4 × 1]
#> 2     2            [3 × 1]
nested_haystack
#> # A tibble: 2 × 2
#>       g           haystack
#>   <dbl> <list<tibble[,1]>>
#> 1     1            [3 × 1]
#> 2     2            [2 × 1]

left_join(nested_needles, nested_haystack, by = join_by(g)) %>%
  rowwise(g) %>%
  summarise(range = iv_set_intersect(needles$range, haystack$range), .groups = "drop")
#> # A tibble: 5 × 2
#>       g                    range
#>   <dbl>               <iv<date>>
#> 1     1 [2019-01-02, 2019-01-03)
#> 2     1 [2019-01-05, 2019-01-07)
#> 3     1 [2019-01-10, 2019-01-11)
#> 4     1 [2019-01-12, 2019-01-14)
#> 5     2 [2019-01-08, 2019-01-09)

Created on 2022-10-11 with reprex v2.0.2.9000
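
For contrast, a minimal sketch of what nest_join() returns today, on toy data shaped like needles and haystack. The tibbles needles_toy and haystack_toy are made up for illustration, with a plain value column standing in for the iv ranges. Only y ends up in a list-column and x keeps one row per original row, which is why the group_nest() step is needed in the workflow shown here.

```r
library(dplyr)

# Illustrative stand-ins for `needles` and `haystack`
needles_toy  <- tibble(g = c(1, 1, 2), value = c(10, 11, 20))
haystack_toy <- tibble(g = c(1, 2, 2, 3), value = c(100, 200, 201, 300))

# Current behavior: every row of x is kept unnested, and the matching rows
# of y are packed into a list-column named by `name`
current <- nest_join(needles_toy, haystack_toy, by = join_by(g), name = "haystack")

# Number of haystack rows attached to each needles row
matches_per_row <- vapply(current$haystack, nrow, integer(1))
```

The key g = 3 appears only in haystack_toy, so it is dropped, in line with the left-join semantics described in the proposal.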

DavisVaughan, Oct 11 '22 13:10

Another use case for nest = "both" is in this post: https://posit.co/blog/geospatial-distributed-processing-with-furrr/

Towards the end:

# Split Data by Province
# Postal codes -- points
can_pcodes_split <- can_postal_codes %>%
  dplyr::filter(PROV != "Non Canadian") %>%
  dplyr::group_by(PROV) %>%
  dplyr::group_split()

# Economic regions -- polygons
can_eco_regions_split <- can_eco_regions %>%
  dplyr::group_by(PROV) %>%
  dplyr::group_split()

# Distribute the processing
future::plan(multisession, workers = 8)

furrr::future_map2(can_pcodes_split, can_eco_regions_split, sf::st_intersection) %>%
  dplyr::bind_rows()

I think this would look like the following with the proposed nest_join():

# Split Data by Province
# Postal codes -- points
can_postal_codes <- can_postal_codes %>%
  dplyr::filter(PROV != "Non Canadian")

dplyr::nest_join(can_postal_codes, can_eco_regions, dplyr::join_by(PROV), nest = "both", name = c("x", "y")) %>%
  dplyr::rowwise() %>%
  dplyr::reframe(sf::st_intersection(x, y))

# Or if you want to use furrr still
dplyr::nest_join(can_postal_codes, can_eco_regions, dplyr::join_by(PROV), nest = "both", name = c("x", "y")) %>%
  dplyr::reframe({
    out <- furrr::future_map2(x, y, sf::st_intersection)
    purrr::list_rbind(out)
  })

DavisVaughan, Jan 23 '23 20:01