dplyr
`nest_join()` upgrades
While exploring `nest_join()` recently, I've decided that it lacks some features that would make it more powerful and generic. They are:
- `nest = c("y", "x", "both")`, an argument that would decide which inputs should be nested. Right now, we can only get the behavior of `nest = "y"`, but `nest = "both"` is particularly useful to replace something like `left_join(group_nest(x, g), group_nest(y, g), by = join_by(g))`. It would interact with the `name = NULL` argument:
  - `nest = "y"` or `nest = "x"`: `name = NULL` autonames based on the relevant input, otherwise it can be a single string
  - `nest = "both"`: `name = NULL` autonames based on both inputs, otherwise it must be a character vector of length 2
- `type = c("left", "right", "inner", "full")`, an argument that describes which type of join you'd like to do. Right now, `nest_join()` really only has the semantics of a left join (keep all rows from `x`, drop unmatched rows from `y`). I think it would be pretty nice to be able to do all types of joins.
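For reference, here is a minimal sketch of what `nest_join()` does today, which corresponds to the proposed `nest = "y"` combined with `type = "left"`. The tibbles `x` and `y` and the column name `y_rows` are made up for illustration.

```r
library(dplyr)

# Illustrative inputs (not from the issue): key 3 only exists in `x`,
# key 4 only exists in `y`.
x <- tibble(g = c(1, 2, 3), a = c("p", "q", "r"))
y <- tibble(g = c(1, 1, 4), b = c(10, 20, 30))

# Current behaviour: every row of `x` is kept, and the matching rows of `y`
# are nested into a single list-column whose name is set by `name`.
out <- nest_join(x, y, by = join_by(g), name = "y_rows")

# One row per row of `x`
nrow(out)

# Number of matching `y` rows per row of `x`; the `y` row with g = 4 is dropped
vapply(out$y_rows, nrow, integer(1))
```

Only `y` ends up as a list-column here, and there is no way to keep the unmatched `g = 4` row, which is what the two proposed arguments are meant to address.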
The context is ivs, where it is often useful to apply something like `iv_set_intersect()` between columns across multiple tables, by key. You need `nest = "both"` to get two list-columns, and then you can use `rowwise() %>% summarise(iv = iv_set_intersect(x$iv, y$iv))` to apply the function per key.
This is the general idea that I want to make easier:
library(dplyr)
library(ivs)
library(clock)
needles <- tibble(
  g = c(1, 1, 1, 1, 2, 2, 2),
  lower = date_build(2019, 1, c(1, 5, 10, 12, 2, 8, 9)),
  upper = date_build(2019, 1, c(3, 7, 11, 14, 3, 11, 10))
) %>%
  mutate(range = iv(lower, upper), .keep = "unused")
haystack <- tibble(
  g = c(1, 1, 1, 2, 2),
  lower = date_build(2019, 1, c(2, 4, 12, 6, 14)),
  upper = date_build(2019, 1, c(8, 11, 15, 9, 16))
) %>%
  mutate(range = iv(lower, upper), .keep = "unused")
needles
#> # A tibble: 7 × 2
#>       g                    range
#>   <dbl>               <iv<date>>
#> 1     1 [2019-01-01, 2019-01-03)
#> 2     1 [2019-01-05, 2019-01-07)
#> 3     1 [2019-01-10, 2019-01-11)
#> 4     1 [2019-01-12, 2019-01-14)
#> 5     2 [2019-01-02, 2019-01-03)
#> 6     2 [2019-01-08, 2019-01-11)
#> 7     2 [2019-01-09, 2019-01-10)
haystack
#> # A tibble: 5 × 2
#>       g                    range
#>   <dbl>               <iv<date>>
#> 1     1 [2019-01-02, 2019-01-08)
#> 2     1 [2019-01-04, 2019-01-11)
#> 3     1 [2019-01-12, 2019-01-15)
#> 4     2 [2019-01-06, 2019-01-09)
#> 5     2 [2019-01-14, 2019-01-16)
nested_needles <- needles %>%
  group_by(g) %>%
  group_nest(.key = "needles")
nested_haystack <- haystack %>%
  group_by(g) %>%
  group_nest(.key = "haystack")
nested_needles
#> # A tibble: 2 × 2
#>       g            needles
#>   <dbl> <list<tibble[,1]>>
#> 1     1            [4 × 1]
#> 2     2            [3 × 1]
nested_haystack
#> # A tibble: 2 × 2
#>       g           haystack
#>   <dbl> <list<tibble[,1]>>
#> 1     1            [3 × 1]
#> 2     2            [2 × 1]
left_join(nested_needles, nested_haystack, by = join_by(g)) %>%
  rowwise(g) %>%
  summarise(range = iv_set_intersect(needles$range, haystack$range), .groups = "drop")
#> # A tibble: 5 × 2
#>       g                    range
#>   <dbl>               <iv<date>>
#> 1     1 [2019-01-02, 2019-01-03)
#> 2     1 [2019-01-05, 2019-01-07)
#> 3     1 [2019-01-10, 2019-01-11)
#> 4     1 [2019-01-12, 2019-01-14)
#> 5     2 [2019-01-08, 2019-01-09)
Created on 2022-10-11 with reprex v2.0.2.9000
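Until something like this exists in `nest_join()`, the nest-both-then-join pattern can be wrapped in a small helper. This is only a rough sketch built from existing dplyr verbs, not the proposed implementation: `nest_join_both()` is a hypothetical name, it takes the key as a single string via the made-up `key` argument, and it does no autonaming.

```r
library(dplyr)

# Hypothetical helper, not part of dplyr: nest both inputs by `key`, then join
# the nested tables with the join verb selected by `type`.
nest_join_both <- function(x, y, key, name = c("x", "y"),
                           type = c("left", "right", "inner", "full")) {
  type <- match.arg(type)
  stopifnot(is.character(name), length(name) == 2L)

  join <- switch(
    type,
    left = left_join,
    right = right_join,
    inner = inner_join,
    full = full_join
  )

  # One row per key, with the remaining columns in a list-column
  nested_x <- x %>%
    group_by(across(all_of(key))) %>%
    group_nest(.key = name[[1]])
  nested_y <- y %>%
    group_by(across(all_of(key))) %>%
    group_nest(.key = name[[2]])

  join(nested_x, nested_y, by = key)
}

# Illustrative inputs: key 1 only exists in `x`, key 3 only exists in `y`
x <- tibble(g = c(1, 1, 2), a = 1:3)
y <- tibble(g = c(2, 3), b = 4:5)

# Keys 1, 2 and 3 each get one row; a key that is absent from one input has a
# missing entry in that input's list-column rather than an empty tibble.
full <- nest_join_both(x, y, key = "g", type = "full")

# Only key 2 is present in both inputs
inner <- nest_join_both(x, y, key = "g", type = "inner")
```

With a helper like this, the pipeline above would start from `nest_join_both(needles, haystack, key = "g", name = c("needles", "haystack"))` instead of two separate `group_nest()` calls followed by `left_join()`.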
Another use case for `nest = "both"` comes from this post:
https://posit.co/blog/geospatial-distributed-processing-with-furrr/
Towards the end, it has:
# Split Data by Province
# Postal codes -- points
can_pcodes_split <- can_postal_codes %>%
  dplyr::filter(PROV != "Non Canadian") %>%
  dplyr::group_by(PROV) %>%
  dplyr::group_split()
# Economic regions -- polygons
can_eco_regions_split <- can_eco_regions %>%
  dplyr::group_by(PROV) %>%
  dplyr::group_split()
# Distribute the processing
future::plan(future::multisession, workers = 8)
furrr::future_map2(can_pcodes_split, can_eco_regions_split, sf::st_intersection) %>%
  dplyr::bind_rows()
I think that would look like this with the proposed `nest_join()`:
# Split Data by Province
# Postal codes -- points
can_postal_codes <- can_postal_codes %>%
  dplyr::filter(PROV != "Non Canadian")
# `nest = "both"` is the argument proposed above; it does not exist in dplyr yet
dplyr::nest_join(can_postal_codes, can_eco_regions, by = dplyr::join_by(PROV), nest = "both", name = c("x", "y")) %>%
  dplyr::rowwise() %>%
  dplyr::reframe(sf::st_intersection(x, y))
# Or if you want to use furrr still
dplyr::nest_join(can_postal_codes, can_eco_regions, by = dplyr::join_by(PROV), nest = "both", name = c("x", "y")) %>%
  dplyr::reframe({
    out <- furrr::future_map2(x, y, sf::st_intersection)
    purrr::list_rbind(out)
  })