fuzzyjoin

Matching does not work with list-columns?

Open turbanisch opened this issue 2 years ago • 0 comments

I was trying to match dates to intervals, first as a toy example with integers, then with actual dates. With integer ranges instead of date intervals, I have to store the ranges in a list-column in the first tibble (df_int), and the joined data frame contains no rows. With actual dates and lubridate intervals (i.e., without a list-column), everything works as expected and the joined data frame is not empty.

Is the list-column causing this behavior, or is my code flawed? If it is the list-column, it would be great if this functionality could be enabled or, failing that, if a warning were shown.

library(tidyverse)
library(lubridate, warn.conflicts = FALSE)
library(fuzzyjoin)

# set up with list column
df_int <- tibble(
  video = c("A", "B", "B"),
  promo_interval = list(1:4, 3:5, 7:9)
)

df_dates <- tibble(
  video = c("A", "A", "A", "B"),
  views = c(234, 235, 166, 435),
  date = c(2, 3, 7, 7)
)

# join yields empty dataframe
fuzzy_inner_join(
  df_dates,
  df_int, 
  by = c("video", "date" = "promo_interval"),
  match_fun = list(`==`, `%in%`)
)
#> # A tibble: 0 × 5
#> # … with 5 variables: video.x <chr>, views <dbl>, date <dbl>, video.y <chr>,
#> #   promo_interval <list>

# with dates/intervals (no list-column)
df_int <- tibble(
  video = c("A", "B", "B"),
  promo_interval = c(interval("20220501", "20220504"), 
                     interval("20220503", "20220505"), 
                     interval("20220507", "20220509"))
)

df_dates <- tibble(
  video = c("A", "A", "A", "B"),
  views = c(234, 235, 166, 435),
  date = c(ymd("20220502"), 
           ymd("20220503"), 
           ymd("20220507"), 
           ymd("20220507"))
)

# join results as expected
fuzzy_inner_join(
  df_dates,
  df_int, 
  by = c("video", "date" = "promo_interval"),
  match_fun = list(`==`, `%within%`)
)
#> # A tibble: 3 × 5
#>   video.x views date       video.y promo_interval                
#>   <chr>   <dbl> <date>     <chr>   <Interval>                    
#> 1 A         234 2022-05-02 A       2022-05-01 UTC--2022-05-04 UTC
#> 2 A         235 2022-05-03 A       2022-05-01 UTC--2022-05-04 UTC
#> 3 B         435 2022-05-07 B       2022-05-07 UTC--2022-05-09 UTC

Created on 2022-05-28 by the reprex package (v2.0.1)
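
One possible explanation, sketched in base R only: the fuzzyjoin documentation describes match_fun as a vectorized function that takes two columns and returns TRUE or FALSE for each pair, whereas `%in%` tests membership in its whole right-hand side and does not look inside list elements. The helper name in_range below is hypothetical and introduced purely for illustration; it is not part of fuzzyjoin or lubridate.

```r
# Values taken from the integer example above
date <- c(2, 7, 7)
promo_interval <- list(1:4, 3:5, 7:9)

# `%in%` treats the list as a single lookup table and compares each date
# against whole list elements, so no date is ever found
plain_in <- date %in% promo_interval

# Hypothetical pairwise helper: compares x[i] with ranges[[i]] only,
# and returns one logical value per pair (assumes equal lengths)
in_range <- function(x, ranges) {
  vapply(
    seq_along(x),
    function(i) x[[i]] %in% ranges[[i]],
    logical(1)
  )
}

pairwise_in <- in_range(date, promo_interval)

print(plain_in)
print(pairwise_in)
```

If this reading is right, a pairwise function of this shape might work as the second element of match_fun in place of `%in%`. That is an assumption and has not been run against fuzzy_inner_join here.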

turbanisch, May 28 '22 11:05