funneljoin icon indicating copy to clipboard operation
funneljoin copied to clipboard

Some suggestions

Open ramnathv opened this issue 6 years ago • 3 comments

@robinsones First off, great package! It makes creating and analyzing funnels clean and easy. Based on my navigation of the package's API, I had some suggestions:

You might want to make landed and registered datasets in the package so users can get started with the examples without having to define them. This can be extended to define out-of-memory versions of the data by taking advantage of dbplyr::src_dbi.

The notion of a join type is great and factors in the multiple scenarios that one might run into. However, I had trouble visualizing the execution and how each type results in a different output compared to the pure version of the join function. So I put together a helper function that allows one to visualize the differences. Based on this, here is my understanding of the working of after_join:

  1. Join the tables using the regular version of {mode}_join
  2. Filter out all records where event_x occurs before event_y
  3. For each user_id only retain versions of events specified by type
    • any-any will retain all records in Step (2)
    • first-first, first-firstafter, and lastbefore-firstafter will retain only ONE record per user.

Is my understanding correct? It might be useful to add something like this to the documentation so it is really clear to users how these after_joins work.

after_join_all
after_join_all <- function(x, y, by, mode = 'inner', ...){
  types <- c(
    'first-first', 'first-firstafter', 'lastbefore-firstafter',
    'any-firstafter', 'any-any'
  )
  by_type <- function(type){
    after_join(x, y, ..., mode = mode, type = type) %>% 
      mutate(!!type := 'Y') 
  }
  join_fun <- match.fun(paste0(mode, '_join'))
  all_types <- types %>% 
    purrr::map(by_type)
  join_fun(x, y, by = by) %>% 
    Reduce(left_join, all_types, init = .)
}
user_id timestamp.x timestamp.y first-first first-firstafter lastbefore-firstafter any-firstafter any-any
1 2018-07-01 2018-07-02 :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark:
3 2018-07-02 2018-07-02 :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark:
4 2018-07-01 2018-06-10 NA NA NA NA NA
4 2018-07-01 2018-07-02 NA :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark:
4 2018-07-04 2018-06-10 NA NA NA NA NA
4 2018-07-04 2018-07-02 NA NA NA NA NA
5 2018-07-10 2018-07-11 :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark:
5 2018-07-12 2018-07-11 NA NA NA NA NA
6 2018-07-07 2018-07-10 :white_check_mark: :white_check_mark: NA :white_check_mark: :white_check_mark:
6 2018-07-07 2018-07-11 NA NA NA NA :white_check_mark:
6 2018-07-08 2018-07-10 NA NA :white_check_mark: :white_check_mark: :white_check_mark:
6 2018-07-08 2018-07-11 NA NA NA NA :white_check_mark:

ramnathv avatar May 12 '19 17:05 ramnathv

Thanks Ramnath! I'll work on these this week.

robinsones avatar May 13 '19 18:05 robinsones

@ramnathv from my understanding of this function, you'd want to specify by_user and by_time in it, and the by argument would be the same as the by_user. Is that right? If so, it seems simpler to replace by with by_user and by_time

How I reproduced your result:

landed <- tibble::tribble(
  ~user_id, ~timestamp,
  1, "2018-07-01",
  2, "2018-07-01",
  3, "2018-07-02",
  4, "2018-07-01",
  4, "2018-07-04",
  5, "2018-07-10",
  5, "2018-07-12",
  6, "2018-07-07",
  6, "2018-07-08"
) %>%
  mutate(timestamp = as.Date(timestamp))

registered <- tibble::tribble(
  ~user_id, ~timestamp,
  1, "2018-07-02",
  3, "2018-07-02",
  4, "2018-06-10",
  4, "2018-07-02",
  5, "2018-07-11",
  6, "2018-07-10",
  6, "2018-07-11",
  7, "2018-07-07"
) %>%
  mutate(timestamp = as.Date(timestamp))

after_join_all(x = landed, 
               y = registered, 
               by = "user_id",
               by_user = "user_id", 
               by_time = "timestamp") 

robinsones avatar May 23 '19 18:05 robinsones

@robinsones That will work! Note that I am proposing this only as an internal utility function to help make clear the distinction of the output produced by the different variations, since it is a little nuanced. It will only be useful when working with toy datasets.

ramnathv avatar May 23 '19 19:05 ramnathv