
Should the funnel join only operate within groups?

Open tmastny opened this issue 6 years ago • 2 comments

Hi Emily, thanks for creating the package!

What if the funnel join only matched events within groups? I ask because in my domain I'm more interested in funnels with deadlines than in funnels with gaps: https://timmastny.rbind.io/blog/funnel-charts-funneljoin-gaps-deadlines/

Let me give you an example. Let's say I only want to count the firstafter event if it occurs in the same calendar week.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(purrr)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date
library(funneljoin)
#> 
#> Attaching package: 'funneljoin'
#> The following object is masked from 'package:stats':
#> 
#>     filter

logs <- tribble(
  ~date, ~event,
  "2020-01-06", "upload",
  "2020-01-08", "print",
  "2020-01-13", "upload",
  "2020-01-20", "print",
  "2020-01-21", "upload"
) %>%
  mutate(date = as.Date(date)) %>%
  mutate(user = 1)

Under the same-week deadline constraint, the "upload" on "2020-01-13" should not convert to a "print", because no "print" follows it within the same calendar week.

I hoped this might work:

logs %>%
  mutate(deadline = floor_date(date, "week")) %>%
  group_by(deadline) %>%
  funnel_start(
    moment_type = "upload",
    moment = "event",
    tstamp = "date",
    user = "user"
  ) %>%
  funnel_step(
    moment_type = "print",
    type = "any-firstafter"
  )
#> Adding missing grouping variables: `deadline_upload`
#> # A tibble: 12 x 6
#> # Groups:   deadline_upload.x [3]
#>    date_upload  user deadline_upload… deadline_upload… date_print
#>    <date>      <dbl> <date>           <date>           <date>    
#>  1 2020-01-06      1 2020-01-05       2020-01-05       2020-01-08
#>  2 2020-01-06      1 2020-01-05       2020-01-05       2020-01-20
#>  3 2020-01-06      1 2020-01-05       2020-01-12       2020-01-08
#>  4 2020-01-06      1 2020-01-05       2020-01-12       2020-01-20
#>  5 2020-01-13      1 2020-01-12       2020-01-05       2020-01-08
#>  6 2020-01-13      1 2020-01-12       2020-01-05       2020-01-20
#>  7 2020-01-13      1 2020-01-12       2020-01-12       2020-01-08
#>  8 2020-01-13      1 2020-01-12       2020-01-12       2020-01-20
#>  9 2020-01-21      1 2020-01-19       2020-01-05       2020-01-08
#> 10 2020-01-21      1 2020-01-19       2020-01-05       2020-01-20
#> 11 2020-01-21      1 2020-01-19       2020-01-12       2020-01-08
#> 12 2020-01-21      1 2020-01-19       2020-01-12       2020-01-20
#> # … with 1 more variable: deadline_print <date>

My thought was that since I did group_by(deadline), it would only join on the firstafter events within the same deadline value. Unfortunately that's not the case because date_upload == "2020-01-06" is being joined to date_print == "2020-01-20".

In fact, I'm not sure what it is doing 😬. Maybe it's useful in some other way.

Here's my work-around:

logs %>%
  mutate(deadline = floor_date(date, "week")) %>%
  nest(events = -deadline) %>%
  mutate(conversions = map(
    events, 
    ~funnel_start(
      .,
      moment_type = "upload",
      moment = "event",
      tstamp = "date",
      user = "user"
    ) %>%
      funnel_step(
        moment_type = "print",
        type = "any-firstafter"
      ) %>%
        summarize_conversions(date_print)
  )) %>%
  unnest(conversions) %>%
  select(-events)
#> # A tibble: 3 x 4
#>   deadline   nb_users nb_conversions pct_converted
#>   <date>        <int>          <int>         <dbl>
#> 1 2020-01-05        1              1             1
#> 2 2020-01-12        1              0             0
#> 3 2020-01-19        1              0             0

Created on 2019-11-09 by the reprex package (v0.2.1)

It's not bad, but a little messy.

What are your thoughts on having funnel_start and funnel_step[s] match events only within the group? Is it even possible? Does it match how you think it should work?

tmastny, Nov 09 '19 21:11

Hi @tmastny - this should be solved when pull request #41 is merged (waiting on review from @dgrtwo), which enables you to specify multiple "user" columns to join on. For your case, you would do:

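logs %>%
  # the deadline column must exist before it can be used as a join key
  mutate(deadline = floor_date(date, "week")) %>%
  funnel_start(moment_type = "upload", moment = "event",
               tstamp = "date", user = c("deadline", "user")) %>%
  funnel_step(moment_type = "print", type = "any-firstafter")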

And get back:

# A tibble: 3 x 3
  date_upload deadline_user date_print
  <date>      <chr>         <date>    
1 2020-01-06  2020-01-05_1  2020-01-08
2 2020-01-13  2020-01-12_1  NA        
3 2020-01-21  2020-01-19_1  NA        

One pending question is whether that table would have the combined deadline_user column, as it does now, or if we'd separate them back out into two columns.
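If the combined column is kept, it can be split back out on the caller's side. A minimal sketch, assuming the `<deadline>_<user>` format shown in the output above; the `joined` tibble is retyped by hand from that output, and `split_out` is a name used only for illustration:

```r
library(dplyr)
library(tidyr)

# Hand-copied from the output above; not produced by funneljoin here.
joined <- tribble(
  ~date_upload, ~deadline_user, ~date_print,
  "2020-01-06", "2020-01-05_1", "2020-01-08",
  "2020-01-13", "2020-01-12_1", NA,
  "2020-01-21", "2020-01-19_1", NA
) %>%
  mutate(across(c(date_upload, date_print), as.Date))

# Split on the underscore; convert = TRUE turns the user id into an integer.
split_out <- joined %>%
  separate(deadline_user, into = c("deadline", "user"),
           sep = "_", convert = TRUE) %>%
  mutate(deadline = as.Date(deadline))
```

This only works cleanly when neither key contains an underscore, which is one argument for returning the columns separately.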

robinsones, Jan 21 '20 20:01

Also, thanks for writing your blog posts!

robinsones, Jan 21 '20 20:01
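For comparison, the same-week rule discussed in this thread can be checked without funneljoin. The sketch below uses only dplyr and lubridate on the `logs` data from the original post. It is not how funneljoin works internally, and it assumes a "print" counts only when its date is strictly later than the "upload"; the names `weekly`, `uploads`, `prints`, and `conversions` are illustrative.

```r
library(dplyr)
library(lubridate)

logs <- tribble(
  ~date, ~event,
  "2020-01-06", "upload",
  "2020-01-08", "print",
  "2020-01-13", "upload",
  "2020-01-20", "print",
  "2020-01-21", "upload"
) %>%
  mutate(date = as.Date(date), user = 1)

# Tag every event with the start of its calendar week (Sunday by default).
weekly <- logs %>%
  mutate(deadline = floor_date(date, "week"))

uploads <- weekly %>%
  filter(event == "upload") %>%
  select(user, deadline, date_upload = date)

prints <- weekly %>%
  filter(event == "print") %>%
  select(user, deadline, date_print = date)

conversions <- uploads %>%
  # Joining on user AND deadline restricts matches to the same week.
  left_join(prints, by = c("user", "deadline")) %>%
  # Assumption: discard prints that are not strictly after the upload.
  mutate(date_print = if_else(
    !is.na(date_print) & date_print > date_upload,
    date_print, as.Date(NA)
  )) %>%
  # arrange() sorts NA last, so the first row per upload is the earliest print.
  arrange(user, date_upload, date_print) %>%
  distinct(user, date_upload, .keep_all = TRUE)
```

On this data, `conversions` has one row per upload, and only the 2020-01-06 upload has a non-missing `date_print` (2020-01-08), which agrees with the per-week counts from the nest/map work-around.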