Should the funnel join only operate within groups?
Hi Emily, thanks for creating the package!
What if the funnel join only matched events within groups? I ask because in my domain I'm more interested in funnels with deadlines than in funnels with gaps: https://timmastny.rbind.io/blog/funnel-charts-funneljoin-gaps-deadlines/
Let me give you an example. Let's say I only want to count the firstafter event if it occurs in the same calendar week.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
library(purrr)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#> date
library(funneljoin)
#>
#> Attaching package: 'funneljoin'
#> The following object is masked from 'package:stats':
#>
#> filter
logs <- tribble(
  ~date, ~event,
  "2020-01-06", "upload",
  "2020-01-08", "print",
  "2020-01-13", "upload",
  "2020-01-20", "print",
  "2020-01-21", "upload"
) %>%
  mutate(date = as.Date(date)) %>%
  mutate(user = 1)
Under the same-week deadline, the "upload" on "2020-01-13" should not convert to a "print", because there is no firstafter "print" within the same calendar week.
I hoped this might work:
logs %>%
  mutate(deadline = floor_date(date, "week")) %>%
  group_by(deadline) %>%
  funnel_start(
    moment_type = "upload",
    moment = "event",
    tstamp = "date",
    user = "user"
  ) %>%
  funnel_step(
    moment_type = "print",
    type = "any-firstafter"
  )
#> Adding missing grouping variables: `deadline_upload`
#> # A tibble: 12 x 6
#> # Groups: deadline_upload.x [3]
#> date_upload user deadline_upload… deadline_upload… date_print
#> <date> <dbl> <date> <date> <date>
#> 1 2020-01-06 1 2020-01-05 2020-01-05 2020-01-08
#> 2 2020-01-06 1 2020-01-05 2020-01-05 2020-01-20
#> 3 2020-01-06 1 2020-01-05 2020-01-12 2020-01-08
#> 4 2020-01-06 1 2020-01-05 2020-01-12 2020-01-20
#> 5 2020-01-13 1 2020-01-12 2020-01-05 2020-01-08
#> 6 2020-01-13 1 2020-01-12 2020-01-05 2020-01-20
#> 7 2020-01-13 1 2020-01-12 2020-01-12 2020-01-08
#> 8 2020-01-13 1 2020-01-12 2020-01-12 2020-01-20
#> 9 2020-01-21 1 2020-01-19 2020-01-05 2020-01-08
#> 10 2020-01-21 1 2020-01-19 2020-01-05 2020-01-20
#> 11 2020-01-21 1 2020-01-19 2020-01-12 2020-01-08
#> 12 2020-01-21 1 2020-01-19 2020-01-12 2020-01-20
#> # … with 1 more variable: deadline_print <date>
My thought was that since I did group_by(deadline), it would only join on the firstafter events within the same deadline value. Unfortunately that's not the case because date_upload == "2020-01-06" is being joined to date_print == "2020-01-20".
In fact, I'm not sure what it is doing 😬. Maybe it's useful in some other way.
Here's my work-around:
logs %>%
  mutate(deadline = floor_date(date, "week")) %>%
  nest(events = -deadline) %>%
  mutate(conversions = map(
    events,
    ~funnel_start(
      .,
      moment_type = "upload",
      moment = "event",
      tstamp = "date",
      user = "user"
    ) %>%
      funnel_step(
        moment_type = "print",
        type = "any-firstafter"
      ) %>%
      summarize_conversions(date_print)
  )) %>%
  unnest(conversions) %>%
  select(-events)
#> # A tibble: 3 x 4
#> deadline nb_users nb_conversions pct_converted
#> <date> <int> <int> <dbl>
#> 1 2020-01-05 1 1 1
#> 2 2020-01-12 1 0 0
#> 3 2020-01-19 1 0 0
Created on 2019-11-09 by the reprex package (v0.2.1)
It's not bad, but a little messy.
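For comparison, the same-week rule can also be sketched without funneljoin, as a plain dplyr join on user and week. This only illustrates the result I'd expect; it is not how funneljoin works internally. The names `with_week`, `uploads`, `prints`, and `same_week` are made up for this sketch, and I'm assuming "firstafter" means strictly later than the upload.

```r
library(dplyr)
library(lubridate)

logs <- tribble(
  ~date, ~event,
  "2020-01-06", "upload",
  "2020-01-08", "print",
  "2020-01-13", "upload",
  "2020-01-20", "print",
  "2020-01-21", "upload"
) %>%
  mutate(date = as.Date(date), user = 1)

# Tag every event with the start of its calendar week (weeks start on Sunday by default).
with_week <- logs %>%
  mutate(deadline = floor_date(date, "week"))

uploads <- with_week %>%
  filter(event == "upload") %>%
  select(user, deadline, date_upload = date)

prints <- with_week %>%
  filter(event == "print") %>%
  select(user, deadline, date_print = date)

same_week <- uploads %>%
  # Joining on the week as well as the user confines matches to the same week.
  left_join(prints, by = c("user", "deadline")) %>%
  # Assumption: a print only counts if it is strictly after the upload.
  mutate(date_print = if_else(date_print > date_upload, date_print, as.Date(NA))) %>%
  # arrange() sorts NA last, so the first row per upload is its earliest valid print.
  arrange(user, date_upload, date_print) %>%
  distinct(user, date_upload, .keep_all = TRUE)

same_week
```

On the sample data this keeps one row per upload, and only the "2020-01-06" upload gets a date_print ("2020-01-08"), which agrees with the conversion counts from the work-around.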
What are your thoughts on funnel_start and funnel_step[s] only matching events within the group? Is it even possible? Does it match how you think it should work?
Hi @tmastny - this should be solved when pull request #41 is merged (waiting on review from @dgrtwo), which enables you to specify multiple "user" columns to join on. For your case, you would do:
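logs %>%
  # deadline has to exist as a column before it can be used as a join key
  mutate(deadline = floor_date(date, "week")) %>%
  funnel_start(moment_type = "upload", moment = "event",
               tstamp = "date", user = c("deadline", "user")) %>%
  funnel_step(moment_type = "print", type = "any-firstafter")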
And get back:
# A tibble: 3 x 3
date_upload deadline_user date_print
<date> <chr> <date>
1 2020-01-06 2020-01-05_1 2020-01-08
2 2020-01-13 2020-01-12_1 NA
3 2020-01-21 2020-01-19_1 NA
One pending question is whether that table would have the combined deadline_user column, as it does now, or if we'd separate them back out into two columns.
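If the combined column is kept, splitting it back out afterwards is short. The sketch below uses tidyr's `separate()` on a hand-typed copy of the table above (it is not produced by funneljoin here), and it assumes the two parts are always joined by a single underscore. The names `joined` and `split_cols` are just for illustration.

```r
library(dplyr)
library(tidyr)

# Hand-typed copy of the result shown above, for illustration only.
joined <- tribble(
  ~date_upload, ~deadline_user, ~date_print,
  "2020-01-06", "2020-01-05_1", "2020-01-08",
  "2020-01-13", "2020-01-12_1", NA,
  "2020-01-21", "2020-01-19_1", NA
) %>%
  mutate(
    date_upload = as.Date(date_upload),
    date_print = as.Date(date_print)
  )

split_cols <- joined %>%
  # Assumption: exactly one "_" separates the deadline from the user id.
  separate(deadline_user, into = c("deadline", "user"), sep = "_") %>%
  # separate() returns character columns, so restore the original types.
  mutate(deadline = as.Date(deadline), user = as.numeric(user))

split_cols
```

This would break if a user id itself contained an underscore, which is one argument for returning the columns separately in the first place.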
Also, thanks for writing your blog posts!