
Funnels

Open · epgts opened this issue 2 years ago · 1 comment

  • Current status: prototype branch - doc
  • Effort remaining: lots

Purpose

TODO

Next steps

Next step is to resolve these questions and publish a first experimental function (or functions).

Questions

What use cases do we want to support?

Need more discussion with customers.

Is storing an INTEGER handle right for those? No. The aggregate should only cover event and time, with users (or sessions) coming from GROUP BY (see David's example under "from David" below). At that point this looks a lot like a timevector, and the duplication is a shame (but timevector only supports floats...).

Which use case do we want to start with?

within_interval is just off the top of epg's head. What does a useful interface look like? David offered an idea (see "from David" below).

Ensure we're not duplicating efforts of other teams such as promscale

Use cases

TODO - the examples below are just speculative.

Test Data

Examples below are tested against the following data:

DROP TABLE IF EXISTS funnel_test;
CREATE TABLE funnel_test(
	"time" TIMESTAMPTZ,
	"user" INTEGER,
	"event" TEXT);
INSERT INTO funnel_test VALUES
	('2020-01-01 00:00:00+00', 1, 'LOGIN'),
	('2020-01-02 00:00:00+00', 1, 'LOGIN'),
	('2020-01-02 01:00:00+00', 1, 'SEARCH'),
	('2020-01-02 03:00:00+00', 1, 'BUY'),

	('2020-01-03 00:00:00+00', 2, 'LOGIN'),
	('2020-01-03 01:00:00+00', 2, 'SEARCH'),
	('2020-01-03 01:30:00+00', 2, 'BUY'),
	('2020-01-20 00:00:00+00', 2, 'LOGIN'),

	('2020-02-01 01:00:00+00', 3, 'SEARCH'),
	('2020-02-14 01:30:00+00', 3, 'BUY'),

	('2020-02-15 00:00:00+00', 4, 'LOGIN'),
	('2020-02-28 00:00:00+00', 4, 'LOGIN');

two LOGIN events within one week

SELECT a.user
FROM funnel_test AS a
    JOIN funnel_test AS b
    ON a.user = b.user
WHERE
    a.event = 'LOGIN'
    AND b.event = 'LOGIN'
    AND a.time <> b.time
    AND a.time - b.time > '0 week'::interval
    AND a.time - b.time < '1 week'::interval
;
 user
------
    1
SELECT toolkit_experimental.within_interval(
	'LOGIN', 'LOGIN', '1 week',
	toolkit_experimental.funnel_agg("user", event, time))
	userid
	FROM funnel_test;
 userid
--------
      1
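
To pin down the semantics the two queries above share, here is a rough Python sketch of what within_interval is meant to compute over the test data: for each user, is there a pair of the two named events with a strictly positive gap smaller than the interval. The function name and shape are hypothetical, not the toolkit's implementation:

```python
from datetime import datetime, timedelta

def within_interval(first, second, interval, events):
    """Users having a `second` event strictly after a `first` event,
    within `interval` of it (sketch of the proposed semantics)."""
    hits = set()
    for user_a, event_a, time_a in events:
        for user_b, event_b, time_b in events:
            if (user_a == user_b
                    and event_a == first and event_b == second
                    and timedelta(0) < time_b - time_a < interval):
                hits.add(user_a)
    return sorted(hits)

# the funnel_test rows from above
rows = [
    (1, 'LOGIN',  datetime(2020, 1, 1, 0)),
    (1, 'LOGIN',  datetime(2020, 1, 2, 0)),
    (1, 'SEARCH', datetime(2020, 1, 2, 1)),
    (1, 'BUY',    datetime(2020, 1, 2, 3)),
    (2, 'LOGIN',  datetime(2020, 1, 3, 0)),
    (2, 'SEARCH', datetime(2020, 1, 3, 1)),
    (2, 'BUY',    datetime(2020, 1, 3, 1, 30)),
    (2, 'LOGIN',  datetime(2020, 1, 20, 0)),
    (3, 'SEARCH', datetime(2020, 2, 1, 1)),
    (3, 'BUY',    datetime(2020, 2, 14, 1, 30)),
    (4, 'LOGIN',  datetime(2020, 2, 15, 0)),
    (4, 'LOGIN',  datetime(2020, 2, 28, 0)),
]

print(within_interval('LOGIN', 'LOGIN', timedelta(weeks=1), rows))  # [1]
```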

two LOGIN events in a row

WITH t AS (
    SELECT *, row_number() OVER (PARTITION BY f.user ORDER BY time)
    FROM funnel_test AS f
)
SELECT a.user
FROM t AS a
    JOIN t AS b ON a.user = b.user
WHERE
    a.event = 'LOGIN'
    AND b.event = 'LOGIN'
    AND a.row_number - b.row_number > 0
    AND a.row_number - b.row_number <= 1
;
 user
------
    1
    4
SELECT toolkit_experimental.consecutive(
	'LOGIN', 'LOGIN',
	toolkit_experimental.funnel_agg("user", event, time))
	userid
	FROM funnel_test
	ORDER BY userid;
 userid
--------
      1
      4
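
Similarly, a hedged sketch of what consecutive would compute: group events per user, order by time, and look for the two names adjacent in the stream. Timestamps are simplified to integers here since only their order matters (hypothetical, not the actual implementation):

```python
from collections import defaultdict

def consecutive(first, second, events):
    """Users whose time-ordered event stream contains `second`
    immediately after `first` (sketch of the proposed semantics)."""
    by_user = defaultdict(list)
    for user, event, time in events:
        by_user[user].append((time, event))
    hits = []
    for user, stream in sorted(by_user.items()):
        stream.sort()  # order each user's events by time
        names = [event for _, event in stream]
        if any(a == first and b == second for a, b in zip(names, names[1:])):
            hits.append(user)
    return hits

# the funnel_test rows, with times reduced to per-user sequence numbers
rows = [
    (1, 'LOGIN', 1), (1, 'LOGIN', 2), (1, 'SEARCH', 3), (1, 'BUY', 4),
    (2, 'LOGIN', 1), (2, 'SEARCH', 2), (2, 'BUY', 3), (2, 'LOGIN', 4),
    (3, 'SEARCH', 1), (3, 'BUY', 2),
    (4, 'LOGIN', 1), (4, 'LOGIN', 2),
]

print(consecutive('LOGIN', 'LOGIN', rows))  # [1, 4]
```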

BUY event within one hour of SEARCH

SELECT toolkit_experimental.within_interval(
	'SEARCH', 'BUY', '1 hour',
	toolkit_experimental.funnel_agg("user", event, time))
	userid
	FROM funnel_test;
 userid
--------
       2

BUY event immediately following SEARCH

SELECT toolkit_experimental.consecutive(
	'SEARCH', 'BUY',
	toolkit_experimental.funnel_agg("user", event, time))
	userid
	FROM funnel_test
	ORDER BY userid;
 userid
--------
      1
      2
      3

from David

https://github.com/timescale/timescaledb-toolkit/pull/474#discussion_r925692902

CREATE MATERIALIZED VIEW funnels AS
SELECT "user", toolkit_experimental.funnel_agg(event, time) AS funnel
FROM funnel_test
GROUP BY "user";

-- Then we can search with the WHERE clause:

SELECT * FROM funnels p
WHERE funnel ? event('LOGIN') -> event('SEARCH', within => '1 hour'::interval)
OR funnel ? event('LOGIN') -> event('LOGIN', within => '1 hour'::interval, consecutive => true);

into_values

SELECT handle, event, time FROM
	toolkit_experimental.into_values(
		(SELECT toolkit_experimental.funnel_agg("user", event, time) FROM funnel_test))
	ORDER BY time, handle, event;
 handle |  event | time
--------+--------+------------------------
      1 |  LOGIN | 2020-01-01 00:00:00+00
      1 |  LOGIN | 2020-01-02 00:00:00+00
      1 | SEARCH | 2020-01-02 01:00:00+00
      1 |    BUY | 2020-01-02 03:00:00+00
      2 |  LOGIN | 2020-01-03 00:00:00+00
      2 | SEARCH | 2020-01-03 01:00:00+00
      2 |    BUY | 2020-01-03 01:30:00+00
      2 |  LOGIN | 2020-01-20 00:00:00+00
      3 | SEARCH | 2020-02-01 01:00:00+00
      3 |    BUY | 2020-02-14 01:30:00+00
      4 |  LOGIN | 2020-02-15 00:00:00+00
      4 |  LOGIN | 2020-02-28 00:00:00+00

Implementation

Current status

This is just an exploratory prototype.

Time to implement a new function over FunnelAggregate is on the order of a week. If FunnelAggregate is insufficient as is, time to implement goes up.

Currently requires rows to appear ordered by time.

Performance (aspirational)

What requirements do we have?

Space

Current implementation stores 16 + E + 24N bytes where N is the number of events and E is the number of bytes required to store one copy of each event name.
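
To make the formula concrete, here's the estimate for the 12-row test table above, reading E as the summed byte lengths of the unique event names (LOGIN, SEARCH, BUY — an assumption; any per-name length prefix would add a little on top):

```python
def funnel_agg_size(event_names):
    """Estimate serialized size: a 16-byte header, one copy of each
    unique event name (E bytes total), and 24 bytes per event (N events)."""
    unique_bytes = sum(len(name) for name in set(event_names))
    return 16 + unique_bytes + 24 * len(event_names)

# the 12 events in funnel_test: 6 LOGIN, 3 SEARCH, 3 BUY
events = ['LOGIN'] * 6 + ['SEARCH'] * 3 + ['BUY'] * 3
print(funnel_agg_size(events))  # 16 + 14 + 24*12 = 318
```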

However, building the aggregate currently requires keeping in memory a separate copy of the event name for each event (plus String and padding overhead), as well as two more copies of just the unique event names (i.e. 2E), plus the overhead of one std::collections::HashMap.

There are tricks we can use to reduce the transient memory footprint, but that gets pretty hairy, and I think we can reduce the footprint a lot more by using event id rather than name. See "Event id" under "Alternatives", below.
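
One way to picture that reduction: intern each event name to a small integer once, so the aggregate carries one copy of each name plus a compact id per event instead of a String per event. A hypothetical Python sketch of the idea, not the toolkit's internals:

```python
def intern_events(names):
    """Map event names to dense integer ids.  Returns the name->id
    table (stored once) and the per-event id stream (stored per event)."""
    ids = {}
    encoded = []
    for name in names:
        if name not in ids:
            ids[name] = len(ids)  # assign the next id on first sight
        encoded.append(ids[name])
    return ids, encoded

ids, encoded = intern_events(['LOGIN', 'SEARCH', 'LOGIN', 'BUY'])
# ids == {'LOGIN': 0, 'SEARCH': 1, 'BUY': 2}; encoded == [0, 1, 0, 2]
```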

Alternatives

Event id rather than name

How much burden is this on the user? If they have a table like this:

DROP TABLE IF EXISTS funnel_events;
CREATE TABLE funnel_events(
	"id" INTEGER,
	"name" TEXT);
INSERT INTO funnel_events VALUES
	(1, 'LOGIN'),
	(2, 'BUY');

A query like this seems to work (I didn't try it end to end; I just hacked what I had to accept INTEGER instead of TEXT with a hard-coded result, and that much at least worked):

SELECT toolkit_experimental.within_interval(
	(SELECT id FROM funnel_events WHERE name = 'LOGIN'),
	(SELECT id FROM funnel_events WHERE name = 'LOGIN'),
	'1 week',
	toolkit_experimental.funnel_agg2(
		"user",
		(SELECT id FROM funnel_events WHERE name = event),
		time))
	FROM funnel_test;
 within_interval
-----------------
               1

If we're only looking at processing 1 million rows, we're only talking about savings of megabytes, which is probably not worth the effort nowadays. But if we're talking about billions of rows and reducing 10s of gigabytes to only a few gigabytes, that starts to sound worth it.

I implemented this and compared the resident set size of the int and string versions of the program over 10 million rows of data; the int form consumed only about 20% of the memory the strings form did:

  • 3136 MB for strings
  • 641 MB for ints

epgts avatar Jul 20 '22 18:07 epgts

To understand what is currently achievable via SQL (and to what extent we'd be improving the user experience), here's my first guess at how you could accomplish this currently.

  • within_interval('LOGIN', 'LOGIN', '1 week'):
SELECT a.user 
FROM funnel_test AS a
    JOIN funnel_test AS b
    ON a.user = b.user
WHERE
    a.event = 'LOGIN'
    AND b.event = 'LOGIN'
    AND a.time <> b.time
    AND a.time - b.time > '0 week'::interval
    AND a.time - b.time < '1 week'::interval
;
  • consecutive('LOGIN', 'LOGIN'):
WITH t AS (
    SELECT *, lead(event) OVER (PARTITION BY f.user ORDER BY time) AS next_event
    FROM funnel_test AS f
)
SELECT t.user
FROM t 
WHERE event = 'LOGIN'
    AND next_event = 'LOGIN'
;

(anyone should feel free to edit this comment if they have improvements upon these queries)

I think that if the within_interval and consecutive functions could accept longer sequences of events, they'd be clearer improvements. With the approach above, capturing a sequence of N events requires N-1 self joins. I like the composability of David's event(...) idea.
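
A single pass over each user's time-ordered events can evaluate an arbitrary-length sequence with per-step time constraints, which is roughly what a composed event(...) chain would need under the hood. A Python sketch (the function name and step encoding are hypothetical, not the toolkit's API); it tracks the latest time each prefix of the sequence matched, so a later occurrence of an early step can still satisfy a tight gap:

```python
from datetime import datetime, timedelta

def matches_funnel(stream, steps):
    """True if the time-ordered (time, event) stream contains the steps
    in order, where each step is (event_name, max_gap_or_None)."""
    # done[k]: latest time at which the first k steps have been matched
    done = [None] * (len(steps) + 1)
    for time, event in stream:
        # scan steps from last to first so one event never matches twice
        for j in range(len(steps) - 1, -1, -1):
            want, max_gap = steps[j]
            if event != want:
                continue
            if j > 0:
                if done[j] is None:
                    continue  # earlier steps haven't matched yet
                if max_gap is not None and time - done[j] > max_gap:
                    continue  # too long since the previous step
            done[j + 1] = time
    return done[len(steps)] is not None

# user 2's events from the test table: LOGIN then SEARCH within an hour
stream = [
    (datetime(2020, 1, 3, 0, 0), 'LOGIN'),
    (datetime(2020, 1, 3, 1, 0), 'SEARCH'),
    (datetime(2020, 1, 3, 1, 30), 'BUY'),
    (datetime(2020, 1, 20, 0, 0), 'LOGIN'),
]
print(matches_funnel(stream, [('LOGIN', None), ('SEARCH', timedelta(hours=1))]))  # True
```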

rtwalker avatar Jul 20 '22 20:07 rtwalker