doris icon indicating copy to clipboard operation
doris copied to clipboard

[Feature] About supporting sequenceCount function and retention function

Open FreeOnePlus opened this issue 2 years ago • 5 comments

Search before asking

  • [X] I had searched in the issues and found no similar issues.

Description

In ClickHouse, support for sequenceCount function and retention function is provided, but Doris currently does not have these two functions, and hopes to gain support

Use case

The following are the explanations and use cases of the two functions in ClickHouse

sequenceCount(pattern)(time, cond1, cond2, …)

Counts the number of event chains that matched the pattern. The function searches event chains that do not overlap. It starts to search for the next chain after the current chain is matched.

WARNING Events that occur at the same second may lay in the sequence in an undefined order affecting the result.

sequenceCount(pattern)(timestamp, cond1, cond2, ...)

Arguments

timestamp — Column considered to contain time data. Typical data types are Date and DateTime. You can also use any of the supported UInt data types.

cond1, cond2 — Conditions that describe the chain of events. Data type: UInt8. You can pass up to 32 condition arguments. The function takes only the events described in these conditions into account. If the sequence contains data that isn’t described in a condition, the function skips them.

Parameters

pattern — Pattern string. See Pattern syntax. Returned values

Number of non-overlapping event chains that are matched. Type: UInt64.

Example

Consider data in the t table:

┌─time─┬─number─┐ │ 1 │ 1 │ │ 2 │ 3 │ │ 3 │ 2 │ │ 4 │ 1 │ │ 5 │ 3 │ │ 6 │ 2 │ └──────┴────────┘

Count how many times the number 2 occurs after the number 1 with any amount of other numbers between them:

SELECT sequenceCount('(?1).*(?2)')(time, number = 1, number = 2) FROM t

┌─sequenceCount('(?1).*(?2)')(time, equals(number, 1), equals(number, 2))─┐ │ 2 │ └─────────────────────────────────────────────────────────────────────────┘

retention function

The function takes as arguments a set of conditions from 1 to 32 arguments of type UInt8 that indicate whether a certain condition was met for the event. Any condition can be specified as an argument (as in WHERE).

The conditions, except the first, apply in pairs: the result of the second will be true if the first and second are true, of the third if the first and third are true, etc.

Syntax

retention(cond1, cond2, ..., cond32);

Arguments

cond — An expression that returns a UInt8 result (1 or 0). Returned value

The array of 1 or 0.

1 — Condition was met for the event. 0 — Condition wasn’t met for the event. Type: UInt8.

Example

Let’s consider an example of calculating the retention function to determine site traffic.

  1. Сreate a table to illustrate an example.

CREATE TABLE retention_test(date Date, uid Int32) ENGINE = Memory;

INSERT INTO retention_test SELECT '2020-01-01', number FROM numbers(5); INSERT INTO retention_test SELECT '2020-01-02', number FROM numbers(10); INSERT INTO retention_test SELECT '2020-01-03', number FROM numbers(15);

Input table:

Query:

SELECT * FROM retention_test

Result:

┌───────date─┬─uid─┐ │ 2020-01-01 │ 0 │ │ 2020-01-01 │ 1 │ │ 2020-01-01 │ 2 │ │ 2020-01-01 │ 3 │ │ 2020-01-01 │ 4 │ └────────────┴─────┘ ┌───────date─┬─uid─┐ │ 2020-01-02 │ 0 │ │ 2020-01-02 │ 1 │ │ 2020-01-02 │ 2 │ │ 2020-01-02 │ 3 │ │ 2020-01-02 │ 4 │ │ 2020-01-02 │ 5 │ │ 2020-01-02 │ 6 │ │ 2020-01-02 │ 7 │ │ 2020-01-02 │ 8 │ │ 2020-01-02 │ 9 │ └────────────┴─────┘ ┌───────date─┬─uid─┐ │ 2020-01-03 │ 0 │ │ 2020-01-03 │ 1 │ │ 2020-01-03 │ 2 │ │ 2020-01-03 │ 3 │ │ 2020-01-03 │ 4 │ │ 2020-01-03 │ 5 │ │ 2020-01-03 │ 6 │ │ 2020-01-03 │ 7 │ │ 2020-01-03 │ 8 │ │ 2020-01-03 │ 9 │ │ 2020-01-03 │ 10 │ │ 2020-01-03 │ 11 │ │ 2020-01-03 │ 12 │ │ 2020-01-03 │ 13 │ │ 2020-01-03 │ 14 │ └────────────┴─────┘

  1. Group users by unique ID uid using the retention function.

Query:

SELECT uid, retention(date = '2020-01-01', date = '2020-01-02', date = '2020-01-03') AS r FROM retention_test WHERE date IN ('2020-01-01', '2020-01-02', '2020-01-03') GROUP BY uid ORDER BY uid ASC

Result:

┌─uid─┬─r───────┐ │ 0 │ [1,1,1] │ │ 1 │ [1,1,1] │ │ 2 │ [1,1,1] │ │ 3 │ [1,1,1] │ │ 4 │ [1,1,1] │ │ 5 │ [0,0,0] │ │ 6 │ [0,0,0] │ │ 7 │ [0,0,0] │ │ 8 │ [0,0,0] │ │ 9 │ [0,0,0] │ │ 10 │ [0,0,0] │ │ 11 │ [0,0,0] │ │ 12 │ [0,0,0] │ │ 13 │ [0,0,0] │ │ 14 │ [0,0,0] │ └─────┴─────────┘

  1. Calculate the total number of site visits per day.

Query:

SELECT sum(r[1]) AS r1, sum(r[2]) AS r2, sum(r[3]) AS r3 FROM ( SELECT uid, retention(date = '2020-01-01', date = '2020-01-02', date = '2020-01-03') AS r FROM retention_test WHERE date IN ('2020-01-01', '2020-01-02', '2020-01-03') GROUP BY uid )

Result:

┌─r1─┬─r2─┬─r3─┐ │ 5 │ 5 │ 5 │ └────┴────┴────┘

Where:

r1- the number of unique visitors who visited the site during 2020-01-01 (the cond1 condition). r2- the number of unique visitors who visited the site during a specific time period between 2020-01-01 and 2020-01-02 (cond1 and cond2 conditions). r3- the number of unique visitors who visited the site during a specific time period between 2020-01-01 and 2020-01-03 (cond1 and cond3 conditions).

Related issues

No response

Are you willing to submit PR?

  • [ ] Yes I am willing to submit a PR!

Code of Conduct

FreeOnePlus avatar Jun 28 '22 10:06 FreeOnePlus

[WeOpen-Star] I want to try it. please assign this issue to me~

isHuangXin avatar Aug 30 '22 11:08 isHuangXin

[WeOpen-Star] I want to try it. please assign this issue to me~

Hello handsome, my mentor asked me to try to implement the retention function, and I see thst you are trying to implement it. I think if you haven't implemented it yet, could you please give me this task (just Retention function)? thank you very much!

Yukang-Lian avatar Sep 21 '22 03:09 Yukang-Lian

Plz unassign me. Because doris does not support building the project in the M1 ARM programming environment. At that time, my computer is an M1 chip.

isHuangXin avatar Sep 21 '22 03:09 isHuangXin

Thank you very much! In fact, I am also using m1 macbook. I usually develop on macbook and compile on Alibaba Cloud server. This is my suggestion and I hope it can help you. 😄

Yukang-Lian avatar Sep 21 '22 03:09 Yukang-Lian

Thank you very much! In fact, I am also using m1 macbook. I usually develop on macbook and compile on Alibaba Cloud server. This is my suggestion and I hope it can help you. 😄

Get it. Thanks for your advice and I'll try that.

isHuangXin avatar Sep 21 '22 03:09 isHuangXin

retention is picked by a guy. https://github.com/apache/doris/issues/12800.

dataroaring avatar Sep 23 '22 06:09 dataroaring

retention function has been finished #13056 I will start to implement sequenceCount #13118 and sequenceMatch #13117 soon.

Yukang-Lian avatar Oct 04 '22 15:10 Yukang-Lian