doris
doris copied to clipboard
[Feature] About supporting sequenceCount function and retention function
Search before asking
- [X] I had searched in the issues and found no similar issues.
Description
In ClickHouse, support for sequenceCount function and retention function is provided, but Doris currently does not have these two functions, and hopes to gain support
Use case
The following are the explanations and use cases of the two functions in ClickHouse
sequenceCount(pattern)(time, cond1, cond2, …)
Counts the number of event chains that matched the pattern. The function searches event chains that do not overlap. It starts to search for the next chain after the current chain is matched.
WARNING Events that occur at the same second may lay in the sequence in an undefined order affecting the result.
sequenceCount(pattern)(timestamp, cond1, cond2, ...)
Arguments
timestamp — Column considered to contain time data. Typical data types are Date and DateTime. You can also use any of the supported UInt data types.
cond1, cond2 — Conditions that describe the chain of events. Data type: UInt8. You can pass up to 32 condition arguments. The function takes only the events described in these conditions into account. If the sequence contains data that isn’t described in a condition, the function skips them.
Parameters
pattern — Pattern string. See Pattern syntax. Returned values
Number of non-overlapping event chains that are matched. Type: UInt64.
Example
Consider data in the t table:
┌─time─┬─number─┐ │ 1 │ 1 │ │ 2 │ 3 │ │ 3 │ 2 │ │ 4 │ 1 │ │ 5 │ 3 │ │ 6 │ 2 │ └──────┴────────┘
Count how many times the number 2 occurs after the number 1 with any amount of other numbers between them:
SELECT sequenceCount('(?1).*(?2)')(time, number = 1, number = 2) FROM t
┌─sequenceCount('(?1).*(?2)')(time, equals(number, 1), equals(number, 2))─┐ │ 2 │ └─────────────────────────────────────────────────────────────────────────┘
retention function
The function takes as arguments a set of conditions from 1 to 32 arguments of type UInt8 that indicate whether a certain condition was met for the event. Any condition can be specified as an argument (as in WHERE).
The conditions, except the first, apply in pairs: the result of the second will be true if the first and second are true, of the third if the first and third are true, etc.
Syntax
retention(cond1, cond2, ..., cond32);
Arguments
cond — An expression that returns a UInt8 result (1 or 0). Returned value
The array of 1 or 0.
1 — Condition was met for the event. 0 — Condition wasn’t met for the event. Type: UInt8.
Example
Let’s consider an example of calculating the retention function to determine site traffic.
- Сreate a table to illustrate an example.
CREATE TABLE retention_test(date Date, uid Int32) ENGINE = Memory;
INSERT INTO retention_test SELECT '2020-01-01', number FROM numbers(5); INSERT INTO retention_test SELECT '2020-01-02', number FROM numbers(10); INSERT INTO retention_test SELECT '2020-01-03', number FROM numbers(15);
Input table:
Query:
SELECT * FROM retention_test
Result:
┌───────date─┬─uid─┐ │ 2020-01-01 │ 0 │ │ 2020-01-01 │ 1 │ │ 2020-01-01 │ 2 │ │ 2020-01-01 │ 3 │ │ 2020-01-01 │ 4 │ └────────────┴─────┘ ┌───────date─┬─uid─┐ │ 2020-01-02 │ 0 │ │ 2020-01-02 │ 1 │ │ 2020-01-02 │ 2 │ │ 2020-01-02 │ 3 │ │ 2020-01-02 │ 4 │ │ 2020-01-02 │ 5 │ │ 2020-01-02 │ 6 │ │ 2020-01-02 │ 7 │ │ 2020-01-02 │ 8 │ │ 2020-01-02 │ 9 │ └────────────┴─────┘ ┌───────date─┬─uid─┐ │ 2020-01-03 │ 0 │ │ 2020-01-03 │ 1 │ │ 2020-01-03 │ 2 │ │ 2020-01-03 │ 3 │ │ 2020-01-03 │ 4 │ │ 2020-01-03 │ 5 │ │ 2020-01-03 │ 6 │ │ 2020-01-03 │ 7 │ │ 2020-01-03 │ 8 │ │ 2020-01-03 │ 9 │ │ 2020-01-03 │ 10 │ │ 2020-01-03 │ 11 │ │ 2020-01-03 │ 12 │ │ 2020-01-03 │ 13 │ │ 2020-01-03 │ 14 │ └────────────┴─────┘
- Group users by unique ID uid using the retention function.
Query:
SELECT uid, retention(date = '2020-01-01', date = '2020-01-02', date = '2020-01-03') AS r FROM retention_test WHERE date IN ('2020-01-01', '2020-01-02', '2020-01-03') GROUP BY uid ORDER BY uid ASC
Result:
┌─uid─┬─r───────┐ │ 0 │ [1,1,1] │ │ 1 │ [1,1,1] │ │ 2 │ [1,1,1] │ │ 3 │ [1,1,1] │ │ 4 │ [1,1,1] │ │ 5 │ [0,0,0] │ │ 6 │ [0,0,0] │ │ 7 │ [0,0,0] │ │ 8 │ [0,0,0] │ │ 9 │ [0,0,0] │ │ 10 │ [0,0,0] │ │ 11 │ [0,0,0] │ │ 12 │ [0,0,0] │ │ 13 │ [0,0,0] │ │ 14 │ [0,0,0] │ └─────┴─────────┘
- Calculate the total number of site visits per day.
Query:
SELECT sum(r[1]) AS r1, sum(r[2]) AS r2, sum(r[3]) AS r3 FROM ( SELECT uid, retention(date = '2020-01-01', date = '2020-01-02', date = '2020-01-03') AS r FROM retention_test WHERE date IN ('2020-01-01', '2020-01-02', '2020-01-03') GROUP BY uid )
Result:
┌─r1─┬─r2─┬─r3─┐ │ 5 │ 5 │ 5 │ └────┴────┴────┘
Where:
r1- the number of unique visitors who visited the site during 2020-01-01 (the cond1 condition). r2- the number of unique visitors who visited the site during a specific time period between 2020-01-01 and 2020-01-02 (cond1 and cond2 conditions). r3- the number of unique visitors who visited the site during a specific time period between 2020-01-01 and 2020-01-03 (cond1 and cond3 conditions).
Related issues
No response
Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
[WeOpen-Star] I want to try it. please assign this issue to me~
[WeOpen-Star] I want to try it. please assign this issue to me~
Hello handsome, my mentor asked me to try to implement the retention function, and I see thst you are trying to implement it. I think if you haven't implemented it yet, could you please give me this task (just Retention function)? thank you very much!
Plz unassign me. Because doris does not support building the project in the M1 ARM programming environment. At that time, my computer is an M1 chip.
Thank you very much! In fact, I am also using m1 macbook. I usually develop on macbook and compile on Alibaba Cloud server. This is my suggestion and I hope it can help you. 😄
Thank you very much! In fact, I am also using m1 macbook. I usually develop on macbook and compile on Alibaba Cloud server. This is my suggestion and I hope it can help you. 😄
Get it. Thanks for your advice and I'll try that.
retention is picked by a guy. https://github.com/apache/doris/issues/12800.
retention
function has been finished #13056
I will start to implement sequenceCount
#13118 and sequenceMatch
#13117 soon.