[Discussion] SQL (or extended) statement to convert the data of sliding windows to rows.
## Requirement
There are some real-world time series forecasting scenarios, such as forecasting pv/uv or traffic load from historical data. The sample data is as follows:
| app_id | time | uv | pv |
|---|---|---|---|
| 1 | 1 | 1 | 5 |
| 1 | 2 | 2 | 6 |
| 1 | 3 | 3 | 7 |
| 1 | 4 | 4 | 8 |
| 2 | 1 | 11 | 12 |
| 2 | 2 | 13 | 14 |
| 2 | 3 | 15 | 16 |
| 2 | 4 | 17 | 18 |
The table above contains the data of two entities, app_id 1 and 2. Each entity has uv/pv values at 4 time steps: 1, 2, 3, 4.
A common transformation for time series data is converting the data of sliding windows to rows. If the window size is 2, the result table is:
| app_id | time | uv | uv_1 | uv_2 | pv | pv_1 | pv_2 |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | NULL | NULL | 5 | NULL | NULL |
| 1 | 2 | 2 | 1 | NULL | 6 | 5 | NULL |
| 1 | 3 | 3 | 2 | 1 | 7 | 6 | 5 |
| 1 | 4 | 4 | 3 | 2 | 8 | 7 | 6 |
| 2 | 1 | 11 | NULL | NULL | 12 | NULL | NULL |
| 2 | 2 | 13 | 11 | NULL | 14 | 12 | NULL |
| 2 | 3 | 15 | 13 | 11 | 16 | 14 | 12 |
| 2 | 4 | 17 | 15 | 13 | 18 | 16 | 14 |
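For illustration only (not part of SQLFlow), the same transformation can be expressed with pandas. The `windows_to_rows` helper below is a minimal sketch that reproduces the table above, except that the lag columns are appended after the original columns and missing values appear as NaN instead of NULL:

```python
import pandas as pd

# Sample data from the first table above.
df = pd.DataFrame({
    "app_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "time":   [1, 2, 3, 4, 1, 2, 3, 4],
    "uv":     [1, 2, 3, 4, 11, 13, 15, 17],
    "pv":     [5, 6, 7, 8, 12, 14, 16, 18],
})

def windows_to_rows(df, window_size, entity_col="app_id", time_col="time"):
    """Append one lagged copy of every value column per window step."""
    df = df.sort_values([entity_col, time_col]).copy()
    value_cols = [c for c in df.columns if c not in (entity_col, time_col)]
    for col in value_cols:
        for lag in range(1, window_size + 1):
            # Shift within each entity so windows never cross app_id boundaries.
            df[f"{col}_{lag}"] = df.groupby(entity_col)[col].shift(lag)
    return df

print(windows_to_rows(df, window_size=2))
```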
Points:
- Does the existing SQL syntax (MaxCompute / Hive / MySQL) support this transformation? If yes, how complex is it?
- Do we need to extend the SQL syntax in SQLFlow (such as a `TO RUN` clause)?
MaxCompute SQL:
```sql
SELECT
  app_id,
  time,
  uv,
  LAG(uv, 1) OVER (PARTITION BY app_id ORDER BY time) AS uv_1,
  LAG(uv, 2) OVER (PARTITION BY app_id ORDER BY time) AS uv_2,
  pv,
  LAG(pv, 1) OVER (PARTITION BY app_id ORDER BY time) AS pv_1,
  LAG(pv, 2) OVER (PARTITION BY app_id ORDER BY time) AS pv_2
FROM ts_data
ORDER BY app_id, time;
```
PS: If the window size is larger (10 or more), the SQL statement will be long.
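One way to keep this manageable is to generate the statement instead of writing it by hand. The helper below is a rough sketch (the function name and parameters are made up for illustration) that builds the LAG-based SELECT for an arbitrary window size:

```python
def lag_select(table, entity_col, time_col, value_cols, window_size):
    """Build a LAG-based SELECT statement for the given window size."""
    exprs = [entity_col, time_col]
    for col in value_cols:
        exprs.append(col)
        for lag in range(1, window_size + 1):
            exprs.append(
                f"LAG({col}, {lag}) OVER "
                f"(PARTITION BY {entity_col} ORDER BY {time_col}) AS {col}_{lag}"
            )
    return (
        "SELECT\n  " + ",\n  ".join(exprs)
        + f"\nFROM {table}\nORDER BY {entity_col}, {time_col};"
    )

# A window size of 10 already yields 20 LAG expressions just for uv/pv.
print(lag_select("ts_data", "app_id", "time", ["uv", "pv"], window_size=10))
```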
We will add a `TO RUN` statement in SQLFlow. #2355
We can wrap the logic of converting sliding-window data to rows into a Python program and build it into a Docker image. SQLFlow can then invoke this functionality using the `TO RUN` statement.
In this way, users don't need to write a long SQL statement to complete this task.
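A rough sketch of what the program inside the Docker image could look like is below; it is only an illustration. The command-line flags are hypothetical, and reading CSV on stdin / writing CSV on stdout stands in for whatever table I/O the `TO RUN` convention in #2355 and the underlying datasource (MaxCompute / Hive / MySQL) actually require:

```python
import argparse
import sys
import pandas as pd

def main():
    # Hypothetical flags; the real argument-passing convention for
    # TO RUN is defined in #2355, not here.
    parser = argparse.ArgumentParser()
    parser.add_argument("--entity_col", default="app_id")
    parser.add_argument("--time_col", default="time")
    parser.add_argument("--window_size", type=int, default=2)
    args = parser.parse_args()

    # Stand-in I/O: CSV on stdin/stdout instead of real table access.
    df = pd.read_csv(sys.stdin).sort_values([args.entity_col, args.time_col])
    value_cols = [c for c in df.columns
                  if c not in (args.entity_col, args.time_col)]
    for col in value_cols:
        for lag in range(1, args.window_size + 1):
            df[f"{col}_{lag}"] = df.groupby(args.entity_col)[col].shift(lag)
    df.to_csv(sys.stdout, index=False)

if __name__ == "__main__":
    main()
```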