[SIP-121] Proposal for Metrics Comparison based on the Temporal and/or Textual Dimension
Background
Period-over-period analysis is a BI technique that compares metrics from a recent period with the same period in the past. We already have this feature in Advanced Analytics and apply it to time-series visualizations. The feature uses `time_offsets` and `post_processing` in the `query_object` to achieve period-over-period analysis. In practice, this feature, along with the entire Advanced Analytics section, should support all visualizations, because the legacy Time Section has been removed and the Time Filter has already moved into the Adhoc Filter.
As an experiment, as long as a time dimension is provided and the corresponding time filter is populated in the Adhoc Filter, any chart can easily incorporate this feature. I've opened a toy pull request to demonstrate that only 5 frontend lines (excluding import statements) are needed to integrate this feature into the current Table chart.
Motivation
Currently, there are some limitations in `time_offsets` and Advanced Analytics:

- Theoretically, period-over-period analysis is a method of comparing different data slices based on specific dimensions within the same dataset. Therefore, comparisons should be possible not only on temporal dimensions but also on textual dimensions.
- Following from the first point, flexible filters should be available when users compare data slices, not just time filters.
- By default, `time_offsets` uses the first occurrence of a time column in `query_object.columns` as the basis for time shifting. It cannot specify which time column to use for generating the shifted time range, because there can be multiple time filters in the current Adhoc Filter.
- The actual shifted time-range values cannot be displayed in the UI, and there is currently no way to provide them.

Based on the above analysis, I propose to enhance/refactor the `time_offsets` field to achieve a more generalized period-over-period analysis.
Proposed Changes
To illustrate what needs to change, I will construct a time comparison using the current `time_offsets` field and the `cleaned_sales_data` dataset. This `query_object` will generate a time comparison based on `order_date`, comparing data slices from `1 week ago`, `1 month ago`, and `1 year ago`:
```json
{
  "columns": [{
    "timeGrain": "P1D",
    "columnType": "BASE_AXIS",
    "sqlExpression": "order_date",
    "label": "order_date",
    "expressionType": "SQL"
  }],
  "metrics": ["count"],
  "filters": [
    {
      "col": "order_date",
      "op": "TEMPORAL_RANGE",
      "val": "2004-01-01 : 2004-01-31"
    },
    {
      "col": "deal_size",
      "op": "==",
      "val": "Small"
    }
  ],
  "time_offsets": [
    "1 week ago",
    "1 month ago",
    "1 year ago"
  ]
}
```
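For intuition, the shifted time ranges implied by these `time_offsets` can be sketched in Python with pandas. This is a toy helper I wrote for illustration, not Superset's actual implementation:

```python
import pandas as pd

def shift_time_range(start, end, offset):
    """Shift a [start, end] time range into the past by a human-readable
    offset such as "1 week ago". Toy approximation of time_offsets."""
    amount, unit, _ = offset.split(" ")  # e.g. "1", "week", "ago"
    delta = pd.DateOffset(**{unit + "s": int(amount)})
    return (pd.Timestamp(start) - delta, pd.Timestamp(end) - delta)

# The three offsets from the query_object above, applied to its time filter:
for offset in ["1 week ago", "1 month ago", "1 year ago"]:
    start, end = shift_time_range("2004-01-01", "2004-01-31", offset)
    print(f"{offset}: {start.date()} : {end.date()}")
```

Each offset produces a separate query against the shifted range (e.g. `1 week ago` yields `2003-12-25 : 2004-01-24`), whose results are then joined back to the main query in `post_processing`.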
After the change, the `time_offsets` field will be replaced with a `comparisons` field:
```json
{
  "columns": [{
    "timeGrain": "P1D",
    "columnType": "BASE_AXIS",
    "sqlExpression": "order_date",
    "label": "order_date",
    "expressionType": "SQL"
  }],
  "metrics": ["count"],
  "filters": [
    {
      "col": "order_date",
      "op": "TEMPORAL_RANGE",
      "val": "2004-01-01 : 2004-01-31"
    },
    {
      "col": "deal_size",
      "op": "==",
      "val": "Small"
    }
  ],
  "comparisons": {
    "on": ["order_date"],
    "how": "left",
    "suffix": ["1 week ago", "1 month ago", "1 year ago"],
    "time_filter_offsets": [
      "1 week ago",
      "1 month ago",
      "1 year ago"
    ],
    "filters": [
      [
        {
          "col": "deal_size",
          "op": "==",
          "val": "Small"
        }
      ],
      [
        {
          "col": "deal_size",
          "op": "==",
          "val": "Small"
        }
      ],
      [
        {
          "col": "deal_size",
          "op": "==",
          "val": "Small"
        }
      ]
    ]
  }
}
```
- `on`: specifies which columns will be used to join the data slices.
- `how`: determines how the data slices will be joined; the options are `"inner"` and `"left"`, with `"left"` as the default.
- `suffix`: defines suffixes for the compared metrics.
- `time_filter_offsets`: specifies the time offset delta when performing a time comparison; for a textual dimension comparison it is ignored.
- `filters`: indicates the filters applied to each specific data slice.
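The `on`/`how`/`suffix` fields map naturally onto a pandas join, which is how the proposal describes the slices being combined. A minimal sketch with hand-made frames (the data is hypothetical; column naming here is only one plausible suffixing scheme):

```python
import pandas as pd

# Hypothetical results of the main query and one offset query,
# with columns matching the query_object above.
main = pd.DataFrame({
    "order_date": ["2004-01-01", "2004-01-02"],
    "count": [10, 12],
})
offset_1w = pd.DataFrame({
    "order_date": ["2004-01-01", "2004-01-02"],
    "count": [8, 11],
})

# on=["order_date"], how="left", suffix="1 week ago" becomes:
joined = main.merge(
    offset_1w,
    on="order_date",
    how="left",
    suffixes=("", ", 1 week ago"),  # left metric keeps its name
)
print(joined)  # columns: order_date, count, "count, 1 week ago"
```

With `how="left"`, rows present in the main slice but missing from the compared slice survive the join with `NaN` for the compared metric, which is usually what a comparison chart wants.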
The following `query_object` payload shows how to generate a textual dimension comparison:
```json
{
  "columns": [{
    "timeGrain": "P1D",
    "columnType": "BASE_AXIS",
    "sqlExpression": "order_date",
    "label": "order_date",
    "expressionType": "SQL"
  }],
  "metrics": ["count"],
  "filters": [
    {
      "col": "order_date",
      "op": "TEMPORAL_RANGE",
      "val": "2004-01-01 : 2004-01-31"
    },
    {
      "col": "deal_size",
      "op": "==",
      "val": "Small"
    }
  ],
  "comparisons": {
    "on": ["deal_size"],
    "how": "left",
    "suffix": ["Medium", "Large"],
    "filters": [
      [
        {
          "col": "order_date",
          "op": "TEMPORAL_RANGE",
          "val": "2004-01-01 : 2004-01-31"
        },
        {
          "col": "deal_size",
          "op": "==",
          "val": "Medium"
        }
      ],
      [
        {
          "col": "order_date",
          "op": "TEMPORAL_RANGE",
          "val": "2004-01-01 : 2004-01-31"
        },
        {
          "col": "deal_size",
          "op": "==",
          "val": "Large"
        }
      ]
    ]
  }
}
```
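Note that in this payload each per-slice filter list is simply the base filters with the compared value swapped in. A hypothetical helper (not part of the proposal's API) could derive the `filters` field from the base filters, the `on` column, and the `suffix` values:

```python
# Hypothetical helper: build per-slice filters for a textual dimension
# comparison by replacing the base equality filter on the compared
# column with each suffix value, keeping all other filters as-is.
def build_comparison_filters(base_filters, on_col, suffix_values):
    slices = []
    for value in suffix_values:
        slice_filters = []
        for flt in base_filters:
            if flt["col"] == on_col and flt["op"] == "==":
                slice_filters.append({"col": on_col, "op": "==", "val": value})
            else:
                slice_filters.append(dict(flt))
        slices.append(slice_filters)
    return slices

base = [
    {"col": "order_date", "op": "TEMPORAL_RANGE", "val": "2004-01-01 : 2004-01-31"},
    {"col": "deal_size", "op": "==", "val": "Small"},
]
print(build_comparison_filters(base, "deal_size", ["Medium", "Large"]))
```

This reproduces the two filter lists in the payload above: the temporal range is kept, and `deal_size` is filtered to `"Medium"` and `"Large"` respectively.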
New or Changed Public Interfaces
New dependencies
No new dependencies
Migration Plan and Compatibility
A DB migration script will be provided.
Rejected Alternatives
@zhaoyongjie this looks great, and I think it should solve what we're trying to build. Do you have an example of what the resulting sql query would look like from this api schema?
Hey @zhaoyongjie, thanks for putting this together; it looks really promising. On top of the question from @eschutho, I would like to know whether we should consider different databases' capabilities, like CTE and JOIN support, when comparing, and have a fallback as part of this new processing, to handle the cases mentioned before where some of those are not supported.
Thanks.
Also, how would datasets that use, for example, Jinja templates be compared and processed with the joins and the offsets? Should we consider cases where we simply don't support comparison and offer the aforementioned fallback methods, or simply ignore them?
Thanks.
I'll reply to this topic tonight; thanks for the mention.
> Do you have an example of what the resulting sql query would look like from this api schema?

> I would like to know if we should consider different databases capabilities like CTE and JOINS support when comparing and having a fallback as part of this new processing? Just to handle as mentioned before, some cases where some of those are not supported.
@eschutho @Antonio-RiveroMartnez The SQL will be the same as the time comparison in the previous AA panel, consisting of multiple separate SQL queries. I have also been thinking about how to design a JOIN QueryObject to achieve this use case and replace the entire pandas join. However, I am stuck on how to generate an appropriate time expression that contains a time delta, as the time delta expression varies across databases.
In practice, JOINs in Superset have been discussed many times in the community, but the original technical design doesn't seem to have considered this use case. Data modeling in Superset depends on SQL Lab exposing a "virtual dataset" to achieve it.
> However, I am stuck on how to generate an appropriate time expression that contains time delta, as the time delta expression varies in different databases.
@zhaoyongjie can you give more context on why you'd need to compute a time delta?
@betodealmeida If the different data slices were filtered in the datasource, we would have to apply the time filter + time delta in the datasource, and different databases have different time delta expressions.
This is incredibly useful and something I'm surprised isn't built into Superset's tables already. Thanks for putting this together!