[SIP-121] Proposal for Metrics Comparison based on the Temporal and/or Textual Dimension
Background
Period-over-period analysis is a BI technique that compares metrics from a recent period with the same period in the past. We already have this feature in Advanced Analytics and apply it to time-series visualizations. The feature uses `time_offsets` and `post_processing` in the `query_object` to achieve period-over-period analysis. In practice, this feature, along with the entire Advanced Analytics section, should support all visualizations, because the legacy Time Section has been removed and the Time Filter has already moved into the Adhoc Filter.
As an experiment, as long as a time dimension is provided and the corresponding time filter is populated in the Adhoc Filter, any chart can easily incorporate this feature. I've opened a toy pull request to demonstrate that only 5 frontend lines (excluding import statements) are needed to integrate this feature into the current Table chart.
Motivation
Currently, there are some limitations in `time_offsets` and Advanced Analytics:

- Theoretically, period-over-period analysis is a method of comparing different data slices based on specific dimensions within the same dataset. Therefore, comparisons should be possible not only on temporal dimensions but also on textual dimensions.
- Following from the first point, flexible filters should be available when users compare data slices, not just time filters.
- By default, `time_offsets` uses the first occurrence of a time column in `query_object.columns` as the basis for time shifting. It cannot specify which time column to use for generating the shifted time range, because there can be multiple time filters in the current Adhoc Filter.
- The actual shifted time-range values cannot be displayed in the UI, and there is currently no way to provide them.

Based on the above analysis, I propose to enhance/refactor the `time_offsets` field to achieve a more generalized period-over-period analysis.
Proposed Changes
To illustrate what needs to change, I will construct a time comparison using the current `time_offsets` field and the `cleaned_sales_data` dataset. This `query_object` will generate a time comparison based on `order_date`, comparing data slices from `1 week ago`, `1 month ago`, and `1 year ago`:
```json
{
  "columns": [{
    "timeGrain": "P1D",
    "columnType": "BASE_AXIS",
    "sqlExpression": "order_date",
    "label": "order_date",
    "expressionType": "SQL"
  }],
  "metrics": ["count"],
  "filters": [
    {
      "col": "order_date",
      "op": "TEMPORAL_RANGE",
      "val": "2004-01-01 : 2004-01-31"
    },
    {
      "col": "deal_size",
      "op": "==",
      "val": "Small"
    }
  ],
  "time_offsets": [
    "1 week ago",
    "1 month ago",
    "1 year ago"
  ]
}
```
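For intuition, the shifted time ranges implied by these `time_offsets` can be sketched in Python with pandas. This is a toy helper I wrote for illustration, not Superset's actual implementation:

```python
import pandas as pd

def shift_time_range(start, end, offset):
    """Shift a [start, end] time range into the past by a human-readable
    offset such as "1 week ago". Toy approximation of time_offsets."""
    amount, unit, _ = offset.split(" ")  # e.g. "1", "week", "ago"
    delta = pd.DateOffset(**{unit + "s": int(amount)})
    return (pd.Timestamp(start) - delta, pd.Timestamp(end) - delta)

# The three offsets from the query_object above, applied to its time filter:
for offset in ["1 week ago", "1 month ago", "1 year ago"]:
    start, end = shift_time_range("2004-01-01", "2004-01-31", offset)
    print(f"{offset}: {start.date()} : {end.date()}")
```

Each offset produces a separate query against the shifted range (e.g. `1 week ago` yields `2003-12-25 : 2004-01-24`), whose results are then joined back to the main query in `post_processing`.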
After the change, the `time_offsets` field will be replaced with a `comparisons` field:
```json
{
  "columns": [{
    "timeGrain": "P1D",
    "columnType": "BASE_AXIS",
    "sqlExpression": "order_date",
    "label": "order_date",
    "expressionType": "SQL"
  }],
  "metrics": ["count"],
  "filters": [
    {
      "col": "order_date",
      "op": "TEMPORAL_RANGE",
      "val": "2004-01-01 : 2004-01-31"
    },
    {
      "col": "deal_size",
      "op": "==",
      "val": "Small"
    }
  ],
  "comparisons": {
    "on": ["order_date"],
    "how": "left",
    "suffix": ["1 week ago", "1 month ago", "1 year ago"],
    "time_filter_offsets": [
      "1 week ago",
      "1 month ago",
      "1 year ago"
    ],
    "filters": [
      [
        {
          "col": "deal_size",
          "op": "==",
          "val": "Small"
        }
      ],
      [
        {
          "col": "deal_size",
          "op": "==",
          "val": "Small"
        }
      ],
      [
        {
          "col": "deal_size",
          "op": "==",
          "val": "Small"
        }
      ]
    ]
  }
}
```
- `on`: specifies which columns will be used to join the data slices.
- `how`: determines how the data slices will be joined; the options are `"inner"` and `"left"`, with `"left"` as the default.
- `suffix`: defines suffixes for the compared metrics.
- `time_filter_offsets`: specifies the time offset delta when performing a time comparison; for a textual dimension comparison it is ignored.
- `filters`: indicates the filters applied to each specific data slice.
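The `on`/`how`/`suffix` fields map naturally onto a pandas join, which is how the proposal describes the slices being combined. A minimal sketch with hand-made frames (the data is hypothetical; column naming here is only one plausible suffixing scheme):

```python
import pandas as pd

# Hypothetical results of the main query and one offset query,
# with columns matching the query_object above.
main = pd.DataFrame({
    "order_date": ["2004-01-01", "2004-01-02"],
    "count": [10, 12],
})
offset_1w = pd.DataFrame({
    "order_date": ["2004-01-01", "2004-01-02"],
    "count": [8, 11],
})

# on=["order_date"], how="left", suffix="1 week ago" becomes:
joined = main.merge(
    offset_1w,
    on="order_date",
    how="left",
    suffixes=("", ", 1 week ago"),  # left metric keeps its name
)
print(joined)  # columns: order_date, count, "count, 1 week ago"
```

With `how="left"`, rows present in the main slice but missing from the compared slice survive the join with `NaN` for the compared metric, which is usually what a comparison chart wants.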
The following `query_object` payload shows how to generate a textual dimension comparison:
```json
{
  "columns": [{
    "timeGrain": "P1D",
    "columnType": "BASE_AXIS",
    "sqlExpression": "order_date",
    "label": "order_date",
    "expressionType": "SQL"
  }],
  "metrics": ["count"],
  "filters": [
    {
      "col": "order_date",
      "op": "TEMPORAL_RANGE",
      "val": "2004-01-01 : 2004-01-31"
    },
    {
      "col": "deal_size",
      "op": "==",
      "val": "Small"
    }
  ],
  "comparisons": {
    "on": ["deal_size"],
    "how": "left",
    "suffix": ["Medium", "Large"],
    "filters": [
      [
        {
          "col": "order_date",
          "op": "TEMPORAL_RANGE",
          "val": "2004-01-01 : 2004-01-31"
        },
        {
          "col": "deal_size",
          "op": "==",
          "val": "Medium"
        }
      ],
      [
        {
          "col": "order_date",
          "op": "TEMPORAL_RANGE",
          "val": "2004-01-01 : 2004-01-31"
        },
        {
          "col": "deal_size",
          "op": "==",
          "val": "Large"
        }
      ]
    ]
  }
}
```
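Note that in this payload each per-slice filter list is simply the base filters with the compared value swapped in. A hypothetical helper (not part of the proposal's API) could derive the `filters` field from the base filters, the `on` column, and the `suffix` values:

```python
# Hypothetical helper: build per-slice filters for a textual dimension
# comparison by replacing the base equality filter on the compared
# column with each suffix value, keeping all other filters as-is.
def build_comparison_filters(base_filters, on_col, suffix_values):
    slices = []
    for value in suffix_values:
        slice_filters = []
        for flt in base_filters:
            if flt["col"] == on_col and flt["op"] == "==":
                slice_filters.append({"col": on_col, "op": "==", "val": value})
            else:
                slice_filters.append(dict(flt))
        slices.append(slice_filters)
    return slices

base = [
    {"col": "order_date", "op": "TEMPORAL_RANGE", "val": "2004-01-01 : 2004-01-31"},
    {"col": "deal_size", "op": "==", "val": "Small"},
]
print(build_comparison_filters(base, "deal_size", ["Medium", "Large"]))
```

This reproduces the two filter lists in the payload above: the temporal range is kept, and `deal_size` is filtered to `"Medium"` and `"Large"` respectively.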
New or Changed Public Interfaces
New dependencies
No new dependencies
Migration Plan and Compatibility
A DB migration script will be provided.
Rejected Alternatives
@zhaoyongjie this looks great, and I think it should solve what we're trying to build. Do you have an example of what the resulting sql query would look like from this api schema?
Hey @zhaoyongjie, thanks for putting this together; it looks really promising. On top of the question from @eschutho, I would like to know whether we should consider different databases' capabilities, like CTE and JOIN support, when comparing, and have a fallback as part of this new processing, to handle the cases mentioned before where some of those are not supported.
Thanks.
Also, how would datasets that use, for example, Jinja templates be compared and processed with the joins and the offsets? Should we consider cases where we simply don't support comparison and offer the aforementioned fallback methods, or simply ignore them?
Thanks.
I'll reply to this topic tonight; thanks for the mention.
> Do you have an example of what the resulting sql query would look like from this api schema?

> I would like to know if we should consider different databases capabilities like CTE and JOINS support when comparing and having a fallback as part of this new processing? Just to handle as mentioned before, some cases where some of those are not supported.
@eschutho @Antonio-RiveroMartnez The SQL will be the same as the time comparison in the previous AA panel, consisting of multiple separate SQL queries. I have also been thinking about how to design a JOIN QueryObject to achieve this use case and replace the entire pandas join. However, I am stuck on how to generate an appropriate time expression that contains a time delta, as the time delta expression varies across databases.
In practice, JOINs in Superset have been discussed many times in the community, but the original technical design doesn't seem to have considered this use case. Data modeling in Superset depends on SQL Lab exposing a "virtual dataset" to achieve it.
> However, I am stuck on how to generate an appropriate time expression that contains time delta, as the time delta expression varies in different databases.
@zhaoyongjie can you give more context on why you'd need to compute a time delta?
@betodealmeida If the different data slices were filtered in the datasource, we would have to apply the time filter + time delta in the datasource, and different databases have different time delta expressions.
This is incredibly useful and something I'm surprised isn't built into Superset's tables already. Thanks for putting this together!