metricflow icon indicating copy to clipboard operation
metricflow copied to clipboard

[SL-1756] [SL-1755] [Feature] Use CTEs instead of sub queries in generated SQL

Open Jstein77 opened this issue 1 year ago • 1 comments

Is this your first time submitting a feature request?

  • [X] I have read the expectations for open source contributors
  • [X] I have searched the existing issues, and I could not find an existing issue for this feature
  • [X] I am requesting a straightforward extension of existing metricflow functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Currently, we render components of the data flow plan as sub-queries. For example, to query a simple metric called revenue, we'll first create an inner query to select the measure and dimension columns, then create an outer query to perform the metric aggregation.

  SELECT
    metric_time__day
    , SUM(revenue) AS revenue
  FROM (
    SELECT
      DATE_TRUNC('day', cast(ordered_at as DATETIME)) AS metric_time__day
      , product_price AS revenue
    FROM ANALYTICS.dbt_jstein.order_items order_item_src_78
  ) subq_119
  GROUP BY
    metric_time__day

This SQL can get more complex when additional joins are needed - for example, to include a dimension, when multiple metrics are requested in a query, or when constructing derived metrics. Issue #422 has an example of a derived metric query with additional layers of nesting. We could instead use CTEs in the SQL to calculate revenue:

with subq_119 as (
     SELECT
      DATE_TRUNC('day', cast(ordered_at as DATETIME)) AS metric_time__day
      , product_price AS revenue
    FROM ANALYTICS.dbt_jstein.order_items order_item_src_78
  ) subq_119

SELECT
  metric_time__day
  , COALESCE(revenue, 0) AS revenue
FROM subq_119
  GROUP BY
    metric_time__day

This provides a few key benefits.

  1. Increases legibility of the SQL generated, especially for complex cases
  2. Can improve the efficiency of the query for derived metrics or queries with multiple metrics. For example, in Issue #422, we could render the following SQL which would avoid multiple table scans of order_items:
(
with base_metrics as (
  SELECT
    DATE_TRUNC('month', ordered_at) AS metric_time__month
    , location_name
    , SUM(order_item_total) AS revenue
    , SUM(product_cost) AS costs
    , SUM(order_item_tax_paid) AS taxes
  FROM dbt_test.dbt_metrics.order_items order_items_src_0
  GROUP BY
    DATE_TRUNC('month', ordered_at)
    , location_name
)

select
 revenue
, costs
, taxes
revenue - (costs + taxes) AS net_profit
from base_metrics

Risks:

  • We would need to test how the query optimizers on different data platforms handle CTEs. There could be cases where we see query performance decline. However, in the case of derived metrics and multiple metrics queries, we should see performance gains.

Describe alternatives you've considered

Keep the current behavior. This comes with a real performance hit for companies querying multiple metrics and derived metrics. We will also be generating less legible SQL.

Who will this benefit?

This will make it easier for analytics engineers to reason about the SQL we're generating, and improve performance for complex queries.

Are you interested in contributing this feature?

No response

Anything else?

No response

SL-1755

SL-1756

Jstein77 avatar Feb 21 '24 22:02 Jstein77