metricflow
metricflow copied to clipboard
[SL-1756] [SL-1755] [Feature] Use CTEs instead of sub queries in generated SQL
Is this your first time submitting a feature request?
- [X] I have read the expectations for open source contributors
- [X] I have searched the existing issues, and I could not find an existing issue for this feature
- [X] I am requesting a straightforward extension of existing metricflow functionality, rather than a Big Idea better suited to a discussion
Describe the feature
Currently, we render components of the data flow plan as sub-queries. For example, to query a simple metric called revenue
, we'll first create an inner query to select the measure and dimension columns, then create an outer query to perform the metric aggregation.
SELECT
metric_time__day
, SUM(revenue) AS revenue
FROM (
SELECT
DATE_TRUNC('day', cast(ordered_at as DATETIME)) AS metric_time__day
, product_price AS revenue
FROM ANALYTICS.dbt_jstein.order_items order_item_src_78
) subq_119
GROUP BY
metric_time__day
This SQL can get more complex when additional joins are needed - for example, to include a dimension, when multiple metrics are requested in a query, or when constructing derived metrics. Issue #422 has an example of a derived metric query with additional layers of nesting. We could instead use CTEs in the SQL to calculate revenue
:
with subq_119 as (
SELECT
DATE_TRUNC('day', cast(ordered_at as DATETIME)) AS metric_time__day
, product_price AS revenue
FROM ANALYTICS.dbt_jstein.order_items order_item_src_78
) subq_119
SELECT
metric_time__day
, COALESCE(revenue, 0) AS revenue
FROM subq_119
GROUP BY
metric_time__day
This provides a few key benefits.
- Increases legibility of the SQL generated, especially for complex cases
- Can improve the efficiency of the query for derived metrics or queries with multiple metrics. For example, in Issue #422, we could render the following SQL which would avoid multiple table scans of
order_items
:
(
with base_metrics as (
SELECT
DATE_TRUNC('month', ordered_at) AS metric_time__month
, location_name
, SUM(order_item_total) AS revenue
, SUM(product_cost) AS costs
, SUM(order_item_tax_paid) AS taxes
FROM dbt_test.dbt_metrics.order_items order_items_src_0
GROUP BY
DATE_TRUNC('month', ordered_at)
, location_name
)
select
revenue
, costs
, taxes
revenue - (costs + taxes) AS net_profit
from base_metrics
Risks:
- We would need to test how the query optimizers on different data platforms handle CTEs. There could be cases where we see query performance decline. However, in the case of derived metrics and multiple metrics queries, we should see performance gains.
Describe alternatives you've considered
Keep the current behavior. This comes with a real performance hit for companies querying multiple metrics and derived metrics. We will also be generating less legible SQL.
Who will this benefit?
This will make it easier for analytics engineers to reason about the SQL we're generating, and improve performance for complex queries.
Are you interested in contributing this feature?
No response
Anything else?
No response