malloy
malloy copied to clipboard
Cannot use join from other source in on statement
Repro:
source: doubles is duckdb.sql("""
SELECT base."generate_series" * 2 as n
FROM generate_series(1, 10) as base
""")
source: nums is duckdb.sql("""
SELECT base."generate_series" as n
FROM generate_series(1, 10) as base
""") extend {
join_one: double is doubles on double.n = n * 2
}
source: nums_with_copy is duckdb.sql("""
SELECT base."generate_series" as n
FROM generate_series(1, 10) as base
""") extend {
join_one: copy is nums on n = copy.double.n / 2
}
run: nums_with_copy -> { select: copy.n }
In this code, we create a nums
source which joins to doubles
, then attempt to create a nums_with_copy
which joins in nums
using nums.double
in the on
statement. When attempting to use the resulting copy
join in a query, we don't generate SQL for the double
join, and therefore the on
statement cannot execute.
SELECT
copy_0."n" as "n"
FROM (
SELECT base."generate_series" as n
FROM generate_series(1, 10) as base
) as nums_with_copy
LEFT JOIN (
SELECT base."generate_series" as n
FROM generate_series(1, 10) as base
) AS copy_0
ON nums_with_copy."n"=double_0."n"*1.0/2
Binder Error: Referenced table "double_0" not found!
Candidate tables: "copy_0"
LINE 11: ON nums_with_copy."n"=double_0."n"*1.0/2
I'm not exactly sure what SQL we should be generating, though...
I think this SQL works:
Though this is probably less performant...
SELECT
copy_0."n" as "n"
FROM (
SELECT base."generate_series" as n
FROM generate_series(1, 10) as base
) as nums_with_copy
LEFT JOIN (
SELECT base."generate_series" as n
FROM generate_series(1, 10) as base
) AS copy_0
ON true
LEFT JOIN (
SELECT base."generate_series" * 2 as n
FROM generate_series(1, 10) as base
) AS double_0
ON double_0."n"=(copy_0."n"*2)
WHERE nums_with_copy."n"=double_0."n"*1.0/2
I think the fix might be:
- make sure to check the
on
statement for joins, and generate those joins - if a join is "two levels deep" or more (like in this case,
double_0
is not joined from the original query, but joined in the join clause ofcopy_0
), remove theon
statement of that join and add it as aWHERE
clause later.
I think I hit the same issue earlier when attempting to build a demo notebook. My exact case is slightly different but I think the root cause might be the same. Branch is here.
My model is as follows:
import 'ga4.malloy'
source: latest is events -> {
aggregate: latest_day is max(event_timestamp.day)
}
source: latest_events is events extend {
join_one: latest
where: event_timestamp.day = latest.latest_day
}
source: events_2 is events extend {
join_one: latest_events on user_pseudo_id = latest_events.user_pseudo_id
}
run: events_2 -> {
group_by: latest_events.platform
aggregate: c is count()
}
My source events_2
is joining to latest_events
, and latest_events
is filtered on a field from a third source called latest
. When I attempt to group on a field from latest_events
, the join to the latest
source never gets generated in the SQL. The SQL looks like this (on DuckDB):
SELECT
latest_events_0.platform as `platform`,
COUNT( 1) as `c`
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*` as events_2
LEFT JOIN `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*` AS latest_events_0
ON events_2.user_pseudo_id=latest_events_0.user_pseudo_id AND (TIMESTAMP_TRUNC((timestamp_micros(latest_events_0.event_timestamp)),day)=(latest_0.latest_day))
GROUP BY 1
ORDER BY 2 desc
The final clause in the LEFT JOIN is (TIMESTAMP_TRUNC((timestamp_micros(latest_events_0.event_timestamp)),day)=(latest_0.latest_day)
, but latest_0
is nowhere to be found.