malloy icon indicating copy to clipboard operation
malloy copied to clipboard

Cannot use join from other source in on statement

Open christopherswenson opened this issue 1 year ago • 3 comments

Repro:

source: doubles is duckdb.sql("""
  SELECT base."generate_series" * 2 as n
  FROM generate_series(1, 10) as base
""")

source: nums is duckdb.sql("""
  SELECT base."generate_series" as n
  FROM generate_series(1, 10) as base
""") extend {
  join_one: double is doubles on double.n = n * 2
}

source: nums_with_copy is duckdb.sql("""
  SELECT base."generate_series" as n
  FROM generate_series(1, 10) as base
""") extend {
  join_one: copy is nums on n = copy.double.n / 2
} 

run: nums_with_copy -> { select: copy.n }

In this code, we create a nums source which joins to doubles, then attempt to create a nums_with_copy which joins in nums using nums.double in the on statement. When attempting to use the resulting copy join in a query, we don't generate SQL for the double join, and therefore the on statement cannot execute.

SELECT 
   copy_0."n" as "n"
FROM (
  SELECT base."generate_series" as n
  FROM generate_series(1, 10) as base
) as nums_with_copy
LEFT JOIN (
  SELECT base."generate_series" as n
  FROM generate_series(1, 10) as base
) AS copy_0
  ON nums_with_copy."n"=double_0."n"*1.0/2
Binder Error: Referenced table "double_0" not found!
Candidate tables: "copy_0"
LINE 11:   ON nums_with_copy."n"=double_0."n"*1.0/2

I'm not exactly sure what SQL we should be generating, though...

christopherswenson avatar Nov 09 '23 15:11 christopherswenson

I think this SQL works:

Though this is probably less performant...

SELECT 
  copy_0."n" as "n"
FROM (
  SELECT base."generate_series" as n
  FROM generate_series(1, 10) as base
) as nums_with_copy
LEFT JOIN (
  SELECT base."generate_series" as n
  FROM generate_series(1, 10) as base
) AS copy_0
  ON true
LEFT JOIN (
  SELECT base."generate_series" * 2 as n
  FROM generate_series(1, 10) as base
) AS double_0
  ON double_0."n"=(copy_0."n"*2)
WHERE nums_with_copy."n"=double_0."n"*1.0/2

christopherswenson avatar Nov 09 '23 15:11 christopherswenson

I think the fix might be:

  • make sure to check the on statement for joins, and generate those joins
  • if a join is "two levels deep" or more (like in this case, double_0 is not joined from the original query, but joined in the join clause of copy_0), remove the on statement of that join and add it as a WHERE clause later.

christopherswenson avatar Nov 09 '23 16:11 christopherswenson

I think I hit the same issue earlier when attempting to build a demo notebook. My exact case is slightly different but I think the root cause might be the same. Branch is here.

My model is as follows:

import 'ga4.malloy'

source: latest is events -> {
  aggregate: latest_day is max(event_timestamp.day)
}

source: latest_events is events extend {
  join_one: latest
  where: event_timestamp.day = latest.latest_day
}

source: events_2 is events extend {
  join_one: latest_events on user_pseudo_id = latest_events.user_pseudo_id
}

run: events_2 -> {
  group_by: latest_events.platform
  aggregate: c is count()
}

My source events_2 is joining to latest_events, and latest_events is filtered on a field from a third source called latest. When I attempt to group on a field from latest_events, the join to the latest source never gets generated in the SQL. The SQL looks like this (on DuckDB):

SELECT 
   latest_events_0.platform as `platform`,
   COUNT( 1) as `c`
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*` as events_2
LEFT JOIN `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*` AS latest_events_0
  ON events_2.user_pseudo_id=latest_events_0.user_pseudo_id AND (TIMESTAMP_TRUNC((timestamp_micros(latest_events_0.event_timestamp)),day)=(latest_0.latest_day))
GROUP BY 1
ORDER BY 2 desc

The final clause in the LEFT JOIN is (TIMESTAMP_TRUNC((timestamp_micros(latest_events_0.event_timestamp)),day)=(latest_0.latest_day), but latest_0 is nowhere to be found.

carlineng avatar Nov 10 '23 02:11 carlineng