trino icon indicating copy to clipboard operation
trino copied to clipboard

Push join into TableScan through Project

Open SemionPar opened this issue 10 months ago • 7 comments

Description

Add a ruleset to push join into TableScan through Project.

Draft todos:

  • [x] decide whether to merge with PushJoinIntoTableScan - extend PushJoinIntoTableScan with new behavior
  • [x] Result correctness - PostgreSQL reference methods: account for collation - UPPER/LOWER COLLATE C as well as other standard collations are not producing results consisting with Trino:

PostgreSQL:

SELECT UPPER('ślad');

C_Collation        PL_Collation
--------------------------------------
śLAD                  ŚLAD

Trino:

SELECT UPPER('ślad');

--------------------------------------
ŚLAD

Therefore the test use REVERSE as a reference function call which does not exhibit inconsistent behavior.

Additional context and related issues

Why push though projection is needed? To enable pushdown of joins with function calls in join conditions (UPPER/LOWER/CASTS etc.)

Why projection is not consumed by connector? It is currently unsupported for JDBC connectors. There was an attempt https://github.com/trinodb/trino/pull/19740, but it didn't make it due to reasons (edge cases, fact that it needs JDBC type to be derived from Trino type when mapping).

What if there is filter below projection? This rule will not work.

Release notes

( ) This is not user-visible or is docs only, and no release notes are required. ( ) Release notes are required. Please propose a release note for me. ( ) Release notes are required, with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

SemionPar avatar Apr 18 '24 09:04 SemionPar

Draft todos: PostgreSQL reference methods: account for collation

Another function can be selected for tests which does not imply such complications.

ssheikin avatar May 10 '24 17:05 ssheikin

@lukasz-stec @raunaqmorarka please review.

ssheikin avatar May 13 '24 06:05 ssheikin

Can you elaborate on this point?

It is currently unsupported for JDBC connectors. There was an attempt https://github.com/trinodb/trino/pull/19740, but it didn't make it due to reasons (edge cases, fact that it needs JDBC type to be derived from Trino type when mapping).

What are the edge cases? What's the issue with deriving a JDBC type?

martint avatar May 16 '24 15:05 martint

Wrt to type derivation we have two issues

  • In some JDBC connector like PostgreSQL - we try to map unknown column type or special column type like money as varchar - so in this case we should somehow restrict the expression rewrite - I think it can be fixed by avoid rewriting the variables which are synthetically mapped.
  • In case of projection pushdown - we need to ensure that type returned by the native PG function matches with Trino's type to ensure the subsequent filter operation is accurate

Praveen2112 avatar May 16 '24 15:05 Praveen2112

Instead of a dedicated rule to pull a projection above join before pushing down the join into the connector, we should consider adding a generic rule for doing that optimization regardless of where it occurs in the plan.

martint avatar May 16 '24 16:05 martint

But pushing join before projection as a generic rule might affect the overall plan right ? What if its a cross join or if its estimate result in more number of rows than probe or build side ?

Praveen2112 avatar May 16 '24 17:05 Praveen2112

We could make it a cost-aware decision. In practice, joins that explode data (in particular, cross joins that produce significant amount of data) are not that common, though.

martint avatar May 16 '24 17:05 martint

Few thoughts on idea proposed by @martint:

Emphasis: regardless of where it occurs in the plan

Instead of PushJoinIntoTableScan we could have PullProjectionAboveNode (not an established name, existing code use PushXXXIntoYYY convention). What types of Node would qualify for this operation? In our concrete Join case we have this incoming fragment:

Join
 \
  | Project
    \
     | TableScan
  | TableScan

Should we consider only pulling Project up when directly over TableScan:

TOP_XXX
 \
  | Project
    \
     | TableScan

Or in any position?

TOP_XXX
 \
  | Project
    \
     | BOTTOM_YYY
       \
        | TableScan

Emphasis: before pushing down the join into the connector

This generic rule would not attempt to call Metadata.applyJoin (pushdown attempt), but just pull Poject above Join, if possible? But how can Project node be pulled up without pushing the Join? Consider this case (one of tested queries) - project node sole purpose here is to transform left side of join criteria, so it must be performed before joining data:

SELECT c.custkey, o.totalprice FROM customer c JOIN orders o ON REVERSE(c.phone) = o.clerk;
Output[columnNames = [custkey, totalprice]]
│   Layout: [custkey:bigint, totalprice:double]
└─ InnerJoin[criteria = (expr = clerk), distribution = PARTITIONED]
   │   Layout: [custkey:bigint, totalprice:double]
   │   Distribution: PARTITIONED
   ├─ ScanProject[table = postgresql:tpch.customer tpch.customer columns=[custkey:bigint:int8, phone:varchar(15):varchar]]
   │      Layout: [custkey:bigint, expr:varchar(15)]
   │      expr := reverse(phone)
   │      custkey := custkey:bigint:int8
   │      phone := phone:varchar(15):varchar
   └─ TableScan[table = postgresql:tpch.orders tpch.orders columns=[totalprice:double:float8, clerk:varchar(15):varchar]]
          Layout: [totalprice:double, clerk:varchar(15)]
          clerk := clerk:varchar(15):varchar
          totalprice := totalprice:double:float8

Is my reasoning correct here?

SemionPar avatar Jun 12 '24 12:06 SemionPar

we could have PullProjectionAboveNode

Rather, it would be PushXXXBelowProject, and it should generally be done:

  • for any operation that reduces the amount of data that the projection would evaluate
  • for projections that don't prune the number of input columns

But how can Project node be pulled up without pushing the Join

It can't. But that's a case where pushing down the join into the connector first requires pushing down the projection. That should be handled normally by an invocation of applyProject() followed by applyJoin().

martint avatar Jul 01 '24 19:07 martint

The problem I was trying to solve in this PR (pushing Join through Project into TableScan to enable join pushdown with function calls in the join condition) is solved by https://github.com/trinodb/trino/pull/22203 (amazing work @Praveen2112!). With this new projection pushdown framework, all the test cases I added for PostgreSQL joins with REVERSE are fully pushed down in the intended way (PushProjectionIntoTableScan -> PushJoinIntoTableScan).

SemionPar avatar Jul 04 '24 10:07 SemionPar

Does it still makes sense to continue with PushXXXBelowProject approach, does it bring value otherwise?

Rather, it would be PushXXXBelowProject, and it should generally be done:

  • for any operation that reduces the amount of data that the projection would evaluate
  • for projections that don't prune the number of input columns.

These points seem relevant. I can repurpose this PR or create a new one, I would appreciate some help with crafting a new problem statement though.

SemionPar avatar Jul 04 '24 11:07 SemionPar

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

github-actions[bot] avatar Jul 25 '24 17:07 github-actions[bot]

Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.

github-actions[bot] avatar Aug 16 '24 17:08 github-actions[bot]