
[spark] PaimonSplitScan supports column pruning and filter push down

Open ulysses-you opened this issue 1 year ago • 1 comments

Purpose

PaimonSplitScan is built for the internal scans used by UPDATE, DELETE, and MERGE INTO. It is used to generate deletion vectors, collect touched files, etc. Its main use is to select a few metadata columns of the target table, e.g., row index and file path. That is, it generally does not need to load the data columns.

This PR makes PaimonSplitScan support column pruning and filter push down to improve performance:

  1. Introduce KnownSplitsTable, a ReadonlyTable that holds a set of known data splits.
  2. Introduce PaimonSplitScanBuilder, which is used when the table is a KnownSplitsTable and builds a PaimonSplitScan.
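
The idea behind the two pieces can be illustrated with a small toy model. This is only a sketch in plain Python, not Paimon or Spark code: `Split`, `KnownSplits`, `SplitScanBuilder`, and their methods are hypothetical names invented for illustration, and the real classes may be structured differently.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence

Row = Dict[str, object]


@dataclass(frozen=True)
class Split:
    """Hypothetical stand-in for a data split: one file and its rows."""
    file_path: str
    rows: Sequence[Row]


@dataclass(frozen=True)
class KnownSplits:
    """Toy analogue of a read-only table wrapping already-planned splits."""
    name: str
    splits: Sequence[Split]


class SplitScanBuilder:
    """Toy analogue of a scan builder that records pruning and filters."""

    def __init__(self, table: KnownSplits) -> None:
        self._table = table
        self._columns: Optional[List[str]] = None
        self._predicate: Optional[Callable[[Row], bool]] = None

    def prune_columns(self, columns: List[str]) -> "SplitScanBuilder":
        # Remember which columns the caller actually needs.
        self._columns = list(columns)
        return self

    def push_filter(self, predicate: Callable[[Row], bool]) -> "SplitScanBuilder":
        # Remember a predicate to evaluate while reading.
        self._predicate = predicate
        return self

    def scan(self) -> List[Row]:
        out: List[Row] = []
        for split in self._table.splits:
            for row in split.rows:
                if self._predicate is not None and not self._predicate(row):
                    continue  # filtered out at the source
                # Attach the metadata column, then keep only requested columns.
                full = dict(row, __paimon_file_path=split.file_path)
                if self._columns is None:
                    out.append(full)
                else:
                    out.append({c: full[c] for c in self._columns})
        return out


# Made-up data: two files, only the first contains a row with c2 = 'a'.
table = KnownSplits("test", [
    Split("f1", [{"c1": 1, "c2": "a", "c3": 0, "c4": 0},
                 {"c1": 2, "c2": "b", "c3": 0, "c4": 0}]),
    Split("f2", [{"c1": 3, "c2": "b", "c3": 0, "c4": 0}]),
])
rows = (SplitScanBuilder(table)
        .prune_columns(["c2", "__paimon_file_path"])
        .push_filter(lambda r: r["c2"] == "a")
        .scan())
touched_files = sorted({r["__paimon_file_path"] for r in rows})
```

In this toy model the scan returns only `c2` and the file path, and only for rows matching the predicate, which is the kind of saving the PR is after when collecting touched files.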

For example:

update test set c1 = 9 where c2 = 'a';

before:

(1) BatchScan default.test
Output [5]: [c1#197, c2#198, c3#199, c4#200, __paimon_file_path#205]
class org.apache.paimon.spark.PaimonSplitScan

(2) Filter [codegen id : 1]
Input [5]: [c1#197, c2#198, c3#199, c4#200, __paimon_file_path#205]
Condition : (c2#198 = a)

(3) Project [codegen id : 1]
Output [1]: [__paimon_file_path#205]
Input [5]: [c1#197, c2#198, c3#199, c4#200, __paimon_file_path#205]

after:

(1) BatchScan default.test
Output [2]: [c2#137, __paimon_file_path#144]
PaimonSplitScan: [test], PushedFilters: [Equal(c2, a)]

(2) Filter [codegen id : 1]
Input [2]: [c2#137, __paimon_file_path#144]
Condition : (c2#137 = a)

(3) Project [codegen id : 1]
Output [1]: [__paimon_file_path#144]
Input [2]: [c2#137, __paimon_file_path#144]
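
A plan like the "after" one could be checked by inspecting the explain text. The following is a minimal sketch that assumes the plan text has already been captured as a string. The helper names are hypothetical and are not part of Paimon or Spark, and the parsing only targets the line shapes shown in this description.

```python
import re
from typing import List

# Scan node copied from the "after" plan in this description.
AFTER_PLAN = """\
(1) BatchScan default.test
Output [2]: [c2#137, __paimon_file_path#144]
PaimonSplitScan: [test], PushedFilters: [Equal(c2, a)]
"""


def scan_output_columns(plan: str) -> List[str]:
    """Return column names from the first 'Output [n]: [...]' line, without #ids."""
    match = re.search(r"^Output \[\d+\]: \[(.*)\]$", plan, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no Output line found in plan text")
    return [re.sub(r"#\d+$", "", col.strip()) for col in match.group(1).split(",")]


def pushed_filters(plan: str) -> str:
    """Return the text inside 'PushedFilters: [...]', or '' if absent."""
    match = re.search(r"PushedFilters: \[(.*)\]", plan)
    return match.group(1) if match else ""


columns = scan_output_columns(AFTER_PLAN)
filters = pushed_filters(AFTER_PLAN)
```

A check along these lines would assert that the scan outputs only the filter column plus the file path, and that the equality predicate appears among the pushed filters. Note that the Filter node is still present after the scan in the "after" plan, so the pushed predicate is evaluated again by Spark.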

Tests

Pass CI

API and Format

No

Documentation

ulysses-you, Sep 19 '24 07:09

cc @JingsongLi @YannByron thank you

ulysses-you, Sep 19 '24 08:09

@ulysses-you Can you add a test to verify the plan?

JingsongLi, Sep 24 '24 10:09