paimon
[spark] PaimonSplitScan supports column pruning and filter push down
Purpose
PaimonSplitScan is built for the internal scans used by update/delete/merge into. It is used to generate deletion vectors, collect touched files, etc. Its main usage is to select some metadata columns based on the target table, e.g., row index and file path. That is, it does not need to load all data columns.
This PR makes PaimonSplitScan support column pruning and filter push down to improve performance:
- introduce KnownSplitsTable, a ReadonlyTable that holds some known data splits
- introduce PaimonSplitScanBuilder, which is used when the table is a KnownSplitsTable and builds a PaimonSplitScan
For example:
update test set c1 = 9 where c2 = 'a';
before:
(1) BatchScan default.test
Output [5]: [c1#197, c2#198, c3#199, c4#200, __paimon_file_path#205]
class org.apache.paimon.spark.PaimonSplitScan
(2) Filter [codegen id : 1]
Input [5]: [c1#197, c2#198, c3#199, c4#200, __paimon_file_path#205]
Condition : (c2#198 = a)
(3) Project [codegen id : 1]
Output [1]: [__paimon_file_path#205]
Input [5]: [c1#197, c2#198, c3#199, c4#200, __paimon_file_path#205]
after:
(1) BatchScan default.test
Output [2]: [c2#137, __paimon_file_path#144]
PaimonSplitScan: [test], PushedFilters: [Equal(c2, a)]
(2) Filter [codegen id : 1]
Input [2]: [c2#137, __paimon_file_path#144]
Condition : (c2#137 = a)
(3) Project [codegen id : 1]
Output [1]: [__paimon_file_path#144]
Input [2]: [c2#137, __paimon_file_path#144]
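The effect visible in the two plans can be illustrated with a small toy model. This is only a sketch, not the Paimon or Spark implementation: `Split`, `Equal`, `SplitScan`, `SplitScanBuilder`, and the file names are hypothetical stand-ins, and the toy scan applies the pushed filter exactly, whereas in the real plan Spark still keeps its own Filter node after the scan.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

FILE_PATH = "__paimon_file_path"  # metadata column name taken from the plans above


@dataclass
class Split:
    """Hypothetical stand-in for one data split: a file and the rows it holds."""
    file_path: str
    rows: List[Dict[str, Any]]


@dataclass
class Equal:
    """Toy equality predicate, mirroring Equal(c2, a) in the plan."""
    column: str
    value: Any


@dataclass
class SplitScan:
    splits: List[Split]
    columns: List[str]
    pushed_filters: List[Equal]

    def read(self) -> List[Dict[str, Any]]:
        out = []
        for split in self.splits:
            for row in split.rows:
                # Pushed filters are evaluated inside the scan.
                if all(row[f.column] == f.value for f in self.pushed_filters):
                    full = {**row, FILE_PATH: split.file_path}
                    # Only the required columns leave the scan.
                    out.append({c: full[c] for c in self.columns})
        return out


class SplitScanBuilder:
    """Toy builder over a fixed list of known splits (not the Paimon class)."""

    def __init__(self, splits, data_columns):
        self._splits = splits
        self._columns = list(data_columns) + [FILE_PATH]
        self._filters = []

    def prune_columns(self, required):
        # Keep only the columns the rest of the plan asks for.
        self._columns = [c for c in self._columns if c in required]
        return self

    def push_filters(self, filters):
        self._filters = list(filters)
        return self

    def build(self):
        return SplitScan(self._splits, self._columns, self._filters)


# Made-up data for: update test set c1 = 9 where c2 = 'a';
splits = [
    Split("f1.parquet", [
        {"c1": 1, "c2": "a", "c3": 10, "c4": 100},
        {"c1": 2, "c2": "b", "c3": 20, "c4": 200},
    ]),
    Split("f2.parquet", [{"c1": 3, "c2": "b", "c3": 30, "c4": 300}]),
]

scan = (
    SplitScanBuilder(splits, ["c1", "c2", "c3", "c4"])
    .prune_columns({"c2", FILE_PATH})
    .push_filters([Equal("c2", "a")])
    .build()
)
rows = scan.read()
touched_files = {row[FILE_PATH] for row in rows}
```

In this toy run the scan emits only c2 and the file path, matching the two output columns in the "after" plan, and the set of touched files is derived without ever materializing c1, c3, or c4.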
Tests
Pass CI
API and Format
No
Documentation
cc @JingsongLi @YannByron thank you
@ulysses-you Can you add a test to verify the plan?