delta
delta copied to clipboard
[Kernel][Spark DSV2][API] Return remaining filters in ScanBuilder instead of in Scan
Feature request
Which Delta project/connector is this regarding?
- [ ] Spark
- [ ] Standalone
- [ ] Flink
- [X] Kernel
- [ ] Other (fill in here)
Overview
Currently Kernel supports filtering by providing a query predicate in ScanBuilder.withFilter
. You are then able to retrieve the remaining predicates (that need to be evaluated on the returned data) after building the scan with Scan.getRemainingFilter
.
This conflicts with the Spark DSV2 APIs for predicate pushdown.
- Spark has interface SupportsPushDownV2Filters with method
pushPredicates
which returns the remaining predicate during the scan building phase:
/**
* Pushes down predicates, and returns predicates that need to be evaluated after scanning.
* <p>
* Rows should be returned from the data source if and only if all of the predicates match.
* That is, predicates must be interpreted as ANDed together.
*/
Predicate[] pushPredicates(Predicate[] predicates);
This is a suggestion to update Kernel's ScanBuilder
interface to behave similarly and return the remaining predicate during scan building. We already have access to the table's Metadata (schema and partition info) during building so there is no additional work needed besides splitting the partition and data predicates.
Motivation
- Unblock a potential future Spark DSV2 connector such that it can correctly get the remaining filter and not re-evaluate all predicates
- Unblock any other potential connectors that have similar design
I'm a bit unclear on how this "remaining filters" thing works -- data skipping isn't perfect (may return extra rows), so wouldn't we anyway have to apply all predicates against the returned data? Or is this also covering the partition filtering path, which does have perfect pruning?
@scovich This also covers the partition filtering path. We include any data filters in the "remaining filters" that still need to be evaluated on the returned data. What we don't include are any partition filters that are applied.