datafusion-ballista
Support for custom `ParquetFileReaderFactory`
Is your feature request related to a problem or challenge?
DataFusion recently added a way to provide a user-defined `AsyncFileReader` to `ParquetExec`. Currently there is no way to leverage this in Ballista, since the deserialization logic constructs a `ParquetExec` with the default implementation (which essentially just uses the registered `ObjectStore`).
Describe the solution you'd like
We should be able to leverage this feature in Ballista without overriding the entire serialization logic for physical plans.
I see two possible approaches here:
- Push this back into DataFusion and allow registration of a custom `ParquetFileReaderFactory` in the `SessionContext` somewhere, in which case it should be trivial to support in Ballista.
- Add this capability in Ballista explicitly.
For option 2, we might consider using the `PhysicalExtensionCodec` for this. We could add a method:
/// The deserialization logic will invoke this method for any
/// `PhysicalPlanType::ParquetScan` node in the serialized plan.
/// If custom deserialization is required, apply it and return the
/// deserialized result; otherwise return `None` and deserialization
/// will fall back to the default.
fn try_decode_parquet_exec(
    &self,
    scan: &ParquetScanExecNode,
) -> Result<Option<ParquetExec>, BallistaError> {
    Ok(None)
}
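To make the intended fallback behaviour concrete, here is a minimal, self-contained sketch of how the deserialization path might consult such a hook. Everything in it is a stand-in for illustration: the `ParquetScanExecNode`, `ParquetExec`, and `BallistaError` structs are simplified placeholders rather than the real Ballista/DataFusion types, the `reader` field and the `decode_parquet_scan` helper are hypothetical, and `CustomReaderCodec` is an invented example implementation.

```rust
// Simplified stand-ins; NOT the real Ballista/DataFusion types.
#[derive(Debug)]
struct ParquetScanExecNode {
    path: String,
}

#[derive(Debug, PartialEq)]
struct ParquetExec {
    path: String,
    // Hypothetical marker for which reader factory the plan node would use.
    reader: String,
}

#[derive(Debug)]
struct BallistaError(String);

trait PhysicalExtensionCodec {
    /// Proposed hook: return `Ok(Some(..))` to take over decoding of a
    /// Parquet scan node, or `Ok(None)` to request the default behaviour.
    fn try_decode_parquet_exec(
        &self,
        _scan: &ParquetScanExecNode,
    ) -> Result<Option<ParquetExec>, BallistaError> {
        Ok(None)
    }
}

/// A codec that does not override the hook.
struct DefaultCodec;
impl PhysicalExtensionCodec for DefaultCodec {}

/// A codec that plugs in a custom reader (hypothetical example).
struct CustomReaderCodec;
impl PhysicalExtensionCodec for CustomReaderCodec {
    fn try_decode_parquet_exec(
        &self,
        scan: &ParquetScanExecNode,
    ) -> Result<Option<ParquetExec>, BallistaError> {
        Ok(Some(ParquetExec {
            path: scan.path.clone(),
            reader: "custom".to_string(),
        }))
    }
}

/// Hypothetical deserialization step: ask the codec first, then fall back
/// to building the exec node with the default reader.
fn decode_parquet_scan(
    codec: &dyn PhysicalExtensionCodec,
    scan: &ParquetScanExecNode,
) -> Result<ParquetExec, BallistaError> {
    match codec.try_decode_parquet_exec(scan)? {
        Some(exec) => Ok(exec),
        None => Ok(ParquetExec {
            path: scan.path.clone(),
            reader: "default".to_string(),
        }),
    }
}

fn main() {
    let scan = ParquetScanExecNode {
        path: "data/part-0.parquet".to_string(),
    };
    println!("{:?}", decode_parquet_scan(&DefaultCodec, &scan).unwrap());
    println!("{:?}", decode_parquet_scan(&CustomReaderCodec, &scan).unwrap());
}
```

Because the trait method has a default body returning `Ok(None)`, existing codec implementations would keep compiling and keep the current behaviour; only codecs that opt in would change how Parquet scans are decoded.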
Describe alternatives you've considered
Additional context
@thinkharderdev this seems related: https://github.com/apache/arrow-datafusion/pull/3311
Thanks! I think that could potentially support this use case. Either `TableProviderFactory` or `TableProvider` would need to expose a way to get an `AsyncFileReader`.