Implement `DynamicTableProvider` in DataFusion Core
Is your feature request related to a problem or challenge?
I had some discussions with @alamb about supporting a dynamic file data source (`select ... from 'data.parquet'`, like #4805) in the core, as mentioned in https://github.com/apache/datafusion/issues/4850#issuecomment-2142190951. However, after #10745 we found that it's not a good idea to move so many dependencies (e.g., the S3-related ones) into the core crate.
Describe the solution you'd like
As @alamb proposed in https://github.com/apache/datafusion/pull/10745#issuecomment-2175817937, we can focus first on the logic that interprets table names as potential object store locations. The idea is to implement a struct `DynamicTableProvider` and a trait `UrlLookup` that resolves the `ObjectStore` at runtime.
struct DynamicTableProvider {
    // ...
    /// A callback used to look up the `ObjectStore` for a URL at runtime
    url_lookup: Arc<dyn UrlLookup>,
}
/// Trait for looking up the correct object store instance based on URL
pub trait UrlLookup {
    fn lookup(&self, url: &Url) -> Result<Arc<dyn ObjectStore>>;
}
By default, DynamicTableProvider only supports querying local file paths like file:///.... The implementation of dynamic file queries in datafusion-cli might also be based on DynamicTableProvider but will load the common object storage dependency by default.
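To make the default behaviour concrete, here is a rough, std-only sketch of what a "local files only" lookup could look like. It is not the actual implementation: `ObjectStore` and `LocalFileSystem` below are simplified stand-ins for the types in the `object_store` crate, the URL is a plain `&str` rather than `url::Url`, the error type is a `String`, and `LocalOnlyLookup`, `scheme`, and `store_for` are hypothetical names used only for illustration.

```rust
use std::sync::Arc;

/// Stand-in for `object_store::ObjectStore` (assumption: the real trait is far larger).
pub trait ObjectStore {
    fn name(&self) -> &str;
}

/// Stand-in for a local filesystem store.
pub struct LocalFileSystem;

impl ObjectStore for LocalFileSystem {
    fn name(&self) -> &str {
        "local"
    }
}

/// Same shape as the proposed trait, simplified to take `&str` and return a `String` error.
pub trait UrlLookup {
    fn lookup(&self, url: &str) -> Result<Arc<dyn ObjectStore>, String>;
}

/// Hypothetical helper: returns the part before "://", if any.
fn scheme(url: &str) -> Option<&str> {
    url.split_once("://").map(|(scheme, _rest)| scheme)
}

/// Hypothetical default lookup: accepts only `file://` URLs.
pub struct LocalOnlyLookup;

impl UrlLookup for LocalOnlyLookup {
    fn lookup(&self, url: &str) -> Result<Arc<dyn ObjectStore>, String> {
        match scheme(url) {
            Some("file") => Ok(Arc::new(LocalFileSystem)),
            Some(other) => Err(format!("unsupported scheme: {}", other)),
            None => Err(format!("not a URL: {}", url)),
        }
    }
}

pub struct DynamicTableProvider {
    // Only the lookup callback is modelled here; other fields are omitted.
    url_lookup: Arc<dyn UrlLookup>,
}

impl DynamicTableProvider {
    pub fn new(url_lookup: Arc<dyn UrlLookup>) -> Self {
        Self { url_lookup }
    }

    /// Treats the table name as a URL and delegates to the lookup callback.
    pub fn store_for(&self, table_name: &str) -> Result<Arc<dyn ObjectStore>, String> {
        self.url_lookup.lookup(table_name)
    }
}
```

With this shape, `DynamicTableProvider::new(Arc::new(LocalOnlyLookup))` would resolve `file:///tmp/data.parquet` and reject `s3://bucket/data.parquet`. A CLI could pass a different `UrlLookup` that also knows about remote stores, without the core crate depending on them.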
Describe alternatives you've considered
No response
Additional context
No response
take
Thank you @goldmedal
Hi @alamb,
I created a draft PR for this issue in #11035. After some experiments, I think passing only ObjectStore isn't enough for creating a TableProvider at runtime. We need to build the schema from a full SessionState.
Although there are many issues that need to be fixed, could you take a look at this PR to check if this idea makes sense when you're available?
Thanks.
I have finished the PR, but I think two follow-up issues need to be filed:
- https://github.com/apache/datafusion/pull/11035#discussion_r1649325843
- https://github.com/apache/datafusion/pull/11035#discussion_r1649557123