
Implement `DynamicTableProvider` in DataFusion Core

Open goldmedal opened this issue 1 year ago • 4 comments

Is your feature request related to a problem or challenge?

I had some discussions with @alamb about supporting a dynamic file data source (select ... from 'data.parquet', like #4805) in the core, as mentioned in https://github.com/apache/datafusion/issues/4850#issuecomment-2142190951. However, after #10745, we found that it's not a good idea to move so many dependencies (e.g., S3-related ones) into the core crate.

Describe the solution you'd like

As @alamb proposed in https://github.com/apache/datafusion/pull/10745#issuecomment-2175817937, we can focus first on the logic that interprets table names as potential object store locations: implement a struct DynamicTableProvider and a trait called UrlLookup that returns the ObjectStore for a URL at runtime.

struct DynamicTableProvider {
  // ...
  /// A callback function that is 
  url_lookup: Arc<dyn UrlLookup>
}

/// Trait for looking up the correct object store instance based on URL
pub trait UrlLookup {
  fn lookup(&self, url: &Url) -> Result<Arc<dyn ObjectStore>>;
}

By default, DynamicTableProvider only supports querying local file paths like file:///.... The implementation of dynamic file queries in datafusion-cli might also be based on DynamicTableProvider but will load the common object storage dependency by default.
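To make the default behavior concrete, here is a minimal std-only sketch of how a lookup might map a table name to a store by URL scheme. It is not the actual DataFusion implementation. The `ObjectStore` trait below is a tiny stand-in for `object_store::ObjectStore`, the lookup takes `&str` instead of `&Url` and returns a `String` error, and `SchemeLookup`, `local_only`, `resolve`, and `name` are all hypothetical names introduced for illustration.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Stand-in for `object_store::ObjectStore` (assumption: the real trait is far larger).
pub trait ObjectStore: Send + Sync {
    /// Hypothetical helper so the sketch can tell stores apart.
    fn name(&self) -> &str;
}

/// Stand-in for a local filesystem store.
pub struct LocalFileSystem;

impl ObjectStore for LocalFileSystem {
    fn name(&self) -> &str {
        "local"
    }
}

/// The proposed trait, simplified: `&str` replaces `&Url`, `String` replaces the error type.
pub trait UrlLookup: Send + Sync {
    fn lookup(&self, url: &str) -> Result<Arc<dyn ObjectStore>, String>;
}

/// Hypothetical lookup keyed by URL scheme.
pub struct SchemeLookup {
    stores: HashMap<String, Arc<dyn ObjectStore>>,
}

impl SchemeLookup {
    /// Default configuration: only `file://` URLs resolve.
    pub fn local_only() -> Self {
        let mut stores: HashMap<String, Arc<dyn ObjectStore>> = HashMap::new();
        stores.insert("file".to_string(), Arc::new(LocalFileSystem));
        Self { stores }
    }
}

impl UrlLookup for SchemeLookup {
    fn lookup(&self, url: &str) -> Result<Arc<dyn ObjectStore>, String> {
        // A plain table name such as `my_table` has no scheme and is rejected here.
        let (scheme, _rest) = url
            .split_once("://")
            .ok_or_else(|| format!("not a URL: {url}"))?;
        self.stores
            .get(scheme)
            .cloned()
            .ok_or_else(|| format!("no object store registered for scheme '{scheme}'"))
    }
}

/// Simplified provider holding only the lookup callback.
pub struct DynamicTableProvider {
    pub url_lookup: Arc<dyn UrlLookup>,
}

impl DynamicTableProvider {
    /// Interpret a table name as a potential object store location.
    pub fn resolve(&self, table_name: &str) -> Result<Arc<dyn ObjectStore>, String> {
        self.url_lookup.lookup(table_name)
    }
}
```

Under these assumptions, a provider built with `SchemeLookup::local_only()` resolves `file:///tmp/data.parquet` but returns an error for `s3://bucket/data.parquet`. A CLI could register more schemes in the same map without the core crate depending on them.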

Describe alternatives you've considered

No response

Additional context

No response

goldmedal avatar Jun 18 '24 16:06 goldmedal

take

goldmedal avatar Jun 18 '24 16:06 goldmedal

Thank you @goldmedal

alamb avatar Jun 18 '24 20:06 alamb

Hi @alamb,

I created a draft PR for this issue in #11035. After some experiments, I think passing only ObjectStore isn't enough for creating a TableProvider at runtime. We need to build the schema from a full SessionState.

Although there are many issues that need to be fixed, could you take a look at this PR to check if this idea makes sense when you're available?

Thanks.

goldmedal avatar Jun 20 '24 18:06 goldmedal

I have finished the PR, but I think two follow-up issues need to be filed:

  • https://github.com/apache/datafusion/pull/11035#discussion_r1649325843
  • https://github.com/apache/datafusion/pull/11035#discussion_r1649557123

goldmedal avatar Jun 22 '24 05:06 goldmedal