glaredb icon indicating copy to clipboard operation
glaredb copied to clipboard

Chore: refactor datasources

Open universalmind303 opened this issue 11 months ago • 3 comments

Description

proposal to do some internal refactoring on table functions and datasources to make adding new ones easier, while also making better usage of incremental compilation.

Some background context..

It is a pretty big pain point that table syntax is completely separate from table function syntax. We usually don't implement these at the same time due to the added complexity of the table syntax.

select * from read_excel(...) # table function
create external table t from excel options (...) # table syntax

The code also lives in very different parts of the application and is quite difficult to follow.

Ideally we should be able to create a new datasource in isolation and register it all in one shot. Currently we have to

  1. create the TableProvider impl ex: crates/datasources/src/excel/mod.rs
  2. create the FunctionEntry (for the catalog) ex: crates/sqlbuiltins/src/functions/table/excel.rs:21
  3. create the TableFunc crates/sqlbuiltins/src/functions/table/excel.rs:21
  4. create the TableOptions proto crates/protogen/proto/metastore/options.proto:208
  5. create the TableOptions crates/protogen/src/metastore/types/options.rs:1053`
  6. correlate the options variant (this happens in quite a few spots in options.rs)
  7. finally connect it into the planner: crates/sqlexec/src/planner/session_planner.rs:903

That's a lot of different parts of the code that need to be changed. Datasources should instead just program to a trait or traits, we register them in the registry, and then all of the above should just work. Reducing the amount of code to just two. (datasource implementation, and registry).

Proposal

The sqlbuiltins::functions module currently provides

  • TableFunc (used to convert a table function into a table provider)
  • BuiltinFunction this should be renamed to FunctionCatalogEntry or similar (used to register a function within the catalog)

and datafusion provides us with TableProvider trait that is used for creating record batches.

Currently we have no traits for the table options syntax so it is manually implemented each time, but we probably need one. DynTableOptions (bikeshedding)

I suggest that we create a new crate that provides all of this functionality wrapped up in a single trait Datasource: TableFunc + FunctionCatalogEntry + TableProvider + TryFrom<DynTableOptions> (bikeshedding). then that can be used to populate the catalog, function registry, and planner all at once.

So when adding a new datasource, you just implement this set of traits, and plug it in as Arc<dyn Datasource> somewhere and we're good to go.

universalmind303 avatar Mar 06 '24 00:03 universalmind303

I like this!

  • there's a natural connection between the TableOptions trait (whatever form that takes) and the arguments to the table function? I don't think we need to design it now, but I think it's worth calling out early.
  • I agree that we need some common infrastructure for specifying options to tables. my guess is that the proto serialization for the makes this harder than it should be. similar sort of early highlight.
  • there have been some discussions about changing the way users should specify tables. I don't think any of those changes would impact this part of the implementation, but I think it is true that the parser of the DDL statements needs to have some interface to the datasources (or vice versa) and I'm not sure if this plan covers this case.
  • I think #2333 ends up being "just another option", but I think, having done that recently, I want to make sure "adding features" to our tables remains possible, and doesn't get more difficult.
  • I'd like us to think a bit more about the "plug it in somewhere part of the story (and when the plugging happens,) One of the big benefits we get out of this is the dependency injection, and this is a bit under specified and something that we'll all have to interact with a bunch.
  • We might want to have the registry system to have versions for implementations (e.g. like different versions of duckdb which have incompatible files.) Feels generally useful in the long term. This doesn't need to be in scope now but it might be nice to file it under future work so we don't preclude it.
  • I think it's worth calling out that we do some inferring of data sources outside of "create table" and "table function": the "provide something that looks like a path and we do some selection off of extensions". Being able to add and manage extensions for this case might make the registry a bit more complicated.
  • related to the previous, the object-store data sources, depend on and involve the csv/parquet/json implementations. I am generally of the opinion that object-store data sources with implicit format detection is probably more confusing than it is useful, but it does mean that there are a cluster of datafusion-native data sources that depend on each other.
  • I'm curious how all of this stuff interacts with our native delta storage (is it totally orthogonal? is there a possibility for making that more simple?)
  • edited to add: I was just poking in this code, but we should also add to this plan to cover COPY TO which is all implemented in sqlexec and "feels" somewhat external to the data sources, but maybe shouldn't be.

Hope these help. I'm pretty excited for this to get more simple!

tychoish avatar Mar 06 '24 01:03 tychoish

I love this idea - it opens up the possibility for guides to make it easy for contributors to add new data sources, too.

talagluck avatar Mar 06 '24 09:03 talagluck

unsolicited opinion but generally agree with the basic premise, subscribed and excited to see where this goes!

hailelagi avatar Mar 06 '24 15:03 hailelagi