datafusion-ballista icon indicating copy to clipboard operation
datafusion-ballista copied to clipboard

Automatically register tables if env var specified

Open avantgardnerio opened this issue 3 years ago • 14 comments

Which issue does this PR close?

Closes #500.

Rationale for this change

Described in issue.

What changes are included in this PR?

  1. upgrade of datafusion & arrow
  2. config plumbing
  3. conversion of default_session_builder to a supserset of functionality (now with state)
  4. default table factories
  5. a bunch of flight sql work for database introspection (more required in future PRs)

Are there any user-facing changes?

avantgardnerio avatar Nov 08 '22 00:11 avantgardnerio

@andygrove this allows users to see tables in ballista from DataGrip (and probably tableau):

image

avantgardnerio avatar Nov 08 '22 00:11 avantgardnerio

@yahoNanJing if you have a minute, I'd appreciate your feedback as well.

avantgardnerio avatar Nov 08 '22 16:11 avantgardnerio

I'm excited about this feature! I will test this out later this week.

andygrove avatar Nov 08 '22 18:11 andygrove

image

We can now see columns

avantgardnerio avatar Nov 08 '22 20:11 avantgardnerio

@stuartcarnie you might be interested in this as well. I can start copying you on FlightSQL related PRs if you like.

avantgardnerio avatar Nov 09 '22 20:11 avantgardnerio

@avantgardnerio great, thank you!

stuartcarnie avatar Nov 09 '22 22:11 stuartcarnie

I tried this out and did run into an issue when connecting from DataGrip:

INTERNAL: Failed to create SessionContext: Context("Could not create table for file:///mnt/bigdata/tpch/sf10-parquet/partsupp.parquet at /home/andy/.cargo/git/checkouts/arrow-datafusion-71ae82d9dec9a01c/7b5842b/datafusion/core/src/catalog/listing_schema.rs:116", ObjectStore(Generic { store: "LocalFileSystem", source: UnableToReadBytes { source: Os { code: 21, kind: IsADirectory, message: "Is a directory" }, path: "/mnt/bigdata/tpch/sf10-parquet/partsupp.parquet" } })).

The code assumes that files in DATAFUSION_CATALOG_LOCATION will be single files and not directories containing files.

andygrove avatar Nov 10 '22 00:11 andygrove

I tried this out

Thank you so much for your review! I realize you cloned and did manual testing and that probably took a fair bit of time, which I really appreciate :)

did run into an issue

It definitely works with csvs with headers, and there's a test for that. I'll look into parquet directory support tomorrow. If it's not a big deal, let's get it in this PR. If it isn't trivial though, I'd prefer to document it in another issue and get this PR merged - with the logic that getting the framework in place for auto-registration is the bigger deal, and once it's in we can extend it for all the permutations of parameters.

avantgardnerio avatar Nov 10 '22 03:11 avantgardnerio

great feature! 👍 Is it possible to support custom CatalogProvider in the next step?

r4ntix avatar Nov 10 '22 09:11 r4ntix

Thank you so much for your review! I realize you cloned and did manual testing and that probably took a fair bit of time, which I really appreciate :)

Yes, this is one of the challenges with this project. Many PRs need to be tested end-to-end and this does take time, which is why I mostly only review larger PRs at weekends.

andygrove avatar Nov 10 '22 15:11 andygrove

support custom CatalogProvider

@r4ntix could you explain your use-case? It may already be covered. For example, when this gets merged, deltatables should work out of the box, as long as there are TableProviderFactorys registered for them.

avantgardnerio avatar Nov 10 '22 15:11 avantgardnerio

support custom CatalogProvider

@r4ntix could you explain your use-case? It may already be covered. For example, when this gets merged, deltatables should work out of the box, as long as there are TableProviderFactorys registered for them.

@avantgardnerio according to the current implementation, the automatically registered tables are put into the datafusion.default.{table} datafusion schema. In my case, for tenant isolation, I need to be able to dynamically register tables with different schema or catalog, eg: datafusion.{tenant}.{table} or {tenant}.default.{table}. So is it possible to support CatalogProvider in SessionBuilder to support such case?

r4ntix avatar Nov 11 '22 06:11 r4ntix

is it possible to support CatalogProvider in SessionBuilder

I don't see why not. You could follow the patterns in this PR but with a CatalogProvider instead.

avantgardnerio avatar Nov 11 '22 22:11 avantgardnerio

is it possible to support CatalogProvider in SessionBuilder

I don't see why not. You could follow the patterns in this PR but with a CatalogProvider instead.

Ok, I will try submit a RP for this, thx

r4ntix avatar Nov 14 '22 02:11 r4ntix