kedro-plugins
New dataset databricks.ExternalTableDataset
Description
The existing dataset databricks.ManagedTableDataset doesn't allow specifying the location of the stored files, which is crucial in some setups. PR #251 already addresses this, but it seems to be stale.
Context
I develop a number of kedro projects that are deployed to Databricks. Having a single dataset that handles both pandas and Spark DataFrames, and can write to (and read from) a Databricks database, would be a lifesaver, as long as I could specify the path.
Possible Implementation
In Spark, it suffices to add the path option to make a table external. I'm not sure if it would be as simple here, though.
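For reference, a minimal sketch of the Spark behaviour mentioned above: setting a path option on the writer before calling saveAsTable stores the data at that location, so the table is registered as external rather than managed. The helper name save_as_external_table and its defaults are made up for illustration; this is not how ManagedTableDataset is implemented.

```python
def save_as_external_table(df, table_name, path, file_format="delta", mode="overwrite"):
    """Write a Spark DataFrame as an external table (hypothetical helper).

    `df` is assumed to be a pyspark.sql.DataFrame; `table_name` is a
    metastore name such as "my_db.my_table"; `path` is the storage location.
    """
    (
        df.write.format(file_format)
        .mode(mode)
        # An explicit path is what makes the table external: dropping the
        # table later removes only the metadata, not the files at `path`.
        .option("path", path)
        .saveAsTable(table_name)
    )
```

Without the option call, the same chain would create a managed table whose location is chosen by the metastore.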
Possible Alternatives
Adding an argument to ManagedTableDataset is also an option, but then the table wouldn't really be managed, which might cause some confusion.
Hi @KrzysztofDoboszInpost, thanks for opening this issue. Do you want to take over #251? Checking out the branch and opening a new PR should suffice.
Sure, as soon as I'm able to :)
In the meantime: would you rather create a separate ExternalTableDataset that shares a lot of code with ManagedTableDataset (possibly through inheritance), or just add an option to set the path (as in the current PR) and risk a little confusion among Databricks users?
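To make the two options concrete, here is a rough sketch using stand-in classes. None of these are the real kedro-datasets classes, and all class, field, and method names are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TableSketch:
    """Stand-in for a managed-table dataset (not the real class)."""
    database: str
    table: str

    def full_name(self) -> str:
        return f"{self.database}.{self.table}"

    def writer_options(self) -> dict:
        # Managed table: no explicit location, the metastore picks one.
        return {}


@dataclass
class ExternalTableSketch(TableSketch):
    """Option A: a separate subclass that requires a location."""
    location: str = ""

    def writer_options(self) -> dict:
        if not self.location:
            raise ValueError("an external table needs a location")
        return {"path": self.location}


@dataclass
class TableWithOptionalPathSketch(TableSketch):
    """Option B: one class with an optional location argument."""
    location: Optional[str] = None

    def writer_options(self) -> dict:
        # The same class silently switches between managed and external.
        return {"path": self.location} if self.location else {}
```

Option A keeps the managed/external distinction visible in the class name and reuses the shared code through inheritance; option B is a smaller change, but the class name no longer says which kind of table gets created.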
This deserves some investigation indeed :) Let's continue the discussion here until we're clear on the path forward. I'll add this to our backlog.