kedro-plugins
kedro-plugins copied to clipboard
`kedro-datasets`: Add Ibis.SQLDataset
Description
Is your feature request related to a problem? A clear and concise description of what the problem is: "I'm always frustrated when ..."
There is a pandas.SQLDataset - ibis is a dataframe first library, but it's a superset on pandas.SQLDataset in a few way:
- It support a lot more engine out of the box - in a way you can view it as a generic SQLDataset (work with snowflake too)
- It can mix SQL and dataframe operation together (not necessary common, but could be quite useful for migration. i.e. you already have a lot of SQL script pipeline, so you can use the old SQL as it but writing new code in dataframe (or SQL ) if you want.
- It's lazily evaluated - which allows user to break SQL into multiple nodes. Current pandas SQLDataset always materialise the change. This is 1. not efficient, 2. limit the ability to reuse SQL logic. (In theory you can write pure SQL Kedro pipeline in ibis, though if this is the only use case,
dbtis probably a better choice.
In any case, this is still a big improvement and give user more flexibility.
Context
Why is this change important to you? How would you use it? How can it benefit other users?
Possible Implementation
(Optional) Suggest an idea for implementing the addition or change. Will raise a following PR soon.
You can use Table.sql to execute SQL queries on top of Ibis tables. Since Ibis execution is deferred anyway, there's no real downside to just loading the table using an ibis.TableDataset and then performing the query.
The main benefit I can think of for ibis.SQLDataset would be that you could define a SQL query in your catalog without having to create a node for it—but I don't know how good of an idea this actually is? If anything, based on your justifications above, it still feels more natural to just have the SQL logic in the nodes.
@deepyaman
You can use Table.sql to execute SQL queries on top of Ibis tables
True, but it's not obvious. The TableDataset also being very clear that it is a table, if I end up writing SQL with joins (other table), or even just regular SQL (the table argument is not optional, but will neither be used in the SQL itself which feels weird). I feel like there is some decision make already, thus we have TableDataset, FileDataset. The SQLDataset may not be necessary if we just have a generic ibis dataset. Is it more confusing if the ibis.TableDataset takes sql as argument?
I think ibis SQL support is strictly better than pandas, and I do want to recommend people to switch over at the same time not forcing them to learn everything about ibis, rather just as a universal database SQL dataset.
If anything, based on your justifications above, it still feels more natural to just have the SQL logic in the nodes. Maybe - It's always a bit blurry if I am selecting a few columns (is it I/O or transformation logic?). I do see the point it's not idiomatic to Kedro, but at the same I find it weird to have a node with just SQL string inside (plus this doesn't have proper format support etc), then you end up moving the SQL back to a file, and then try to load up a file to pass it to a node. The Kedro way of doing this feels a bit over engineered just to fit things how Kedro prefer.
Is it more confusing if the
ibis.TableDatasettakessqlas argument?
Yes, I think this is more confusing. :)
I think
ibisSQL support is strictly better thanpandas, and I do want to recommend people to switch over at the same time not forcing them to learn everything aboutibis, rather just as a universal database SQL dataset.
That's a fair point.