kedro-plugins
feat(datasets) Experimental - sqlframe datasets
Description
This PR leverages sqlframe, a new dataframe library which targets SQL backends (e.g. duckdb/bigquery/postgres) but exposes the PySpark DataFrame API, without the JVM or actually running Spark itself.
This has two major benefits for users:
- Like Ibis, it allows users to leverage SQL platforms as an execution engine in addition to a storage engine. Approaches like our `pandas.SQLTableDataset` are naive in the sense that they don't use the SQL engine for processing, only storage.
- For users already accustomed to Spark syntax, or brownfield projects already written in Spark, this provides a low-friction adoption route.
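To make the storage-versus-execution distinction concrete, here is a minimal sketch using only the standard-library `sqlite3` module as a stand-in backend. It is not sqlframe or Ibis code, and the `sales` table and its values are made up for illustration. The first half mimics the storage-only pattern (pull every row out, aggregate on the client); the second pushes the same aggregation into the SQL engine, which is roughly what a SQL-generating dataframe library does on your behalf.

```python
import sqlite3

# Stand-in backend: an in-memory SQLite database with a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("south", 5.0), ("north", 2.5)],
)

# Storage only: fetch every row, then aggregate in Python.
totals_client = {}
for region, amount in conn.execute("SELECT region, amount FROM sales"):
    totals_client[region] = totals_client.get(region, 0.0) + amount

# Execution engine: let the database do the aggregation and return
# only the (much smaller) result set.
totals_engine = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)

print(totals_client)
print(totals_engine)
```

Both approaches give the same totals; the difference is where the work happens and how much data leaves the database.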
Development notes
- This has been tested locally in the terminal, I've not yet written formal tests. Experimental mode baby 😎 .
- I've also done some funky `OmegaConf` resolver stuff so that the SQL connection can be lazily defined in YAML without creating a super complicated dataset class whilst still supporting dynamic switching of back-ends.
Checklist
- [ ] Opened this PR as a 'Draft Pull Request' if it is work-in-progress
- [x] Updated the documentation to reflect the code changes
- [ ] Added a description of this change in the relevant `RELEASE.md` file
- [ ] Added tests to cover my changes
What do we want to do about this PR/dataset?
@noklam I'm going to redo it, v2.0.0 just came out and it makes the API much better
@datajoely Are you still keen on polishing this? And maybe a silly question but this looks very similar to the two Ibis datasets we now have - does it make sense to have both?
Hi @merelcht I'm going to close this and reopen it at a later date - the library was changing at a rate of knots so I held off finishing this.
I do think both this and Ibis deserve to be supported in Kedro - the onboarding penalty in Ibis is the sticking point as you need to change your existing codebase. This is much easier to adopt for existing Spark users and unlocks a bunch of different execution engines.
My recommendation would be to use Ibis if you're starting something new, and SQLFrame if you're thinking about migrating an existing project off Spark.