kedro-plugins

feat(datasets) Experimental - sqlframe datasets

Open datajoely opened this issue 1 year ago • 2 comments

Description

This leverages sqlframe, a new dataframe library which targets SQL backends (e.g. duckdb/bigquery/postgres) but exposes the PySpark DataFrame API syntax, without the JVM or actually running Spark itself.

This has two major benefits for users:

  • Like Ibis, it allows users to leverage SQL platforms as an execution engine in addition to a storage engine. Approaches like our pandas.SQLTableDataset are naive in the sense that they don't use the SQL engine for processing, only for storage.
  • For users already accustomed to Spark syntax, or brownfield projects already written in Spark, this provides a low-friction adoption route.
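
To make the "execution engine vs. storage engine" distinction concrete, here is a minimal sketch using only Python's standard-library `sqlite3`. It is not sqlframe or Kedro code; the `sales` table and both function names are made up for illustration. The first function treats the database as storage only (every row is pulled into Python), while the second lets the SQL engine do the aggregation and returns only the result.

```python
import sqlite3

# Hypothetical in-memory table used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("south", 5.0), ("north", 2.5)],
)


def totals_storage_only(conn):
    """Storage-only approach: fetch every row, then aggregate in Python."""
    totals = {}
    for region, amount in conn.execute("SELECT region, amount FROM sales"):
        totals[region] = totals.get(region, 0.0) + amount
    return totals


def totals_pushed_down(conn):
    """Execution-engine approach: the database aggregates, only results return."""
    query = "SELECT region, SUM(amount) FROM sales GROUP BY region"
    return dict(conn.execute(query).fetchall())
```

Both functions return the same totals; the difference is where the work happens and how much data crosses the connection, which is what matters on a remote warehouse.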

Development notes

  • This has been tested locally in the terminal; I've not yet written formal tests. Experimental mode baby 😎.
  • I've also done some funky OmegaConf resolver stuff so that the SQL connection can be lazily defined in YAML without creating a super complicated dataset class whilst still supporting dynamic switching of back-ends.
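
The idea of a lazily defined, switchable connection can be sketched roughly as follows. This is not the implementation in this PR and does not use OmegaConf; it is a standard-library stand-in, and `BACKENDS`, `LazyConnection`, and the `config` dict (standing in for parsed YAML) are all hypothetical names. The point is only that the config names a back-end, and no connection is opened until something actually asks for it.

```python
import sqlite3
from typing import Any, Callable, Dict

# Hypothetical registry: back-end name -> connection factory.
# Only sqlite is registered here so the sketch stays self-contained.
BACKENDS: Dict[str, Callable[..., Any]] = {
    "sqlite": lambda database=":memory:": sqlite3.connect(database),
}


class LazyConnection:
    """Holds connection settings and connects only on first use."""

    def __init__(self, backend: str, **kwargs: Any) -> None:
        self._backend = backend
        self._kwargs = kwargs
        self._conn = None

    @property
    def opened(self) -> bool:
        return self._conn is not None

    def get(self) -> Any:
        # Create the connection on first access, then reuse it.
        if self._conn is None:
            self._conn = BACKENDS[self._backend](**self._kwargs)
        return self._conn


# Stand-in for what a YAML catalog entry might parse to (assumed shape).
config = {"backend": "sqlite", "database": ":memory:"}
lazy = LazyConnection(**config)
```

Switching back-ends then means changing the `backend` value in config rather than the dataset class, assuming a factory for that back-end has been registered.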

Checklist

  • [ ] Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • [x] Updated the documentation to reflect the code changes
  • [ ] Added a description of this change in the relevant RELEASE.md file
  • [ ] Added tests to cover my changes

datajoely avatar May 22 '24 16:05 datajoely

What do we want to do about this PR/dataset?

noklam avatar Jul 23 '24 11:07 noklam

@noklam I'm going to redo it, v2.0.0 just came out and it makes the API much better

datajoely avatar Aug 02 '24 10:08 datajoely

@datajoely Are you still keen on polishing this? And maybe a silly question but this looks very similar to the two Ibis datasets we now have - does it make sense to have both?

merelcht avatar Oct 18 '24 11:10 merelcht

Hi @merelcht, I'm going to close this and reopen it at a later date - the library was changing at a rate of knots, so I held off finishing this.

I do think both this and Ibis deserve to be supported in Kedro - the onboarding penalty in Ibis is the sticking point as you need to change your existing codebase. This is much easier to adopt for existing Spark users and unlocks a bunch of different execution engines.

My recommendation would be: use Ibis if you're starting something new; use SQLFrame if you're thinking about migrating an existing project off Spark.

datajoely avatar Oct 18 '24 12:10 datajoely