kedro-plugins
kedro-plugins copied to clipboard
feat(datasets): Added the Experimental PolarsDatabaseDataset
Description
This PR adds the PolarsDatabaseDataset to support interactions with databases using Polars.
Fixes https://github.com/kedro-org/kedro-plugins/issues/853
Development notes
Quite a bit of code has been copied over from SQLQueryDataset in this implementation.
These changes have been tested,
- Manually, by running the code locally to load and save Polars DataFrames from and to SQLite files. ~~2. Via the existing and newly added unit tests.~~
Checklist
- [X] Opened this PR as a 'Draft Pull Request' if it is work-in-progress
- [ ] Updated the documentation to reflect the code changes
- [ ] Added a description of this change in the relevant
RELEASE.mdfile - [ ] Added tests to cover my changes
- [ ] Received approvals from at least half of the TSC (required for adding a new, non-experimental dataset)
Hey @noklam, @deepyaman,
I was able to come up with this implementation for the PolarsDatabaseDataset by extending SQLQueryDataset and it seems to work quite well (at least load() does).
Should we implement save() as well? This would require a table name to be provided as parameter.
Or do you have different thoughts on how this dataset ought to be implemented?
Hey @noklam, @deepyaman, I was able to come up with this implementation for the
PolarsDatabaseDatasetby extendingSQLQueryDatasetand it seems to work quite well (at leastload()does).
Sorry for the late response; didn't see this.
While there may be opportunities to reduce code copying more broadly, most datasets just inherit from AbstractDataset or AbstractVersionedDataset. Here, inheriting from pandas.SQLQueryDataset adds a pandas dependency, so we shouldn't do that.
Should we implement
save()as well? This would require a table name to be provided as parameter.
Probably, because Polars supports it. You can make the table name optional but require it for save.
Hey @noklam and @deepyaman,
I've refactored this by removing the dependency on pandas.SQLQueryDataset as @deepyaman suggested. I copied over quite a bit of code from there though since these work quite similarly. I think the duplication can definitely be removed to a large extent by moving some of the common functions to _utils. Should I handle this in a different PR since pandas.SQLQueryDataset is already a fixed (non-experimental) dataset?
Let me know what you think and I will add the finishing touches to this PR (tests and so forth).
Please add to https://github.com/MinuraPunchihewa/kedro-plugins/blob/feature/polars_database_dataset/kedro-datasets/docs/pages/api/kedro_datasets_experimental/index.md so the documentation appears in https://kedro--990.org.readthedocs.build/projects/kedro-datasets/en/990/pages/api/kedro_datasets/ 🙏🏼
Done!
Now https://kedro--990.org.readthedocs.build/projects/kedro-datasets/en/990/pages/api/kedro_datasets_experimental/polars.PolarsDatabaseDataset.md gives HTTP 404 though 👀
Now https://kedro--990.org.readthedocs.build/projects/kedro-datasets/en/990/pages/api/kedro_datasets_experimental/polars.PolarsDatabaseDataset.md gives HTTP 404 though 👀
Oh, apologies. I thought I had already added the docs, but I haven't. I will do it soon.
Hi @MinuraPunchihewa, just following up - do you have a plan to complete that update?
I'm interested in this dataset 😄 will try to bring it back to life
This is now working https://kedro--990.org.readthedocs.build/projects/kedro-datasets/en/990/api/kedro_datasets_experimental/polars.PolarsDatabaseDataset/
Ready for a round of code reviews ✔️
Apologies, guys. I could not find the time earlier to finish this up. @astrojuanlu Thank you very much for your help.
I'll address @ElenaKhaustova 's comment ☝🏼 but how do I fix the detect-secrets thing?
I'll address @ElenaKhaustova 's comment ☝🏼 but how do I fix the detect-secrets thing?
I've addressed the comment and fixed the detect-secrets check as well.