kedro
kedro copied to clipboard
Package `kedro.extras.datasets` into its own `kedro-datasets` package
Problem
Currently Kedro as a framework and kedro.extras.datasets
are one package and this has caused a number of issues:
- Kedro Framework was held back adding support for Python 3.9/3.10 by breaking changes in the datasets dependencies
- Adding support for newer versions of the datasets was held back by the need for more conservative versioning of the framework
- The difference of breaking change velocity between the Framework (more stable) and the datasets (less stable) meant a painful accounting of what was new in 0.18 and what was merely a change in a dataset
Staged approach
Ripping kedro.extras.datasets
from the framework is itself a breaking change, but we can phase this out in two stages:
- Make a separate package and alias everything in
kedro.extras.datasets
to that new package, including installingkedro[xxx]
- When we ship 0.19, we can officially remove
kedro.extras.datasets
and now bothkedro
andkedro-datasets
will be installed separately
Tasks for first stage (0.18.X)
- [x] #1492
- [x] Copy all datasets to the new
kedro-datasets
package including the relevantextras_require=
https://github.com/kedro-org/kedro/blob/1b1558952c059eea5636d9ccf9a883f9cf4ef643/setup.py#L107-L148 - [x] Move all dataset tests to
kedro-datasets
- [x] Copy all datasets to the new
- [x] #1493 ~- [x] https://github.com/kedro-org/kedro/issues/1693~
- [ ] #1850
- [ ] #1494
- [ ] #1711
~-[ ] #1501~ This no longer needs to be done after it was decided to namespace
kedro-datasets
- [ ] #1495
- [ ] #1497
- [ ] #1651
- [ ] Alter documentation from
pip install kedro[...]
to the newpip install kedro kedro-datasets[...]
Tasks for second stage (preparing release 0.19)
- [ ] Make Kedro not instantiate datasets from
kedro.extras.datasets
- [ ] Remove all datasets-related
extras_require=
fromkedro
- [ ] Add
kedro-datasets
inkedro[all]
- [ ] Rename all mentions of
DataSet
toDataset
- [ ] Remove
kedro.extras.datasets
(orkedro.extras
if all other things are moved/removed) - [ ] #1498
- [ ] #1502
This is great! 🚀 Just to wanted another motivation for doing this, and also some more tasks that will be needed. Datasets and their dependencies are the number 1 cause of random CI failures which have slowed us down quite a lot. I presume that your intention was that all the dataset tests should move to kedro-datasets
also? If so then I'd add to "tasks for first stage":
- [x] Move all dataset tests to
kedro-datasets
- [ ] Try and figure out a nicer way of installing all the dependencies listed in the setup.py file rather than needing to maintain a separate
test_requirements.txt
file there - [ ] Clean up kedro's test_requirements.txt to remove requirements that aren't needed any more
- [ ] Clean up kedro's CI config and Makefile. e.g. no need to
conda install pytables
any more or have the separatemake test-no-spark
. There's quite a few dataset-specific workarounds that we could remove
Also I think we should have:
- [ ] Add deprecation warnings to all the
kedro.extras.datasets
(or theAbstractDataset
classes? Not sure whether they're also moving) to explain they're moving
One small question: were you thinking of kedro-datasets
living in kedro-plugins
or a whole new repo? It's not a plugin in the strict sense of the word since it doesn't use entrypoints, but maybe it's easier to maintain if it's in the monorepo.
One small question: were you thinking of
kedro-datasets
living inkedro-plugins
or a whole new repo? It's not a plugin in the strict sense of the word since it doesn't use entrypoints, but maybe it's easier to maintain if it's in the monorepo.
I like this suggestion! True kedro-datasets
isn't technically a plugin, but if we add it to kedro-plugins
we can make use of the already existing CI setup and add in automatic push to pypi
for all plugins in one go.
Some more things we should do as part of never work in kedro.extras.datasets
is make sure this is clear to open source contributors by adding it to the contribution guide, and at the top of every dataset class and perhaps also in the PR template.
Follow up on some unresolved issues discussed in https://github.com/kedro-org/kedro-plugins/pull/38
- How will docs work for
kedro-datasets
? We will need to generate documentation (at least the dataset part) fromkedro-datasets
repo. - Deepyaman has an idea with namespace package (i.e. we keep the namespace
kedro.extra.datasets
in thekedro-datasets
repo.
Here's just a wild idea, I've played a bit with Python's meta_path
and came up with an interesting PoC about aliasing imports globally as shown here:
import sys
import importlib
from importlib.abc import MetaPathFinder
class Custom(MetaPathFinder):
def find_spec(fullname, path, target=None):
if not 'kedro.datasets' in fullname:
return None
return importlib.util.find_spec('kedro_datasets')
sys.meta_path.append(Custom)
import kedro.datasets as kd
The code here allows you to import anything from kedro_datasets
as if it were kedro.datasets
. I did not explore it enough for potential problems, but maybe we can investigate whether we can somehow create the alias kedro.datasets
work out of the box for any Kedro project? WDYT?
The modification of the meta_path
can happen in configure_project
so it works in all entrypoints. I am not sure how it will work with language servers or IDE support though...
@idanov That's very interesting, and I didn't know about this functionality. ~That being said, if we're going to investigate this, I think we should prioritize spending some time to make it work using namespace packages, because that would ideally be the "standard" solution to the problem (that doesn't involve path hacking).~ I just saw some of the discussion around why namespace packages may not work in https://github.com/kedro-org/kedro/issues/1758, and even though I don't fully understand all of the reasons having not spent time on it (except for some of the challenges with pip install -e
), this could be a good alternative.
Closing this because all sub-tasks are done and Kedro 0.19.0
and kedro-datasets 2.0.0
will be released soon.