kedro icon indicating copy to clipboard operation
kedro copied to clipboard

Package `kedro.extras.datasets` into its own `kedro-datasets` package

Open idanov opened this issue 2 years ago • 3 comments

Problem

Currently Kedro as a framework and kedro.extras.datasets are one package and this has caused a number of issues:

  • Kedro Framework was held back adding support for Python 3.9/3.10 by breaking changes in the datasets dependencies
  • Adding support for newer versions of the datasets was held back by the need for more conservative versioning of the framework
  • The difference of breaking change velocity between the Framework (more stable) and the datasets (less stable) meant a painful accounting of what was new in 0.18 and what was merely a change in a dataset

Staged approach

Ripping kedro.extras.datasets from the framework is itself a breaking change, but we can phase this out in two stages:

  1. Make a separate package and alias everything in kedro.extras.datasets to that new package, including installing kedro[xxx]
  2. When we ship 0.19, we can officially remove kedro.extras.datasets and now both kedro and kedro-datasets will be installed separately

Tasks for first stage (0.18.X)

  • [x] #1492
    • [x] Copy all datasets to the new kedro-datasets package including the relevant extras_require= https://github.com/kedro-org/kedro/blob/1b1558952c059eea5636d9ccf9a883f9cf4ef643/setup.py#L107-L148
    • [x] Move all dataset tests to kedro-datasets
  • [x] #1493 ~- [x] https://github.com/kedro-org/kedro/issues/1693~
  • [ ] #1850
  • [ ] #1494
  • [ ] #1711 ~-[ ] #1501~ This no longer needs to be done after it was decided to namespace kedro-datasets
  • [ ] #1495
  • [ ] #1497
  • [ ] #1651
  • [ ] Alter documentation from pip install kedro[...] to the new pip install kedro kedro-datasets[...]

Tasks for second stage (preparing release 0.19)

  • [ ] Make Kedro not instantiate datasets from kedro.extras.datasets
  • [ ] Remove all datasets-related extras_require= from kedro
  • [ ] Add kedro-datasets in kedro[all]
  • [ ] Rename all mentions of DataSet to Dataset
  • [ ] Remove kedro.extras.datasets (or kedro.extras if all other things are moved/removed)
  • [ ] #1498
  • [ ] #1502

idanov avatar Apr 20 '22 18:04 idanov

This is great! 🚀 Just to wanted another motivation for doing this, and also some more tasks that will be needed. Datasets and their dependencies are the number 1 cause of random CI failures which have slowed us down quite a lot. I presume that your intention was that all the dataset tests should move to kedro-datasets also? If so then I'd add to "tasks for first stage":

  • [x] Move all dataset tests to kedro-datasets
  • [ ] Try and figure out a nicer way of installing all the dependencies listed in the setup.py file rather than needing to maintain a separate test_requirements.txt file there
  • [ ] Clean up kedro's test_requirements.txt to remove requirements that aren't needed any more
  • [ ] Clean up kedro's CI config and Makefile. e.g. no need to conda install pytables any more or have the separate make test-no-spark. There's quite a few dataset-specific workarounds that we could remove

Also I think we should have:

  • [ ] Add deprecation warnings to all the kedro.extras.datasets (or the AbstractDataset classes? Not sure whether they're also moving) to explain they're moving

One small question: were you thinking of kedro-datasets living in kedro-plugins or a whole new repo? It's not a plugin in the strict sense of the word since it doesn't use entrypoints, but maybe it's easier to maintain if it's in the monorepo.

antonymilne avatar Apr 21 '22 07:04 antonymilne

One small question: were you thinking of kedro-datasets living in kedro-plugins or a whole new repo? It's not a plugin in the strict sense of the word since it doesn't use entrypoints, but maybe it's easier to maintain if it's in the monorepo.

I like this suggestion! True kedro-datasets isn't technically a plugin, but if we add it to kedro-plugins we can make use of the already existing CI setup and add in automatic push to pypi for all plugins in one go.

Some more things we should do as part of never work in kedro.extras.datasets is make sure this is clear to open source contributors by adding it to the contribution guide, and at the top of every dataset class and perhaps also in the PR template.

merelcht avatar Apr 26 '22 14:04 merelcht

Follow up on some unresolved issues discussed in https://github.com/kedro-org/kedro-plugins/pull/38

  1. How will docs work for kedro-datasets? We will need to generate documentation (at least the dataset part) from kedro-datasets repo.
  2. Deepyaman has an idea with namespace package (i.e. we keep the namespace kedro.extra.datasets in the kedro-datasets repo.

noklam avatar Jun 27 '22 22:06 noklam

Here's just a wild idea, I've played a bit with Python's meta_path and came up with an interesting PoC about aliasing imports globally as shown here:

import sys
import importlib
from importlib.abc import MetaPathFinder

class Custom(MetaPathFinder):
    def find_spec(fullname, path, target=None):
        if not 'kedro.datasets' in fullname:
            return None
       return importlib.util.find_spec('kedro_datasets')

sys.meta_path.append(Custom)

import kedro.datasets as kd

The code here allows you to import anything from kedro_datasets as if it were kedro.datasets. I did not explore it enough for potential problems, but maybe we can investigate whether we can somehow create the alias kedro.datasets work out of the box for any Kedro project? WDYT?

The modification of the meta_path can happen in configure_project so it works in all entrypoints. I am not sure how it will work with language servers or IDE support though...

idanov avatar Jan 03 '23 12:01 idanov

@idanov That's very interesting, and I didn't know about this functionality. ~That being said, if we're going to investigate this, I think we should prioritize spending some time to make it work using namespace packages, because that would ideally be the "standard" solution to the problem (that doesn't involve path hacking).~ I just saw some of the discussion around why namespace packages may not work in https://github.com/kedro-org/kedro/issues/1758, and even though I don't fully understand all of the reasons having not spent time on it (except for some of the challenges with pip install -e), this could be a good alternative.

deepyaman avatar Jan 03 '23 13:01 deepyaman

Closing this because all sub-tasks are done and Kedro 0.19.0 and kedro-datasets 2.0.0 will be released soon.

merelcht avatar Nov 27 '23 11:11 merelcht