[KED-2821] Convert all imports according to packaging specs in pyproject.toml
At `kedro pipeline package` time, check if there are any packaging specs in `pyproject.toml` (keys under `[tool.kedro.pipeline.package]`). If so, refactor all imports according to the aliasing made explicit in the manifest.
```toml
[tool.kedro.pipeline.package]
"pipelines.alpha" = {alias = "beta"}
"pipelines.alpha.utils" = {alias = "sigma"}
"other_package" = {alias = "utilities", destination = "dist/libraries"}
```
In the example above, `utils` should be renamed to `sigma` (and any references inside it to `alpha` should refer to `beta`), `alpha` to `beta` (and any references inside it to `utils` should refer to `sigma`), etc.
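To make the intended rewrite concrete, here is a hypothetical before/after sketch (assuming a project package named `my_project`, which is not from the original issue, and assuming aliased packages become top-level packages in the distributed artifact):

```python
# Before packaging, inside a module of pipelines/alpha:
from my_project.pipelines.alpha import nodes          # references alpha itself
from my_project.pipelines.alpha.utils import helpers  # references alpha.utils

# After refactoring against the manifest above:
from beta import nodes      # "pipelines.alpha" was aliased to "beta"
from sigma import helpers   # "pipelines.alpha.utils" was aliased to "sigma"
```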
- ❓ How to work out the ideal order for the refactoring?
- ❓ Would working on a shadow copy of the whole (or part of the) project help here?
- ❓ Check how nested packages would work, e.g. `alpha` and `alpha.utils`. Consider doing top-level packages first, and then the nested ones.
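On the nested-packages question, one workable rule is to match each import against the most specific (longest) spec first, so that `pipelines.alpha.utils.x` becomes `sigma.x` rather than `beta.utils.x`. A minimal sketch of that resolution, not Kedro's actual implementation:

```python
# Minimal sketch: rewrite a dotted module path using the most specific alias
# spec that prefixes it, so nested specs take precedence over their parents.
ALIASES = {
    "pipelines.alpha": "beta",
    "pipelines.alpha.utils": "sigma",
}

def resolve(module_path: str) -> str:
    for key in sorted(ALIASES, key=len, reverse=True):  # longest key first
        if module_path == key or module_path.startswith(key + "."):
            return ALIASES[key] + module_path[len(key):]
    return module_path  # no spec applies; leave the import untouched

assert resolve("pipelines.alpha.utils.helpers") == "sigma.helpers"
assert resolve("pipelines.alpha.nodes") == "beta.nodes"
```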
For more context see comments on the original issue, copied over below.
We talked about the experience of packaging/pulling modular pipelines, and how we can actually think of it as micro-packaging any part of the Python code in a Kedro project, at arbitrary depth in the nested folder structure. We can think of `pyproject.toml` as an additional `requirements.txt`, one that deals with "internal" dependencies.
Worked example detailed in https://github.com/kedro-org/kedro/issues/1304#issuecomment-1055387856.
Main ideas & proposed actions:
- Switch from `.whl` to `sdist` (built distribution vs source distribution; the latter is what we actually need). Our wheels currently look like two packages (one for tests, one for src), and because of that they aren't guaranteed to work with any other Python package. The less customisation we have, the better. `sdist` seems like the appropriate standard for this (sharing source code), and tests can be specified in the sdist's metadata rather than in another package. Users can generate a `wheel` file from the `sdist`, which makes it more flexible if they want to share wheels or source code with the client.
- Allow alias to be a Python module path ↔️ can package any subsection of the project
- Make packaging work with any path relative to `src/<package_name>`, i.e. not just `pipelines` ↔️ can package utilities
- Aliasing improvements:
  - Phase I: Look at import paths: if self-contained to that module, package; else, raise an error (upstream importing is not allowed).
  - Phase II (on top of Phase I): Look at import paths: if in `pyproject.toml`, rework imports to the new location/name; else, raise an error. (See the sketch after this list.)
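A minimal sketch of the Phase I/II check described above, using the standard library `ast` module. This is not Kedro's implementation; `TARGET`, `ALIASES`, and the `pkg` package name are assumptions reusing the worked example further down:

```python
import ast

# Hypothetical inputs: the module being packaged and the alias specs from
# pyproject.toml, expressed as full dotted paths inside the project package.
TARGET = "pkg.pipelines.clean_pipeline.nullify"
ALIASES = {"pkg.utils.cleaning.impute": "banking_impute"}

def _within(module: str, prefix: str) -> bool:
    return module == prefix or module.startswith(prefix + ".")

def check_imports(source: str) -> list[str]:
    """Classify project-internal `from` imports as keep, rewrite, or error."""
    decisions = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ImportFrom) or not node.module:
            continue
        if not _within(node.module, "pkg"):
            continue  # looks third-party or stdlib: out of scope here
        if _within(node.module, TARGET):
            decisions.append(f"keep: {node.module}")     # Phase I: self-contained
        elif any(_within(node.module, key) for key in ALIASES):
            decisions.append(f"rewrite: {node.module}")  # Phase II: aliased
        else:
            raise ImportError(f"{node.module} is not packageable: add it to pyproject.toml")
    return decisions
```

Applied to the three imports in the worked example below, this keeps the `nullify` import, marks the `impute` import for rewriting to `banking_impute`, and rejects the `windowing` one.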
Out of scope: when a path is added to the `PYTHONPATH` and the imports look like third-party imports, when they're actually from the local project, somewhere upstream. We treat them as third party and that's it.
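For instance, with a hypothetical `<repo>/libs` directory added to `PYTHONPATH` (a layout invented here for illustration, not from the original discussion):

```python
# shared_helpers actually lives at <repo>/libs/shared_helpers in the project,
# but because <repo>/libs is on PYTHONPATH the import is indistinguishable
# from a third-party package, so the packaging workflow treats it as one.
from shared_helpers import normalise_columns
```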
Open question: ❓ How to declare the explicit dependency between micro-packages ❓ (also how do we know if the second micro-package should come as source code or directly as packaged asset?)
Translating whiteboard exercise
In a tree like:
```
<package_name>
|__ pipelines
|   |__ clean_pipeline
|       |__ nullify
|           |__ pipeline.py
|           |__ nodes.py
|__ utils
    |__ features
    |__ cleaning
        |__ impute
        |   |__ main.py
        |   |__ sth.py
        |__ nullify
            |__ func.py
            |__ sth.py
```
With a `pyproject.toml` like:
```toml
[tool.kedro.pipeline.package]
"utils.cleaning.impute" = {alias = "banking_impute"}
"utils.cleaning.nullify" = {alias = "banking_nullify"}

[tool.kedro.pipeline.pull]
"banking_impute" = {alias = "pkg.utils.cleaning"}
```
If we try to package `nullify` from `clean_pipeline` (i.e. `pipelines.clean_pipeline.nullify`), we look at the imports inside:
```python
from pkg.pipelines.clean_pipeline.nullify import nodes  # can package, no problem
from pkg.utils.cleaning.impute import main  # turn into `from banking_impute import main`, then package
from pkg.utils.features.windowing import daily  # can't package, needs to be at least in `pyproject.toml`
```
As mentioned before, upstream imports are not allowed. In practice: if we try to package `utils.cleaning.impute`, the following import would make the packaging workflow error out:
```python
from pkg.utils import another_utility
```
> Open question: ❓ How to declare the explicit dependency between micro-packages ❓ (also how do we know if the second micro-package should come as source code or directly as packaged asset?)
This question is being addressed separately in #1478
This issue was discussed in Technical Design on 14/12
- In https://github.com/kedro-org/kedro/pull/2119 it was discovered that the feature to add `micropkg` specs in `pyproject.toml` hasn't worked since March 2022. Nobody has reported this.
- The team agreed that before implementing more advanced functionality for the `micropkg` specs in `pyproject.toml`, we need to investigate whether users are using this functionality: https://github.com/kedro-org/kedro/issues/2123
- It would also be worth talking to the Alloy team to hear how they do packaging and whether they find the micro-packaging functionality in Kedro useful or not.
Is this whole import mangling worth the effort? Maybe we could tell users "refactor common code/functions that are not in the pipeline itself into a separate package, then declare it as a dependency".
(A simplistic question meant to spark discussion and, in line with what @merelcht said right above, to talk to our users about the micropackaging workflow.)
I think we need to decide what to do with the micropackaging feature as a whole before introducing any more features like the suggested one here.
Closing for now; we can always revisit in the future.