Poetry Support for Kedro Projects
Description
The way Kedro initiates a new project and creates the folder structure does not work well with Poetry. Usually I create a Poetry environment before doing anything else and then install my required packages one by one. After I created a Poetry environment and added the kedro package, the pyproject.toml looks as follows:
poetry new --src KedroPoetry
[tool.poetry]
name = "KedroPoetry"
version = "0.1.0"
description = ""
authors = ["Your Name <[email protected]>"]
[tool.poetry.dependencies]
python = "^3.8"
kedro = {version = "~0.18.2", python = ">=3.8,<3.11"}
[tool.poetry.dev-dependencies]
pytest = "^7.1"
... (more lines)
Let's run the demo pytest to see if everything works.
poetry run pytest .
This goes well:
KedroPoetry|⇒ poetry run pytest .
platform darwin -- Python 3.8.12, pytest-7.1.2, pluggy-1.0.0
rootdir: /Users/ALAMSHC/PythonProjects/KedroPoetry
collected 1 item
tests/test_kedropoetry.py . [100%]
Now it's time to add a Kedro project: kedro new
The command completely ignored the existing pyproject.toml file. Although there is already a src folder, it did not add the project inside src; instead it created a new directory at the root, outside of src. There is also no Kedro section in pyproject.toml, so the kedro CLI will complain about a broken setup.
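To make the missing-section problem concrete, here is a minimal sketch (not part of Kedro) that checks whether a pyproject.toml defines a [tool.kedro] table. The function name has_kedro_section is hypothetical, and the tomli fallback assumes that third-party backport is installed on Python older than 3.11.

```python
try:
    import tomllib  # standard library from Python 3.11 onwards
except ModuleNotFoundError:
    import tomli as tomllib  # assumption: the "tomli" backport is installed


def has_kedro_section(pyproject_text: str) -> bool:
    """Return True if the TOML text defines a [tool.kedro] table."""
    data = tomllib.loads(pyproject_text)
    return isinstance(data.get("tool", {}).get("kedro"), dict)


# A Poetry-only pyproject.toml, similar to the one generated above.
poetry_only = """
[tool.poetry]
name = "KedroPoetry"
version = "0.1.0"

[tool.poetry.dependencies]
python = "^3.8"
"""

print(has_kedro_section(poetry_only))  # False
```

A file for which this returns False would need a Kedro section added by hand before the kedro CLI can treat the directory as a project.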
Context
As Poetry provides one of the modern approaches to packaging Python projects, it would be good for Kedro to directly support a Poetry-like project structure, or at least to offer a hackable way out.
Possible Implementation
There could be a new CLI flag to initialise a Kedro project when there is already a pyproject.toml file and a project set up for Poetry.
Thanks for raising the issue. It would be great if you could provide some kind of tree output to show the folder structure, as I haven't used poetry myself. Even better if you can create a demo GitHub repository so I can clone it and play around with it.
In general, a "Kedro Project" itself is the top directory. Do you currently have a workaround?
Let's say your new project is called new_project:
- kedro new # STDIN = new_project
- copy everything out of the directory one level up (I imagine you have to manually merge the pyproject.toml as well)
Sure, on my way. Will share a git repo soon.
@noklam please find the demo project in the following git repo. The code generation steps and some ideas about the expected behaviour are given in the README file in the repository.
The way Kedro initiates a new project and creates the folder structure does not work well with Poetry. Usually I create a Poetry environment before doing anything else and then install my required packages one by one.
@DataPsycho poetry new creates a Poetry project template, whereas kedro new creates a Kedro project template (which, by default, is pip-based). However, Kedro also provides the ability to use other templates, through Kedro starters. I think it would make sense to create a Poetry starter for Kedro, if you want kedro new to play optimally with Poetry. You wouldn't use the poetry new command in that case, but you'd get a Poetry-compatible project.
First of all, thank you @DataPsycho for writing a very detailed README which is easy to follow. I think @deepyaman's approach is preferable.
(kedropoetry-9Q6y5a-v-py3.8) datapsycho@dataops:~/.../KedroPoetry$ tree . -L 1
.
├── poetry.lock
├── pyproject.toml
├── README.md
├── sample-project
├── src
└── tests
With the structure that you provided, you basically need to copy everything inside sample-project up into the same directory. I think you will have clashes on README.md and pyproject.toml.
Here are the potential alternatives I am thinking about:
- Starting from a fresh project with kedro new --starter=poetry
  It should just work, no extra file deletion needed
- Starting from an existing poetry project
  kedro new --starter=poetry # assume project named sample-project
  kedro run # Should just work out of the box
  Then you will need to copy the files out of sample-project into the same directory
  You will still have clashes on these files, since I think it's not straightforward to auto-resolve/merge them
  ❌ README.md
  ❌ pyproject.toml # library dependencies etc.
  ✅ There will be no extra setup.py to delete
  ✅ There will be no requirements.txt to delete
So you will save 2 delete operations with this workflow, but you still have to deal with resolving the dependencies. It would be easier to just start with a poetry-compatible project from the start. Thoughts?
Hi, there are more files to move around.
(kedropoetry-9Q6y5a-v-py3.8) datapsycho@dataops:~/.../KedroPoetry$ tree . -L 1
.
├── poetry.lock
├── pyproject.toml
├── README.md
├── sample-project
├── src
└── tests
After I get that structure, I have to do the following moves:
- Copy the contents of sample-project > pyproject.toml into the root pyproject.toml
- Copy conf, data, notebook, docs, logs into the project root
- Copy sample-project > src > sample_project into src > sample_project
- Copy sample-project > src > tests into src > tests
- Install all the required packages and delete the sample-project directory
Now the Kedro CLI and Poetry are in harmony.
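The moves listed above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: lift_project is a hypothetical helper, not a Kedro or Poetry feature, and it deliberately leaves anything that already exists at the root (such as README.md, pyproject.toml and src) untouched so that those can be merged by hand.

```python
from __future__ import annotations

import shutil
from pathlib import Path


def lift_project(generated: Path, root: Path) -> list[str]:
    """Move the contents of `generated` into `root`.

    Entries whose name already exists in `root` are not moved; their names
    are returned so they can be merged manually.
    """
    clashes: list[str] = []
    for item in sorted(generated.iterdir()):
        target = root / item.name
        if target.exists():
            clashes.append(item.name)  # e.g. README.md, pyproject.toml, src
            continue
        shutil.move(str(item), str(target))
    return clashes
```

Called as lift_project(Path("sample-project"), Path(".")), it would move conf, data and similar folders up and report the entries that still need manual merging. The sample-project directory itself is left in place until those are resolved.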
I always have to start with Poetry first. Using Poetry, I add kedro as a package to the project's virtual environment, and then I am able to use kedro. The reverse is not what a Poetry user would do: create a venv, install poetry in it and activate it, then create a new project with kedro, go inside the project and start Poetry in the repo, which creates another, new Poetry venv. Now the old venv is of no use.
@DataPsycho Is there any difference between that and just going inside sample-project, selecting and cutting everything, and pasting it one level up?
I don't quite understand why it necessarily creates another Poetry environment. I may test it out tomorrow.
It's fine to do the copy-pasting. But Kedro is a package with a CLI, whereas Poetry is an environment management and package management system. How do I use Kedro from the start without Poetry or Pipenv?
- Create a virtualenv
- Install Kedro with pip
- Init a project with Kedro
- cd into the project and install the packages in src/requirements.txt
If I install Kedro in the base Python image:
- Using the base Kedro, init a project
- Create a virtual environment for the project
- Install all the packages from src/requirements.txt, which Kedro generates
But now I am locked to that Kedro version and cannot move between versions; I have to create all my projects with the same Kedro version. So this is a no-go for me.
If I want to use Poetry:
- I have to create a Poetry project with poetry new --src myproject
- cd into the project and add kedro as a package dependency with poetry add kedro
  Poetry will create a virtualenv for that particular project while installing Kedro (Pipenv actually does the same)
- Now I have to initialise the Kedro project and start copy-pasting and restructuring manually, as explained above
To be able to use Kedro, I first have to create a virtual environment and install Kedro in it. But Poetry is responsible for creating the virtual environment and adding Kedro to it. So if I want to use Kedro to initialise a project, I need a virtualenv with kedro; but after initialising the project, when I cd into it and initialise Poetry with poetry init, Poetry wants to create another environment, and the Kedro virtual environment that was used to initialise the project is of no further use. That way I have to create two environments and delete the first one if I want to use Poetry.
@DataPsycho I agree this is not the smoothest experience.
I just want to mention that your installed kedro version isn't necessarily tied to the version of your new project. By default, if you have 0.18.1, it will generate a 0.18.1 Kedro template, but you can override that default if necessary. So this may be a workaround if you need to create new Kedro projects frequently.
--checkout TEXT An optional tag, branch or commit to checkout in the starter repository.
OK, then we can close the feature request, I guess. Thanks for your support and the time you have spent. @deepyaman's idea was great. I will see if I have time to create a new starter for a Poetry-like project structure. For now we can close it; I will close it tomorrow if you have nothing to add. Thanks.
@DataPsycho This example that we have in our tests is a good starting point: https://github.com/kedro-org/kedro/blob/main/features/steps/test_plugin/plugin.py
You can find more info on how you can extend it with kedro new --starter=custom_starter in this link:
https://kedro.readthedocs.io/en/0.18.2/extend_kedro/plugins.html?highlight=kedrostarterspec#extend-starter-aliases
A new starter might be added for Poetry/Pipenv.
I'm reopening this because I think it's a very good topic and I'd be interested in hearing from other users about it 🙂 It's been mentioned several times before by different people but we've never had thoughts collected together in one place, so let's start doing that here! In the past we've also wondered whether we should switch to using poetry. Currently we support a pip-compile workflow but we're planning to remove that in favour of just a plain requirements.txt file. Given https://github.com/kedro-org/kedro/issues/1724, it might be time to re-assess what system we use exactly.
Some previous related issues (there's probably others too): https://github.com/kedro-org/kedro/issues/398 https://github.com/kedro-org/kedro/issues/391
From these and other conversations I know the following users have independently shown interest in kedro + poetry: @fkromer @danhje @Kastakin @Larkinnjm1 @shaunc. There's also been interest within QB, though I'm not sure exactly who. So I definitely think there's some significant interest in this. @datajoely do you know anyone else here?
Carlos Bareto, but I don't know his GitHub handle
TBH, I like the idea of adding support for Poetry in Kedro projects. I think the main advantages of Poetry are:
- it's a widely used package manager (more than pip-compile at least)
- it eliminates the need for setup.py
- it provides a better way to manage project/dev requirements.
I agree with @arnaldog12. Also, I integrated my current project with Poetry. If you want, I can share that as a poetry starter.
Much appreciate the initiative. Happy to share whatever I learned while trying to develop the starter template.
One note for posterity on using Poetry with Kedro projects: a fix that's especially relevant to Kedro projects was added in Poetry 1.2.0b3. Before that version, you need to make sure to define any extras like pandas.csvdataset in all lowercase.
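For anyone stuck on an older Poetry, the lowercase workaround can be applied mechanically. The following is a small sketch, not something Poetry or Kedro ships; lowercase_extras is a hypothetical helper that only lowercases the text inside the first pair of square brackets of a requirement string.

```python
import re


def lowercase_extras(requirement: str) -> str:
    """Lowercase the extras inside the first [...] of a requirement string."""
    return re.sub(
        r"\[([^\]]*)\]",
        lambda match: "[" + match.group(1).lower() + "]",
        requirement,
        count=1,
    )


print(lowercase_extras("kedro[pandas.CSVDataSet]~=0.18.2"))
# kedro[pandas.csvdataset]~=0.18.2
```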
So if I'm new to poetry and new to kedro, and I've installed poetry 1.2.0rc1, what's the best way to proceed at this point?
If there is still no better way, follow what I did earlier in this thread to make Kedro compatible with Poetry.
I've read both this and the closed issue and I haven't found any mention of the relationship between kedro run and poetry run. You might assume that the user always executes poetry shell, but in truth the correct way to execute things within a poetry environment is using poetry run.
So assuming I have a starter which is both kedro and poetry compliant, do we expect to use poetry run kedro run...?
kedro provides an entrypoint for your project in the __main__.py file. So if you add the following to your pyproject.toml
[tool.poetry.scripts]
my_project = "my_project.__main__:main"
you can run poetry run my_project -p ... which feels pretty natural.
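For readers unfamiliar with the "module:function" notation used in that scripts entry, here is a simplified sketch of how such a string maps to a callable. It is not how Poetry itself implements script installation; resolve_entry_point is a hypothetical helper, and the demonstration uses a standard-library target because my_project only exists inside a generated Kedro project.

```python
import importlib


def resolve_entry_point(spec: str):
    """Resolve a 'package.module:attribute' string to the object it names."""
    module_name, _, attr_path = spec.partition(":")
    obj = importlib.import_module(module_name)
    if attr_path:
        for attr in attr_path.split("."):
            obj = getattr(obj, attr)
    return obj


# With the entry above, the spec would be "my_project.__main__:main".
dumps = resolve_entry_point("json:dumps")
print(dumps({"ok": True}))  # {"ok": true}
```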
For folks subscribed to this old issue: we're (1) modernizing the way Kedro projects are structured, to make them look more similar to normal Python libraries https://github.com/kedro-org/kedro/milestone/36 and (2) looking into ways to initialize a Kedro project in an existing directory #2512.
Our idea though is to favor PEP 621 compliant pyproject.toml files, which are not yet supported by Poetry https://github.com/python-poetry/poetry/issues/3332 so it will still take us some time to get there. The good news is that we would be very close to actual support, and maybe by that time Poetry will be soft-compatible with PEP 621 already.
Today I found a project that uses Poetry + Kedro: https://github.com/madziejm/project-fontr
People subscribed to this issue, could you have a look and let us know what else we can do to better support this use case? Otherwise I'm voting to close the issue.
Hi! Great that you are working on it and hopefully poetry moves to PEP 621 soon :)
For people wondering how you can use a conda-poetry-kedro setup for now, I use it as follows:
- Set up conda and install poetry and kedro
conda create --name myenv
conda activate myenv
conda install -c conda-forge poetry
pip install kedro
- Create new kedro project
kedro new
- Init poetry env (within the activated conda env)
poetry init
- Install the "kedro" dependencies from src\requirements.txt in the conda env
poetry add "black~=22.0"
poetry add "flake8>=3.7.9,<5.0"
poetry add "ipython>=7.31.1, <8.0; python_version < '3.8'"
poetry add "ipython~=8.10; python_version >= '3.8'"
poetry add "isort~=5.0"
poetry add "jupyter~=1.0"
poetry add "jupyterlab~=3.0"
poetry add "kedro~=0.18.13"
poetry add "kedro-datasets[pandas.CSVDataSet, pandas.ExcelDataSet, pandas.ParquetDataSet]~=1.0"
poetry add "kedro-telemetry~=0.2.0"
poetry add "kedro-viz~=6.0"
poetry add "nbstripout~=0.4"
poetry add "pytest-cov~=3.0"
poetry add "pytest-mock>=1.7.1, <2.0"
poetry add "pytest~=7.2"
poetry add "scikit-learn~=1.0"
I use poetry add to ensure that all dependencies are stored in the pyproject.toml, but you could also install them directly using poetry run pip install -r src/requirements.txt. In that case they are not registered in the pyproject.toml but are installed in your env.
- Delete redundant files src\pyproject.toml and src\requirements.txt.
- Run kedro in conda-poetry with kedro run
You should end up with a pyproject.toml looking like this (see .txt), which you can then use in the future to init your poetry env directly using poetry install --no-root.
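Typing one poetry add per requirement gets tedious, so the step above could be scripted. The sketch below is a hypothetical helper (poetry_add_args is not part of Poetry or Kedro) that turns the text of a requirements.txt into argument lists for poetry add, skipping comments, pip options and exact duplicates. It assumes each remaining line is something poetry add accepts, as in the commands listed above.

```python
from __future__ import annotations


def poetry_add_args(requirements_text: str) -> list[list[str]]:
    """Turn requirements.txt lines into argv lists for `poetry add`."""
    seen: set[str] = set()
    commands: list[list[str]] = []
    for raw in requirements_text.splitlines():
        line = raw.split(" #", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith(("#", "-")):
            continue  # blank lines, comment lines, pip options such as -r
        if line in seen:
            continue  # skip exact duplicates
        seen.add(line)
        commands.append(["poetry", "add", line])
    return commands


sample = """\
# dev tools
black~=22.0
jupyterlab~=3.0
jupyterlab~=3.0
kedro~=0.18.13
"""
for argv in poetry_add_args(sample):
    print(argv)
```

Each returned list can be passed to subprocess.run(argv, check=True) from inside the Poetry project, which avoids shell-quoting problems with environment markers.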
Hi @ac-willeke
Thanks for the tips. But doesn't using conda along with poetry seem redundant? Both serve as virtual environment and package managers.
Many use conda to specify the Python version within the virtual env; another option is pyenv.
Hi!
Yes, I agree conda/poetry is redundant. I used to combine the two in projects with libraries that are not easily installed using poetry. For example, python bindings for gdal (dependent on C++) are not that easy to install if you don't have admin rights. So then I would start my project with conda, install gdal, install all other packages using poetry (as I like the clean structure of poetry).
But I recently moved to gdal images from docker, so then you can use solely poetry as a package manager :)
So maybe my example above with the conda/poetry env was not the best, sorry for that!
Did some experiments today and I confirm Kedro supports Poetry. Or Poetry supports Kedro, depending on how you want to look at it.
Starting point:
❯ tree
.
├── README.md
├── pyproject.toml
└── src
    └── test_poetry
        └── __init__.py
3 directories, 3 files
❯ cat pyproject.toml
[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
[tool.poetry]
name = "test-poetry"
version = "0.1.0"
description = ""
authors = ["Juan Luis Cano Rodríguez <[email protected]>"]
readme = "README.md"
[tool.poetry.dependencies]
python = "^3.10"
Then added the necessary files (for example using https://github.com/astrojuanlu/kedro-init):
❯ kedro-init .
[00:05:14] Looking for existing package directories cli.py:25
[00:05:20] Initialising config directories cli.py:25
Creating modules cli.py:25
🔶 Kedro project successfully initialised! cli.py:26
❯ tree
.
├── README.md
├── conf
│   ├── base
│   └── local
├── pyproject.toml
└── src
    └── test_poetry
        ├── __init__.py
        ├── pipeline_registry.py
        └── settings.py
6 directories, 5 files
❯ git diff
diff --git a/pyproject.toml b/pyproject.toml
index 26ac21c..cdcbbd4 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -11,3 +11,8 @@ readme = "README.md"
[tool.poetry.dependencies]
python = "^3.10"
+
+[tool.kedro]
+project_name = "test-poetry"
+package_name = "test_poetry"
+kedro_init_version = "0.18.14"
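The diff above is small enough to reproduce by hand. As a rough sketch (this is not how kedro-init is implemented, and ensure_kedro_table is a hypothetical name), the same three keys could be appended to an existing pyproject.toml like this:

```python
import re


def ensure_kedro_table(pyproject_text: str, project_name: str,
                       package_name: str, kedro_init_version: str) -> str:
    """Append a [tool.kedro] table unless the text already has one."""
    if re.search(r"^\[tool\.kedro\]\s*$", pyproject_text, flags=re.MULTILINE):
        return pyproject_text  # already configured: leave untouched
    table = (
        "\n[tool.kedro]\n"
        f'project_name = "{project_name}"\n'
        f'package_name = "{package_name}"\n'
        f'kedro_init_version = "{kedro_init_version}"\n'
    )
    return pyproject_text.rstrip("\n") + "\n" + table


before = '[tool.poetry.dependencies]\npython = "^3.10"\n'
after = ensure_kedro_table(before, "test-poetry", "test_poetry", "0.18.14")
print(after)
```

The key names are taken from the diff in this thread. Other Kedro versions may expect different keys, so treat them as an assumption.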
Now everything works:
❯ kedro registry list
- __default__
❯ kedro pipeline create data_processing
Using pipeline template at: '/private/tmp/test-poetry/.venv/lib/python3.10/site-packages/kedro/templates/pipeline'
Creating the pipeline 'data_processing': OK
Location: '/private/tmp/test-poetry/src/test_poetry/pipelines/data_processing'
Creating '/private/tmp/test-poetry/src/tests/pipelines/data_processing/__init__.py': OK
Creating '/private/tmp/test-poetry/src/tests/pipelines/data_processing/test_pipeline.py': OK
Creating '/private/tmp/test-poetry/conf/base/parameters_data_processing.yml': OK
Pipeline 'data_processing' was successfully created.
❯ tree | grep -v '\.pyc$'
.
├── README.md
├── conf
│   ├── base
│   │   └── parameters_data_processing.yml
│   └── local
├── pyproject.toml
└── src
    ├── test_poetry
    │   ├── __init__.py
    │   ├── __pycache__
    │   ├── pipeline_registry.py
    │   ├── pipelines
    │   │   └── data_processing
    │   │       ├── __init__.py
    │   │       ├── nodes.py
    │   │       └── pipeline.py
    │   └── settings.py
    └── tests
        └── pipelines
            └── data_processing
                ├── __init__.py
                └── test_pipeline.py
I don't think there's anything else we'll do for now. kedro new will likely keep using setuptools for the time being. Now that Kedro projects are mostly Python libraries, people can initialise them any way they want (poetry init, poetry new, pdm init, flit init), add some extra files and configs, and work normally.
I'm closing this issue, feel free to keep commenting if you disagree.
Hey @astrojuanlu, I wasn't able to pip install the kedro-init package. Did this make it into a release within kedro?