
Overview: Kedro's dependencies and what to do about Cookiecutter

Open lrcouto opened this issue 8 months ago • 6 comments

The original issue: Kedro has a lot of dependencies

  • We have an ongoing discussion on Kedro's number of dependencies and whether users perceive Kedro to be a "heavy" framework (https://github.com/kedro-org/kedro/issues/3659)
  • On top of that, some of our dependencies bring in many dependencies of their own. The most notorious example is Cookiecutter, which has a pretty hefty dependency tree:
cookiecutter
├── Jinja2<4.0.0,>=2.7
│   └── MarkupSafe>=2.0
├── arrow
│   ├── python-dateutil>=2.7.0
│   │   └── six>=1.5
│   └── types-python-dateutil>=2.8.10
├── binaryornot>=0.4.4
│   └── chardet>=3.0.2
├── click<9.0.0,>=7.0
├── python-slugify>=4.0.0
│   └── text-unidecode>=1.3
├── pyyaml>=5.3.1
├── requests>=2.23.0
│   ├── certifi>=2017.4.17
│   ├── charset-normalizer<4,>=2
│   ├── idna<4,>=2.5
│   └── urllib3<3,>=1.21.1
└── rich
    ├── markdown-it-py>=2.2.0
    │   └── mdurl~=0.1
    └── pygments<3.0.0,>=2.13.0
  • Given this, we've decided to start figuring out ways to decouple Kedro from some of those dependencies.
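To put a number on the tree above, here is a minimal sketch that walks it and counts the transitive dependencies. The `DIRECT_DEPS` mapping is transcribed by hand from the tree in this issue (version pins dropped), and `transitive_deps` is a hypothetical helper written for illustration, not something Kedro or Cookiecutter provides.

```python
# Direct requirements, transcribed by hand from the tree in this issue.
# Version specifiers are dropped; leaf packages are simply absent as keys.
DIRECT_DEPS = {
    "cookiecutter": [
        "Jinja2", "arrow", "binaryornot", "click",
        "python-slugify", "pyyaml", "requests", "rich",
    ],
    "Jinja2": ["MarkupSafe"],
    "arrow": ["python-dateutil", "types-python-dateutil"],
    "python-dateutil": ["six"],
    "binaryornot": ["chardet"],
    "python-slugify": ["text-unidecode"],
    "requests": ["certifi", "charset-normalizer", "idna", "urllib3"],
    "rich": ["markdown-it-py", "pygments"],
    "markdown-it-py": ["mdurl"],
}


def transitive_deps(package: str, graph: dict[str, list[str]]) -> set[str]:
    """Return every package reachable from `package`, excluding itself."""
    seen: set[str] = set()
    stack = [package]
    while stack:
        current = stack.pop()
        for dep in graph.get(current, []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


all_deps = transitive_deps("cookiecutter", DIRECT_DEPS)
rich_deps = transitive_deps("rich", DIRECT_DEPS)
print(f"cookiecutter pulls in {len(all_deps)} packages")
print(f"rich alone accounts for {len(rich_deps) + 1} of them")
```

The counts only reflect the tree as pasted here; a real environment may resolve a different set depending on the Cookiecutter version and on packages already installed.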

Attempting to remove Rich

  • We've been using Rich with Kedro to handle logging, as well as our IPython integration and some of our prompting.
  • Some of our users do not want Rich as a mandatory library, since plain output is preferable in some situations. The way the log output is formatted can also introduce line breaks that cause problems for other functions that consume the log output. (https://github.com/kedro-org/kedro/issues/1752) / (https://github.com/kedro-org/kedro/issues/1733) / (https://github.com/kedro-org/kedro/issues/3276)
  • However, when attempting to create an option to use Kedro without Rich, we've been bumping into the issue of Cookiecutter having Rich as one of its dependencies, making us dependent on Rich by proxy. (https://github.com/kedro-org/kedro/issues/2928)
  • The only way we can currently run Kedro without needing Rich is by downgrading Cookiecutter to a version before they themselves added Rich as one of their dependencies, which is hacky and not ideal. (https://github.com/kedro-org/kedro/pull/3838)
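For illustration, an "optional Rich" setup could look roughly like the sketch below: use Rich's log handler when the package is importable and fall back to a plain stream handler otherwise. `build_handler` is a hypothetical name and this is not how Kedro's logging is actually wired; the only third-party call assumed is `rich.logging.RichHandler`.

```python
import importlib.util
import logging


def build_handler() -> logging.Handler:
    """Return a Rich handler if Rich is importable, else a plain one."""
    if importlib.util.find_spec("rich") is not None:
        # Imported lazily so this module still loads without Rich.
        from rich.logging import RichHandler

        return RichHandler()
    # Plain output: no markup, no wrapping added by Rich.
    return logging.StreamHandler()


logger = logging.getLogger("optional_rich_demo")
logger.addHandler(build_handler())
logger.setLevel(logging.INFO)
logger.info("Logging works with or without Rich installed")
```

Note that a fallback like this only helps once Rich can really be absent. As long as Cookiecutter is a hard dependency and requires Rich, the first branch is always taken.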

The Cookiecutter Issue

  • We are currently using Cookiecutter to handle our project and pipeline creation flows. That has the effect of making this whole process completely tied to Cookiecutter. (https://github.com/kedro-org/kedro/issues/3884#issuecomment-2161062978)
  • In the current project creation flow, everything from kedro new onwards builds up a data structure that is passed as a parameter to the cookiecutter() function, which creates the project from the desired template.
graph TD
    A[kedro new]
    B[Initialize flag_inputs]
    C[Validate flag_inputs]
    D[Get starters_dict]
    E{starter_alias in starters_dict?}
    F[Set template_path and directory]
    G[Set selected_tools to lowercase]
    H[Create tmpdir]
    I[Get cookiecutter_dir]
    J[Get prompts_required]
    K{config_path provided?}
    L[Make cookiecutter_context]
    M[Cleanup tmpdir]
    N[Get extra_context]
    O[Make cookiecutter_args]
    P{telemetry_consent provided?}
    Q[Validate telemetry_consent]
    R[Call create_project]
    S[Call cookiecutter]

    A --> B --> C --> D --> E
    E -- Yes --> F
    E -- No --> F
    F --> G --> H --> I --> J --> K
    K -- No --> L
    K -- Yes --> M
    L --> M --> N --> O --> P
    P -- Yes --> Q --> R
    P -- No --> R
    R --> S
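The last steps of the graph can be sketched roughly as below. This is a simplified illustration, not Kedro's actual code: `make_cookiecutter_args_sketch` and `create_project_sketch` are hypothetical names, and the arguments shown are just the `cookiecutter()` parameters we assume are relevant here (`template`, `output_dir`, `no_input`, `extra_context`, `directory`, `checkout`).

```python
from typing import Any, Optional


def make_cookiecutter_args_sketch(
    template_path: str,
    output_dir: str,
    extra_context: dict[str, Any],
    directory: Optional[str] = None,
    checkout: Optional[str] = None,
) -> dict[str, Any]:
    """Collect everything gathered so far into kwargs for cookiecutter()."""
    args: dict[str, Any] = {
        "template": template_path,
        "output_dir": output_dir,
        "no_input": True,  # answers were already collected upstream
        "extra_context": extra_context,
    }
    # Only pass the optional arguments when they were actually provided.
    if directory:
        args["directory"] = directory
    if checkout:
        args["checkout"] = checkout
    return args


def create_project_sketch(cookiecutter_args: dict[str, Any]) -> str:
    """Hand the assembled arguments over to Cookiecutter."""
    # Deferred import: this is the only place Cookiecutter is needed.
    from cookiecutter.main import cookiecutter

    return cookiecutter(**cookiecutter_args)


example_args = make_cookiecutter_args_sketch(
    template_path="path/to/template",
    output_dir=".",
    extra_context={"project_name": "demo"},
)
print(sorted(example_args))
```

The sketch makes the coupling visible: every earlier step exists to fill in this one dictionary, so replacing Cookiecutter means replacing whatever consumes it.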


Current ideas for solutions

  • Move Cookiecutter out of the Kedro core and make it an optional install, with something like pip install kedro[new] (https://github.com/kedro-org/kedro/issues/3884#issuecomment-2175031422)
  • Separate Kedro into two packages, e.g. kedro and kedro-core, letting the user choose which one fits their needs. (Also https://github.com/kedro-org/kedro/issues/3884#issuecomment-2175031422)
  • Partially or completely refactor the project/pipeline creation flows (a lot of work!).
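If we went with the optional-install idea, commands that need Cookiecutter would have to fail gracefully when it is missing. A minimal sketch of such a guard follows. `require_optional` is a hypothetical helper, and the `new` extra name is only the suggestion from the comment linked above, not an extra that exists today.

```python
import importlib.util


def require_optional(module_name: str, extra: str) -> None:
    """Raise a helpful error if an optional dependency is not installed."""
    if importlib.util.find_spec(module_name) is None:
        raise ModuleNotFoundError(
            f"'{module_name}' is required for this command. "
            f"Install it with: pip install 'kedro[{extra}]'"
        )


# A command such as `kedro new` could call this before doing any work:
try:
    require_optional("cookiecutter", extra="new")
    print("cookiecutter is available")
except ModuleNotFoundError as err:
    print(err)
```

The trade-off is that `kedro new` stops working out of the box on a minimal install, which ties directly into the user experience question below.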

Further questions to discuss

  • Do we have concrete evidence that a significant share of our user base thinks Kedro is heavyweight/cumbersome? Enough to justify a refactor or splitting it into packages?
  • What defines a "heavyweight" framework? What are the criteria we are using for that?
  • What do we consider the core features of Kedro?
  • What could be used to replace Cookiecutter, in case we decide to do that?
  • How would a possible split into two packages, or having one install option with extra dependencies, affect our user experience?

lrcouto avatar Jun 25 '24 16:06 lrcouto