kedro Implement autodiscovery of project pipelines :mag:

Description

Closes #1664

Development notes

Checklist

[ ] Read the contributing guidelines
[x] Opened this PR as a 'Draft Pull Request' if it is work-in-progress
[ ] Updated the documentation to reflect the code changes
[ ] Added a description of this change in the RELEASE.md file
[x] Added tests to cover my changes

Jul 13 '22 08:07 deepyaman

@AntonyMilneQB Would love your opinion on something in particular. Currently, the Behave tests are failing, because one checks for a literal empty pipeline in text:

    And the pipeline should contain no nodes                          # features/steps/cli_steps.py:431
      Traceback (most recent call last):
        File "C:\tools\miniconda3\envs\kedro_builder\lib\site-packages\behave\model.py", line 1329, in run
          match.run(runner.context)
        File "C:\tools\miniconda3\envs\kedro_builder\lib\site-packages\behave\matchers.py", line 98, in run
          self.func(context, *args, **kwargs)
        File "features\steps\cli_steps.py", line 442, in check_empty_pipeline_exists
          assert '"__default__": pipeline([])' in pipeline_file.read_text("utf-8")
      AssertionError

I think there's two options here:

Actually try running the pipeline, and get the expected error message associated with running a pipeline with no nodes.
Instead of testing that there's an empty pipeline defined, just look for the pipeline summing code in the template.

I'm inclined to think 1 is better, but want to make sure I'm understanding the purpose of this test properly.

For my reference and for anybody else who's interested, I've managed to write this in a way that only works(-ish?) on Python 3.10. In order to resolve this:

Add __init__.py file under pipelines directory in test fixture. importlib.resources.contents seems to pick up the subfolders without this only in Python 3.10+.
Adding the __init__.py as is fails, because __pycache__ files get found and loaded. I think we need to use files instead of contents. However, since files is only introduced and recommended since Python 3.9, we need to install importlib-resources>=1.3.

Need to take care of some other things, but I'll get back to this on the weekend if I can.

Jul 14 '22 13:07 deepyaman

I think there's two options here:

Actually try running the pipeline, and get the expected error message associated with running a pipeline with no nodes.

Instead of testing that there's an empty pipeline defined, just look for the pipeline summing code in the template.

I'm inclined to think 1 is better, but want to make sure I'm understanding the purpose of this test properly.

Very good question. Normally I would agree with you and am also not a fan of explicitly trying to match the string in the pipeline_registry.py file. Here I'm not so sure though.

We already have tests in run.feature that do as you describe here:

  Scenario: Run default python entry point without example code
    Given I have prepared a config file
    And I have run a non-interactive kedro new without starter
    When I execute the kedro command "run"
    Then I should get an error exit code
    And I should get an error message including "Pipeline contains no nodes"

Hence I think the tests in new_project.feature should be interpreted as just checking the project template that was created, and so the string matching is correct. But you could reasonably argue that this is not necessary given the above run.feature tests. Note that the test_plugin_starter in that file is new (though again maybe we don't need to do the string matching).

The general pattern currently seems to be that every kedro xyz command has its own xyz.feature file that tests specifically the behaviour of that command. According to that, it makes sense to have run.feature tests that actually do the kedro run and look at the output and new_project.feature tests that do kedro new and look at the template created.

But I don't mind changing how new_project.feature works or even removing those tests altogether if they don't seem to add anything - up to you really. @noklam looked at some of these e2e tests recently so might have an opinion on it.

Jul 19 '22 09:07 antonymilne

Generally agree with @AntonyMilneQB about the pattern - one file per kedro xxx command, so run.feature should do a kedro run and catch the expected output, and new.feature should be checking on the files that are created.

In that spirit, if we are not changing this structure, I tend to favor just checking the pipeline object as it feels more consistent. Similar to unit test that should really just test 1 thing at a time, I would prefer we are just testing the behavior of one kedro xxx at a time here.

Jul 19 '22 12:07 noklam

Also, if we do maintain (or adopt) the pattern that kedro xyz corresponds to xyz.feature then let's rename the file new.feature (and similarly for any other misnamed files).

Jul 19 '22 14:07 antonymilne

Also, if we do maintain (or adopt) the pattern that kedro xyz corresponds to xyz.feature then let's rename the file new.feature (and similarly for any other misnamed files).

Renamed that one. ✅

Jul 20 '22 22:07 deepyaman

But I don't mind changing how new_project.feature works or even removing those tests altogether if they don't seem to add anything - up to you really. @noklam looked at some of these e2e tests recently so might have an opinion on it.

I removed it, because the test is really validating the difference between a pipeline defined with nodes and one without, but the written code is equivalent if we use autodiscovery (and thus there's nothing to distinguish). As you mentioned, the actual execution with empty pipeline is already tested, and I've updated the pipeline accordingly.

Jul 20 '22 22:07 deepyaman

P.S. @deepyaman do you still need to fork kedro to make a PR?

Aug 02 '22 15:08 antonymilne

P.S. @deepyaman do you still need to fork kedro to make a PR?

Nope! And I also see I don't need to explicitly at the DCO comment, if it's on the Kedro repo.

Aug 04 '22 10:08 deepyaman

Very good point @noklam, thanks for the testing! I had thought of this case before but then forgotten all about it... Ideally I think a very simple project like pandas-iris that doesn't have multiple pipelines but just a single pipeline.py file would also "just work" with find_packages out of the box. i.e. correct behaviour here would be:

check if there's a pipeline.py file that exposes create_pipeline; if yes then use it. Just define __default__, no need for any other pipeline names
otherwise look for pipelines directory and register them as done now

However, since this single file (compared to multiple directories) is a bit of an unusual/special case I don't mind too much if we just don't support it. Or we can come back and add support for it later. it does seem like a good case to support though since it's the "simplified kedro project structure" that I think we should still support.

Sorry I forgot to put this in the ticket! 🤦

Aug 08 '22 10:08 antonymilne

Thanks all for reviewing! I think this is more-or-less ready to go, with 3 open comments/confirmations. Now off to writing the associated docs PR!

Aug 14 '22 12:08 deepyaman

Quick question, does kedro new (with no starters) generate a project with find_pipelines?

Aug 26 '22 11:08 noklam

Quick question, does kedro new (with no starters) generate a project with find_pipelines?

Yes.

Aug 26 '22 12:08 antonymilne