Implement autodiscovery of project pipelines :mag:
Description
Closes #1664
Development notes
Checklist
- [ ] Read the contributing guidelines
- [x] Opened this PR as a 'Draft Pull Request' if it is work-in-progress
- [ ] Updated the documentation to reflect the code changes
- [ ] Added a description of this change in the `RELEASE.md` file
- [x] Added tests to cover my changes
@AntonyMilneQB Would love your opinion on something in particular. Currently, the Behave tests are failing, because one checks for a literal empty pipeline in text:
```
And the pipeline should contain no nodes # features/steps/cli_steps.py:431
Traceback (most recent call last):
  File "C:\tools\miniconda3\envs\kedro_builder\lib\site-packages\behave\model.py", line 1329, in run
    match.run(runner.context)
  File "C:\tools\miniconda3\envs\kedro_builder\lib\site-packages\behave\matchers.py", line 98, in run
    self.func(context, *args, **kwargs)
  File "features\steps\cli_steps.py", line 442, in check_empty_pipeline_exists
    assert '"__default__": pipeline([])' in pipeline_file.read_text("utf-8")
AssertionError
```
I think there are two options here:
- Actually try running the pipeline, and get the expected error message associated with running a pipeline with no nodes.
- Instead of testing that there's an empty pipeline defined, just look for the pipeline summing code in the template.
I'm inclined to think 1 is better, but want to make sure I'm understanding the purpose of this test properly.
For my reference and for anybody else who's interested, I've managed to write this in a way that only works(-ish?) on Python 3.10. In order to resolve this:
- Add an `__init__.py` file under the `pipelines` directory in the test fixture. `importlib.resources.contents` seems to pick up the subfolders without this only in Python 3.10+.
- Adding the `__init__.py` as is fails, because `__pycache__` files get found and loaded. I think we need to use `files` instead of `contents`. However, since `files` is only introduced and recommended since Python 3.9, we need to install `importlib-resources>=1.3`.
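To illustrate the `files`-vs-`contents` point, here is a minimal sketch, assuming a conventional `<package>/pipelines/` layout; the function name is illustrative and this is not Kedro's actual implementation:

```python
# Illustrative sketch (not Kedro's actual code) of discovering pipeline
# subpackages with importlib.resources.files, available on Python 3.9+
# (or via the importlib-resources backport on older versions).
from importlib.resources import files

def discover_pipeline_names(package: str) -> list:
    """Return names of subpackages under ``<package>.pipelines``."""
    names = []
    for entry in files(f"{package}.pipelines").iterdir():
        # Keep only real subpackages: directories containing an __init__.py.
        # This explicitly filters out __pycache__, which contents() surfaces
        # once the pipelines directory has its own __init__.py.
        if entry.is_dir() and (entry / "__init__.py").is_file():
            names.append(entry.name)
    return sorted(names)
```

Unlike `contents()`, `files()` returns `Traversable` objects, so the `__pycache__` filtering can be done with ordinary path checks.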
Need to take care of some other things, but I'll get back to this on the weekend if I can.
> I think there are two options here:
> - Actually try running the pipeline, and get the expected error message associated with running a pipeline with no nodes.
> - Instead of testing that there's an empty pipeline defined, just look for the pipeline summing code in the template.
>
> I'm inclined to think 1 is better, but want to make sure I'm understanding the purpose of this test properly.
Very good question. Normally I would agree with you and am also not a fan of explicitly trying to match the string in the pipeline_registry.py file. Here I'm not so sure though.
We already have tests in `run.feature` that do as you describe here:
```gherkin
Scenario: Run default python entry point without example code
  Given I have prepared a config file
  And I have run a non-interactive kedro new without starter
  When I execute the kedro command "run"
  Then I should get an error exit code
  And I should get an error message including "Pipeline contains no nodes"
```
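As a rough sketch of what the final two steps assert, running the command and checking both the exit code and the error text might look like the helper below. The helper name and structure are hypothetical, not the actual `cli_steps.py` code:

```python
# Hedged sketch of the assertion logic behind the last two Gherkin steps
# above; the helper is illustrative, not Kedro's actual step implementation.
import subprocess

def assert_fails_with_message(cmd, expected_message):
    """Run ``cmd``; assert a non-zero exit code and the expected error text."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    assert result.returncode != 0, "expected an error exit code"
    assert expected_message in result.stdout + result.stderr, (
        f"{expected_message!r} not found in command output"
    )
```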
Hence I think the tests in `new_project.feature` should be interpreted as just checking the project template that was created, and so the string matching is correct. But you could reasonably argue that this is not necessary given the above `run.feature` tests. Note that the `test_plugin_starter` in that file is new (though again maybe we don't need to do the string matching).
The general pattern currently seems to be that every `kedro xyz` command has its own `xyz.feature` file that tests specifically the behaviour of that command. According to that, it makes sense to have `run.feature` tests that actually do the `kedro run` and look at the output and `new_project.feature` tests that do `kedro new` and look at the template created.
But I don't mind changing how new_project.feature works or even removing those tests altogether if they don't seem to add anything - up to you really. @noklam looked at some of these e2e tests recently so might have an opinion on it.
Generally agree with @AntonyMilneQB about the pattern - one file per `kedro xxx` command, so `run.feature` should do a `kedro run` and catch the expected output, and `new.feature` should be checking on the files that are created.
In that spirit, if we are not changing this structure, I tend to favor just checking the pipeline object, as it feels more consistent. Similar to a unit test that should really just test one thing at a time, I would prefer we are just testing the behavior of one `kedro xxx` command at a time here.
Also, if we do maintain (or adopt) the pattern that kedro xyz corresponds to xyz.feature then let's rename the file new.feature (and similarly for any other misnamed files).
> Also, if we do maintain (or adopt) the pattern that `kedro xyz` corresponds to `xyz.feature` then let's rename the file `new.feature` (and similarly for any other misnamed files).
Renamed that one. ✅
> But I don't mind changing how `new_project.feature` works or even removing those tests altogether if they don't seem to add anything - up to you really. @noklam looked at some of these e2e tests recently so might have an opinion on it.
I removed it, because the test is really validating the difference between a pipeline defined with nodes and one without, but the written code is equivalent if we use autodiscovery (and thus there's nothing to distinguish). As you mentioned, the actual execution with empty pipeline is already tested, and I've updated the pipeline accordingly.
P.S. @deepyaman do you still need to fork kedro to make a PR?
> P.S. @deepyaman do you still need to fork kedro to make a PR?
Nope! And I also see I don't need to explicitly add the DCO comment, if it's on the Kedro repo.
Very good point @noklam, thanks for the testing! I had thought of this case before but then forgotten all about it... Ideally I think a very simple project like `pandas-iris` that doesn't have multiple pipelines but just a single `pipeline.py` file would also "just work" with `find_packages` out of the box, i.e. correct behaviour here would be:
- Check if there's a `pipeline.py` file that exposes `create_pipeline`; if yes then use it. Just define `__default__`, no need for any other pipeline names.
- Otherwise look for the `pipelines` directory and register them as done now.
However, since this single file (compared to multiple directories) is a bit of an unusual/special case, I don't mind too much if we just don't support it. Or we can come back and add support for it later. It does seem like a good case to support though, since it's the "simplified kedro project structure" that I think we should still support.
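That resolution order could be sketched roughly as below. This is a hedged illustration, not the real `find_pipelines` implementation: the function name is made up, and plain lists stand in for Kedro `Pipeline` objects so the "summing" step is visible:

```python
# Illustrative sketch of the proposed resolution order, NOT the real
# find_pipelines implementation; plain lists stand in for Pipeline objects.
import importlib
import pkgutil

def register_pipelines_for(package: str) -> dict:
    # 1. Simplified layout: <package>/pipeline.py exposing create_pipeline().
    try:
        mod = importlib.import_module(f"{package}.pipeline")
    except ModuleNotFoundError:
        mod = None
    if mod is not None and hasattr(mod, "create_pipeline"):
        # Single-pipeline project: only __default__ is defined.
        return {"__default__": mod.create_pipeline()}

    # 2. Standard layout: one subpackage per pipeline under <package>/pipelines/.
    pipelines = {}
    pipelines_pkg = importlib.import_module(f"{package}.pipelines")
    for info in pkgutil.iter_modules(pipelines_pkg.__path__):
        sub = importlib.import_module(f"{package}.pipelines.{info.name}")
        if hasattr(sub, "create_pipeline"):
            pipelines[info.name] = sub.create_pipeline()
    # "Sum" the discovered pipelines into __default__ (lists here; in Kedro
    # this would be summing Pipeline objects).
    pipelines["__default__"] = sum(pipelines.values(), [])
    return pipelines
```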
Sorry I forgot to put this in the ticket! 🤦
Thanks all for reviewing! I think this is more-or-less ready to go, with 3 open comments/confirmations. Now off to writing the associated docs PR!
Quick question, does `kedro new` (with no starters) generate a project with `find_pipelines`?
> Quick question, does `kedro new` (with no starters) generate a project with `find_pipelines`?
Yes.