kedro-plugins
kedro-plugins copied to clipboard
fix(airflow): Fix nodes grouping
Description
Fix https://github.com/kedro-org/kedro-plugins/issues/655
Development notes
Checklist
- [ ] Opened this PR as a 'Draft Pull Request' if it is work-in-progress
- [ ] Updated the documentation to reflect the code changes
- [ ] Added a description of this change in the relevant
RELEASE.md
file - [ ] Added tests to cover my changes
@ankatiyar and I have also discussed whether we need to add grouping logic to the e2e test. IMO, grouping logic is currently covered well by unit tests, and we can go without it. Unit tests already include kedro airflow create --group-in-memory
command run, which is kind of the e2e test. But happy to hear others' thoughts.
Left a couple of small comment. Is this tested manually?
The only concern for me is implementing a new
dfs
here. I recalled I have to implement a bfs for theSoftFailRunner
, at the end I end up using somePipeline
API, which actually is doing the search already. I just want to make sure this is considered already, and do we see a need to extend on the Pipeline API side instead.example of bfs: https://github.com/noklam/kedro-softfail-runner/blob/970036ea8d5c969d02ed9150bfd4a2dc4baf967a/kedro_softfail_runner/core.py#L92
Thank you for the comments, I've addressed them.
Not sure I fully understand the concern about using dfs
vs bfs
. We cannot rely just on Pipeline
API since we need to adjust the graph by adding new edges to group connected MemoryDatasets into one node. Thus we need an updated adjacency matrix and make one traversal to find connected components. There's no difference whether it's a dfs
or bfs
. The only issue which might happen with dfs
is if the number of nodes is more than the default size of the recursion stack which is around 1000.
Not sure I fully understand the concern about using dfs vs bfs. We cannot rely just on Pipeline API since we need to adjust the graph by adding new edges to group connected MemoryDatasets into one node. Thus we need an updated adjacency matrix and make one traversal to find connected components. There's no difference whether it's a dfs or bfs. The only issue which might happen with dfs is if the number of nodes is more than the default size of the recursion stack which is around 1000.
@ElenaKhaustova I haven't dived deep into this enough to understand it fully. I will approve this now as I don't want to block the PR as it works perfectly fine.
https://github.com/kedro-org/kedro/discussions/3758 It's another topic but a way to adjust the graph. The API is obviously bad but just want to show that there may be potential to open this up a little bit.
Not sure I fully understand the concern about using dfs vs bfs. We cannot rely just on Pipeline API since we need to adjust the graph by adding new edges to group connected MemoryDatasets into one node. Thus we need an updated adjacency matrix and make one traversal to find connected components. There's no difference whether it's a dfs or bfs. The only issue which might happen with dfs is if the number of nodes is more than the default size of the recursion stack which is around 1000.
@ElenaKhaustova I haven't dived deep into this enough to understand it fully. I will approve this now as I don't want to block the PR as it works perfectly fine.
kedro-org/kedro#3758 It's another topic but a way to adjust the graph. The API is obviously bad but just want to show that there may be potential to open this up a little bit.
To be honest, I don't think the API is bad it's just out of the scope of pipeline
API.