kedro-plugins icon indicating copy to clipboard operation
kedro-plugins copied to clipboard

fix(airflow): Fix nodes grouping

Open ElenaKhaustova opened this issue 9 months ago • 1 comments

Description

Fix https://github.com/kedro-org/kedro-plugins/issues/655

Development notes

Checklist

  • [ ] Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • [ ] Updated the documentation to reflect the code changes
  • [ ] Added a description of this change in the relevant RELEASE.md file
  • [ ] Added tests to cover my changes

ElenaKhaustova avatar May 03 '24 13:05 ElenaKhaustova

@ankatiyar and I have also discussed whether we need to add grouping logic to the e2e test. IMO, grouping logic is currently covered well by unit tests, and we can go without it. Unit tests already include kedro airflow create --group-in-memory command run, which is kind of the e2e test. But happy to hear others' thoughts.

ElenaKhaustova avatar May 03 '24 20:05 ElenaKhaustova

Left a couple of small comment. Is this tested manually?

The only concern for me is implementing a new dfs here. I recalled I have to implement a bfs for the SoftFailRunner, at the end I end up using some Pipeline API, which actually is doing the search already. I just want to make sure this is considered already, and do we see a need to extend on the Pipeline API side instead.

example of bfs: https://github.com/noklam/kedro-softfail-runner/blob/970036ea8d5c969d02ed9150bfd4a2dc4baf967a/kedro_softfail_runner/core.py#L92

Thank you for the comments, I've addressed them.

Not sure I fully understand the concern about using dfs vs bfs. We cannot rely just on Pipeline API since we need to adjust the graph by adding new edges to group connected MemoryDatasets into one node. Thus we need an updated adjacency matrix and make one traversal to find connected components. There's no difference whether it's a dfs or bfs. The only issue which might happen with dfs is if the number of nodes is more than the default size of the recursion stack which is around 1000.

ElenaKhaustova avatar May 09 '24 14:05 ElenaKhaustova

Not sure I fully understand the concern about using dfs vs bfs. We cannot rely just on Pipeline API since we need to adjust the graph by adding new edges to group connected MemoryDatasets into one node. Thus we need an updated adjacency matrix and make one traversal to find connected components. There's no difference whether it's a dfs or bfs. The only issue which might happen with dfs is if the number of nodes is more than the default size of the recursion stack which is around 1000.

@ElenaKhaustova I haven't dived deep into this enough to understand it fully. I will approve this now as I don't want to block the PR as it works perfectly fine.

https://github.com/kedro-org/kedro/discussions/3758 It's another topic but a way to adjust the graph. The API is obviously bad but just want to show that there may be potential to open this up a little bit.

noklam avatar May 10 '24 11:05 noklam

Not sure I fully understand the concern about using dfs vs bfs. We cannot rely just on Pipeline API since we need to adjust the graph by adding new edges to group connected MemoryDatasets into one node. Thus we need an updated adjacency matrix and make one traversal to find connected components. There's no difference whether it's a dfs or bfs. The only issue which might happen with dfs is if the number of nodes is more than the default size of the recursion stack which is around 1000.

@ElenaKhaustova I haven't dived deep into this enough to understand it fully. I will approve this now as I don't want to block the PR as it works perfectly fine.

kedro-org/kedro#3758 It's another topic but a way to adjust the graph. The API is obviously bad but just want to show that there may be potential to open this up a little bit.

To be honest, I don't think the API is bad it's just out of the scope of pipeline API.

ElenaKhaustova avatar May 10 '24 13:05 ElenaKhaustova