kedro kedro run CLI incorrectly splits the names of nodes at commas

Description

Running kedro with the command kedro run --from-nodes "two_inputs([A0,B0]) -> [C1]" will cause an error: Pipeline does not contain nodes named ['B0]) -> [C1]', 'two_inputs([A0'].

Kedro has incorrectly split the name of the intended Node into the names of two Nodes that do not exist.

Context

This bug prevents the user from re-running kedro from any Node with a comma in its name. This is the case for any Node that is not explicitly given a name and has more than one input.

Steps to Reproduce

Create a kedro project
Attempt to run the pipeline from a node with a comma in its name (kedro run --from-nodes 'My node, a good node')
Receive the error Pipeline does not contain nodes named ['a good node', 'My node'].

Expected Result

Kedro should try to run from the node with the specified name.

Actual Result

Kedro attempts to run from multiple nodes, with names drawn from the result of splitting the intended Node name at commas.

Sep 05 '22 12:09 jmholzer

Linking this issue to #1795. This happens when node name is not specified and Kedro will generate a name automatically.

Sep 05 '22 12:09 noklam

As discussed in backlog grooming, this is a very long-standing issue bug and it would be nice to get it fixed. I've looked through our old Jira board to get the previous discussion for context too.

Possible solutions here:

some syntax where you can escape a comma in a node name with a backslash so it doesn't get split by _split_string
alter node name syntax to use ; rather than , - quick but hacky fix. This is a breaking change though (won't affect anyone using --from-nodes etc. arguments because those don't work, but would affect someone using an automatically-generated node name in --node though)
@jmholzer idea of a fine-state machine csv-parsing type solution to do the string splitting rather than just blindly splitting on ,. This would basically mean that , inside () would basically not split the string. It wouldn't solve the case of a comma explicitly being used inside a node name, but it would solve the automatically-named node case, which is way more important
change the resume scenario suggestion to use from-inputs instead. This should completely circumvent problems around commas (unless you choose to use a comma in your catalog entry name... but no one does that)
change --from-nodes to be a multiple=True argument in click, just like --node is. This seems pretty inconsistent currently actually: you do --node node_1 --node node_2 but --from-nodes node_1, node_2. Why not --from-nodes node_1 --from-nodes node_2? At a glance this reads a bit weird to me though 🤔

Overall I like (in order): option 5/3, 4, 1, 2.

We don't need to solve this part at the same time, but just for completeness, let me note here part of the previous discussion that questioned how kedro automatically generates a name for a node. There was research piece on this done before but I'm not sure what the outcome was - we'd need @yetudada or @idanov for this I think. Some options I vaguely recall though:

hash the node somehow to generate a label
randomly generated words used to name nodes - like the above but more human-readable

Side-note: personally I'm not a big fan of exactly how the current automatic node name is formatted. What do the square brackets add? Why is there no space after the comma? I think this is a historical hangover from the days when kedro inputs had to be specified as a list.

If we decide we want to do one of these then fixing the case with commas will likely become irrelevant, because an automatically generated node name would no longer have a comma in it. Possibly we should still consider doing option 5 above though for consistency.

Sep 05 '22 16:09 antonymilne

This was discussed in Technical Design

The team agreed that we should fix the way commas are split and also alter the way we generate the automatic node name for 0.19.0.
@ankatiyar checked that nodes can't have commas in the name so we don't have to worry about the use case where users create nodes with commas.
We also discussed the idea of using --from-inputs instead of --from-nodes for the re-run suggestion but as @jmholzer pointed out the re-run logic is very complex, so while this is a good idea it would be a lot of work.

The actions to be done after this dicussion:

[x] Non breaking: fix how commas are split (point 3 above) https://github.com/kedro-org/kedro/issues/2012
[x] Breaking: change how node names are generated to be name + output where the outputs are separated by _ or - instead of , https://github.com/kedro-org/kedro/issues/2013
[ ] Investigate CLI consistencies (point 5 above) https://github.com/kedro-org/kedro/issues/2014

Nov 09 '22 16:11 merelcht

Closing this issue in favour of the follow up actions from Tech Design.

Nov 09 '22 17:11 merelcht

kedro kedro copied to clipboard

kedro run CLI incorrectly splits the names of nodes at commas

Description

Context

Steps to Reproduce

Expected Result

Actual Result

kedro
kedro copied to clipboard