kedro
kedro copied to clipboard
kedro run CLI incorrectly splits the names of nodes at commas
Description
Running kedro with the command kedro run --from-nodes "two_inputs([A0,B0]) -> [C1]"
will cause an error: Pipeline does not contain nodes named ['B0]) -> [C1]', 'two_inputs([A0'].
Kedro has incorrectly split the name of the intended Node
into the names of two Node
s that do not exist.
Context
This bug prevents the user from re-running kedro from any Node
with a comma in its name. This is the case for any Node
that is not explicitly given a name and has more than one input.
Steps to Reproduce
- Create a kedro project
- Attempt to run the pipeline from a node with a comma in its name (
kedro run --from-nodes 'My node, a good node'
) - Receive the error
Pipeline does not contain nodes named ['a good node', 'My node'].
Expected Result
Kedro should try to run from the node with the specified name.
Actual Result
Kedro attempts to run from multiple nodes, with names drawn from the result of splitting the intended Node
name at commas.
Linking this issue to #1795. This happens when node name
is not specified and Kedro will generate a name automatically.
As discussed in backlog grooming, this is a very long-standing issue bug and it would be nice to get it fixed. I've looked through our old Jira board to get the previous discussion for context too.
Possible solutions here:
- some syntax where you can escape a comma in a node name with a backslash so it doesn't get split by
_split_string
- alter node name syntax to use
;
rather than,
- quick but hacky fix. This is a breaking change though (won't affect anyone using--from-nodes
etc. arguments because those don't work, but would affect someone using an automatically-generated node name in--node
though) - @jmholzer idea of a fine-state machine csv-parsing type solution to do the string splitting rather than just blindly splitting on
,
. This would basically mean that,
inside()
would basically not split the string. It wouldn't solve the case of a comma explicitly being used inside a node name, but it would solve the automatically-named node case, which is way more important - change the resume scenario suggestion to use
from-inputs
instead. This should completely circumvent problems around commas (unless you choose to use a comma in your catalog entry name... but no one does that) - change
--from-nodes
to be amultiple=True
argument in click, just like--node
is. This seems pretty inconsistent currently actually: you do--node node_1 --node node_2
but--from-nodes node_1, node_2
. Why not--from-nodes node_1 --from-nodes node_2
? At a glance this reads a bit weird to me though 🤔
Overall I like (in order): option 5/3, 4, 1, 2.
We don't need to solve this part at the same time, but just for completeness, let me note here part of the previous discussion that questioned how kedro automatically generates a name for a node. There was research piece on this done before but I'm not sure what the outcome was - we'd need @yetudada or @idanov for this I think. Some options I vaguely recall though:
- hash the node somehow to generate a label
- randomly generated words used to name nodes - like the above but more human-readable
Side-note: personally I'm not a big fan of exactly how the current automatic node name is formatted. What do the square brackets add? Why is there no space after the comma? I think this is a historical hangover from the days when kedro inputs had to be specified as a list.
If we decide we want to do one of these then fixing the case with commas will likely become irrelevant, because an automatically generated node name would no longer have a comma in it. Possibly we should still consider doing option 5 above though for consistency.
This was discussed in Technical Design
- The team agreed that we should fix the way commas are split and also alter the way we generate the automatic node name for
0.19.0
. - @ankatiyar checked that nodes can't have commas in the name so we don't have to worry about the use case where users create nodes with commas.
- We also discussed the idea of using
--from-inputs
instead of--from-nodes
for the re-run suggestion but as @jmholzer pointed out the re-run logic is very complex, so while this is a good idea it would be a lot of work.
The actions to be done after this dicussion:
- [x] Non breaking: fix how commas are split (point 3 above) https://github.com/kedro-org/kedro/issues/2012
- [x] Breaking: change how node names are generated to be name + output where the outputs are separated by
_
or-
instead of,
https://github.com/kedro-org/kedro/issues/2013 - [ ] Investigate CLI consistencies (point 5 above) https://github.com/kedro-org/kedro/issues/2014
Closing this issue in favour of the follow up actions from Tech Design.