Automatically skip running nodes with persisted outputs

Open jmholzer opened this issue 2 years ago • 11 comments

Description

Re-running nodes which have:

  1. Persisted outputs
  2. No upstream dependencies which would cause their output to change

is an unnecessary expense. It might be a good idea to have a flag which would automatically skip running these nodes.

It is currently possible to achieve this by specifying nodes to run from, though this process is manual and potentially error-prone.
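To make the two conditions above concrete, here is a minimal sketch of the selection logic. It is not Kedro code: the `Node` dataclass and `nodes_to_run` helper are hypothetical stand-ins, the node list is assumed to be topologically sorted, and "persisted" is reduced to a set of dataset names. It also ignores changes to code, parameters, or raw inputs, which a real solution would have to detect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a pipeline node (not Kedro's own Node class)."""

    name: str
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]


def nodes_to_run(nodes: list[Node], persisted: set[str]) -> list[str]:
    """Return the names of nodes that must run; all others could be skipped.

    Assumes `nodes` is topologically sorted. A node must run if any of its
    outputs is not persisted, or if any of its inputs is produced by a node
    that must itself run.
    """
    stale: set[str] = set()  # datasets that will be regenerated in this run
    to_run: list[str] = []
    for node in nodes:
        missing_output = any(out not in persisted for out in node.outputs)
        stale_input = any(inp in stale for inp in node.inputs)
        if missing_output or stale_input:
            to_run.append(node.name)
            stale.update(node.outputs)
    return to_run


pipeline = [
    Node("clean", ("raw",), ("cleaned",)),
    Node("featurize", ("cleaned",), ("features",)),
    Node("train", ("features",), ("model",)),
]

# Only "model" is missing, so only the last node needs to run.
print(nodes_to_run(pipeline, persisted={"raw", "cleaned", "features"}))
```

Note that a missing intermediate output forces every downstream node to run as well, even if their own outputs are already persisted.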

Context

User @pedro-sarpen opened https://github.com/kedro-org/kedro/pull/2005 to address this issue, though there may be a better solution to the problem that we should investigate.

jmholzer avatar Feb 10 '23 16:02 jmholzer

This is definitely something we should have, although I don't have any concrete ideas on the best way to do it off the top of my head. The broader question of "change capture" has been discussed before but I don't think anything was properly decided on. Maybe now would be the right time to re-open those discussions.

antonymilne avatar Feb 14 '23 06:02 antonymilne

I'd like to add that I'd love this feature. Currently, I have to comment out nodes in my pipelines and add their outputs to the inputs of the pipeline. That's really tedious and seems like an anti-pattern.

marcosfelt avatar Mar 02 '23 11:03 marcosfelt

(Our team is working on this and plans to open-source it.)

sbrugman avatar Mar 10 '23 16:03 sbrugman

Linking: https://github.com/kedro-org/kedro/issues/2410

merelcht avatar Mar 27 '23 13:03 merelcht

xref change capture https://github.com/kedro-org/kedro/issues/221

To me, the main difficulty is that doing this requires making assumptions about the node functions, in particular that they're pure, i.e. that they don't have any spurious inputs, like randomness, the current date, and so on. If we assume so, then doing some sort of hashing on the inputs is technically sufficient.

As I said in #221, this would make kedro run no longer stateless.
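A rough sketch of what input hashing could look like, assuming pure node functions and JSON-serialisable inputs. The `fingerprint` and `run_cached` helpers are hypothetical, not part of Kedro. The function is identified by name only, so this sketch does not detect code changes, and the `cache` dictionary is exactly the extra state mentioned above.

```python
import hashlib
import json
from typing import Any, Callable


def fingerprint(func: Callable[..., Any], inputs: dict[str, Any]) -> str:
    """Hash the function's name together with its JSON-serialisable inputs."""
    digest = hashlib.sha256()
    digest.update(func.__qualname__.encode("utf-8"))
    # sort_keys makes the hash independent of the order of the input names
    digest.update(json.dumps(inputs, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


def run_cached(
    func: Callable[..., Any], inputs: dict[str, Any], cache: dict[str, Any]
) -> tuple[Any, bool]:
    """Run `func` unless the same inputs were seen before.

    Returns the result and a flag saying whether it came from the cache.
    """
    key = fingerprint(func, inputs)
    if key in cache:
        return cache[key], True
    result = func(**inputs)
    cache[key] = result
    return result, False


def scale(values: list[int], factor: int) -> list[int]:
    return [v * factor for v in values]


cache: dict[str, Any] = {}
first, hit1 = run_cached(scale, {"values": [1, 2, 3], "factor": 2}, cache)
second, hit2 = run_cached(scale, {"values": [1, 2, 3], "factor": 2}, cache)
third, hit3 = run_cached(scale, {"values": [1, 2, 3], "factor": 3}, cache)
print(hit1, hit2, hit3)
```

The second call is served from the cache because its inputs are identical, while the third runs again because `factor` changed. If `scale` read the clock or a random number, the cached result would silently go stale, which is why purity has to be assumed.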

astrojuanlu avatar Nov 26 '23 22:11 astrojuanlu

Related update: our team has just open-sourced pycodehash and is working on a Kedro runner that can skip cached datasets and nodes.

sbrugman avatar Nov 28 '23 21:11 sbrugman

@sbrugman I was having a look at PyCodeHash, looks superb!

One question: what can we do for cases like these?

import datetime as dt
import polars as pl

def preprocess_data(df: pl.DataFrame) -> pl.DataFrame:
    now = dt.datetime.now()  # hidden input: the result depends on the wall clock
    if now.minute % 2 == 0:
        raise Exception("boom")
    return df.head()

These would effectively be cached, am I right?

astrojuanlu avatar Nov 29 '23 12:11 astrojuanlu

This one is not deterministic. The non-deterministic component (the current time) should be passed in as a parameter/dataset in order for this to work.

(Idempotent pipelines are required)
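As an illustration, the example above could be reshaped so that the clock reading is an explicit input. This is only a sketch: to keep it dependency-free, a plain list of rows stands in for the DataFrame and `rows[:5]` stands in for `df.head()`, and how `now` would be supplied (a parameter or a dataset) is left open.

```python
import datetime as dt


def preprocess_data(rows: list[dict], now: dt.datetime) -> list[dict]:
    """Same logic as the example above, with the time passed in explicitly."""
    if now.minute % 2 == 0:
        raise Exception("boom")
    return rows[:5]  # stand-in for df.head()


rows = [{"id": i} for i in range(10)]
odd_minute = dt.datetime(2023, 11, 29, 12, 1)

# Same inputs, same result, every time: safe to cache on (rows, now).
result = preprocess_data(rows, odd_minute)
print(result == rows[:5])
```

With `now` as an input, the behaviour is fully determined by the arguments, so a cache keyed on the inputs stays correct.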

sbrugman avatar Nov 29 '23 17:11 sbrugman

Closed #221 as a duplicate of this one. The former is older and has some extra context.

astrojuanlu avatar Feb 04 '24 19:02 astrojuanlu

Previously: #30, #25, #82.

astrojuanlu avatar Feb 04 '24 19:02 astrojuanlu

After showing Kedro to a data scientist, this was the first thing they asked. They were familiar with DVC.

astrojuanlu avatar Jul 23 '24 08:07 astrojuanlu