Support callback dependencies
Supporting 'callback' dependencies and outputs (for lack of a better term) would enable a number of interesting possibilities for using DVC to control processes that store data outside of DVC's file access, without requiring an explosion of remote access possibilities.
This would generalize what is possible with HTTP outputs and dependencies.
An example of what this could look like in a DVC file:
```yaml
deps:
- cmd: python step-status.py --my-step
  md5: <checksum>
```
Instead of consulting a file, as with `path:`, DVC would run the specified command (which should be relatively quick) and compute the MD5 hash of its output. That command could do whatever it needs to in order to get the data status.
My specific use case is using DVC to control a large data import process that processes raw data files, loads them into PostgreSQL, and performs some additional computations involving intermediate files. I would implement a script that extracts data status from the PostgreSQL database so that DVC can check whether a step is up-to-date with the actual data currently in the database. I could implement this with HTTP dependencies and something like PostgREST, but that would introduce additional infrastructure requirements for people using the code (namely, a running PostgREST server).
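For concreteness, a status script along these lines might look like the sketch below. This is only a sketch: it assumes psycopg2 and an illustrative `datasets` table with an `updated_at` column; all names and connection details are placeholders, not part of the proposal.

```python
# step-status.py -- hypothetical sketch of a status script.
# It prints a stable summary of the database state to stdout;
# DVC would hash that output to decide whether the step is up to date.
import sys

import psycopg2


def main(step: str) -> None:
    # Placeholder connection string; real settings would come from config.
    conn = psycopg2.connect("dbname=mydb")
    with conn, conn.cursor() as cur:
        # Row count plus the latest modification time form a stable
        # fingerprint: identical output means the data has not changed.
        cur.execute("SELECT count(*), max(updated_at) FROM datasets")
        count, updated = cur.fetchone()
    print(f"{step}: rows={count} last_updated={updated}")


if __name__ == "__main__":
    main(sys.argv[1])
```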
This is probably one particular instantiation of #1577.
Hi @mdekstrand ! We currently have so-called "callback stages" that are run every time. With those, you could do something like

```
dvc run -o marker python step-status.py --my-step
dvc run -d marker -d ... -o ... mycommand
```

This way, dvc would always run `marker.dvc`, and if it changes the `marker` file, it will trigger reproductions down the pipeline. Would something like this work for you?
Almost, but not quite. The problem is forcing ordering between generating the marker file and whatever step came before to populate the database with the content that `step-status` checks. As soon as I make the callback stage depend on another to force ordering, it ceases to be a callback stage. For example:

```
dvc run -o step1-out.txt -d ... python step1.py
dvc run -o step1-check.txt -d step1-out.txt python step-status.py --my-step
dvc run -d step1-check.txt -d step1
```

Without this ordering, running a `repro` might run the `step-status` command before the data it requires is ready.
@mdekstrand Thank you for the explanation! Indeed, our current callback stage is not very useful in that scenario. How about an explicit flag, something like `always_reproduce: true`, that would make an arbitrary stage always run on repro? Would that be suitable in your scenario? I'm asking because we actually have plans to introduce it https://github.com/iterative/dvc/issues/1407 instead of the current fragile `no-deps` assumption.
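For illustration, a stage file with that flag might look something like this (just a sketch of the proposed syntax, nothing that is implemented yet):

```yaml
cmd: python step-status.py --my-step
always_reproduce: true    # proposed: run this stage on every repro
outs:
- path: status.txt
```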
As to your idea with

```yaml
deps:
- cmd: python step-status.py --my-step
  md5: <checksum>
```

it seems like we don't really need the md5 of the output; we could simply use the return code of that command as an indicator. E.g. <0 - error, 0 - didn't change, >0 - changed.
The `always_reproduce: true` solution would make the desired functionality possible, I believe, but the dependency graph would be cluttered with status steps.

I don't think I like the return code option, though, because then the status script must know how to compare the current state against the previous state. If DVC checksums the script's output, then all the script needs to do is emit a stable summary or description of the current state, and DVC's existing logic can take care of determining whether that represents a change.
@mdekstrand great point! In that case, what if we make the command return the checksum itself through stdout, instead of us computing the md5 of its output? That has the potential of being used not only for dependencies but also for outputs, as a cheap alternative-checksum plugin. There are a lot of things to consider with it, though.
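For example (only a sketch; `some_table` is illustrative), the command could compute the checksum right in the database and print it:

```
$ psql -tA -c "SELECT md5(string_agg(id::text, ',' ORDER BY id)) FROM some_table"
3858f62230ac3c915f300c664312c63f
```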
Just a temporary workaround that comes to my mind: to make an intermediate stage effectively a "callback" stage, we can make it depend (along with other things, like the DB upload pipeline) on an artificial callback stage that, for example, just dumps the current timestamp. We can even reuse this dummy callback stage everywhere to make any number of stages always reproducible.
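Roughly like this (a sketch; file and script names are arbitrary):

```
# an artificial callback stage (no deps), so it runs on every repro
dvc run -o tick.txt 'date +%s > tick.txt'
# depending on tick.txt makes this stage effectively always reproduced,
# while step1-out.txt still enforces the ordering
dvc run -d tick.txt -d step1-out.txt -o step1-check.txt \
    'python step-status.py --my-step > step1-check.txt'
```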
I love the idea of having cmd dependencies. It's simple to implement and solves a very good use case in a neat way.
@dmpetrov @Suor @pared @mroutis your thoughts?
Seems like a good feature request. I don't see too many problems with the implementation at first sight. Probably some graceful exception handling will be required (when the status check is performed on non-existent data). Also, I think some kind of communication with the user might be a good idea, like `Your executable dependency returned: X, do you want to proceed?`, in order to avoid a situation where the check returns some error and we assume it was the desired output.
I am in favor of `cmd` dependencies; this looks generic and could be used as simply as

```
cmd: psql -U user -c 'select id from some_table order by id desc limit 1'
```

and many others like it.
We need to come up with a command line interface, though. How should this look in `dvc run`?

How about `dvc run --dep-cmd="psql -U ..." ...`?
> How about `dvc run --dep-cmd="psql -U ..." ...`?
@Suor , the only problem I'm seeing with this one is when using multiple dependencies:

```
dvc run -d script.py -d database_dump.csv --dep-cmd="psql -U ..."
```

How would you know which `cmd` corresponds to which dependency?
A cmd is a separate dependency; it doesn't have a path and doesn't correspond to anything.
@Suor I'm actually not sure about that. We need the path to build the DAG properly. So what we need is something like

```yaml
cmd: mycmd
md5: mymd5
path: path/to/dep
```
But there is no path; the DAG should not include the cmd dep, or this should be special-cased somehow.
Shouldn't command dependency scripts be handled by SCM?
@Suor My thinking was that `mycmd` would analyze `path/to/dep` inside, just as an alternative to our md5-based checksums. Not sure `mycmd` without a dep path is any good. @mdekstrand What are your thoughts?
@pared They should. I didn't mean that `path/to/dep` should be a path to the script that we are running, but rather to the dep that it is analyzing.
@efiop path has no meaning here, so it should be neither in the stage file nor in the command line UI. I don't understand what "analyze `path/to/dep` inside" even means; there is no path, it could be a call to a database, some API request, whatever.
@Suor The command mentioned by the OP is analyzing a db, so this stage should depend on it; the command is only helping us to judge whether that dependency has changed or not.
@efiop tbh, I also don't understand where the path comes from. Can you give an example of when we would need it, and what file that "command dependency" would depend on?
@shcheklein In this use case, you could, for example, have a `has_changed` script that, as a command dependency, would be called like `--cmd-dep="python has_changed.py some_file"`. So `some_file` would be a dependency of a dependency.

I guess we can say that we are already doing that; our current `-d dependency` is something like `--cmd-dep="md5sum dependency"`.
@pared thanks for the example! (Btw, I would say it depends on `has_changed.py` the same way as on `some_file` in this specific case.)

I still don't quite understand what exactly you are suggesting, though. Could you please describe the logic, CLI, and high-level implementation you have in mind? Like: we run `dvc run so-and-so`, and it does this-and-that and generates a DVC-file with these fields.
@pared if you need a dependency of a dependency, then you should create a separate stage for that command and set that script as a dependency.

In general, path has no meaning here; this could easily be a one-liner with a psql/mysql/... CLI, or even `date +%F` to execute a stage no more often than once a day.
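E.g., something like this sketch (where `--dep-cmd` is only the syntax suggested above, and `make_report.py` is hypothetical):

```
# the dep's "checksum" is the current date, so the stage
# reproduces at most once per day
dvc run --dep-cmd 'date +%F' -o report.csv 'python make_report.py'
```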
@Suor you don't even need a separate stage. You can make a second regular file dependency in that same stage, as far as I understand. But maybe I'm still missing something.
In my use case, there is no file for the dependency command to analyze. It will go analyze the contents of the PostgreSQL database, which may be on a completely different machine (or hosted by a cloud service, e.g. Amazon RDS).
However, what we do need is a way to wire the dependencies to outputs, and that isn't clear. If we have one stage:

```yaml
cmd: python import.py big-file.zip
deps:
- path: big-file.zip
- path: import.py
outs:
- cmd: check-big-file-status
```

And another stage:

```yaml
cmd: psql -f process-big-data.sql
deps:
- path: process-big-data.sql
- cmd: check-big-file-status
outs:
- cmd: check-process-status
```

Then we need a way to know that the second depends on the first.
One way to fix this would be to use a 'status file' that exists primarily to enforce this ordering: it would be a second output of the first stage and an additional dependency of the second. This does not replace the need for command dependencies, however, because the existence of the status file does not mean the data is still in the database. If I blow away the database, all the status files are still there in my repository, but the database has no data.
Another way would be to add a `key` to the cmd outputs and deps: a user-defined unique name that is used to match dependencies and outputs.
A third would be to just match on the command, and assume identical commands are the same resource. This feels brittle but I can't identify a specific way it would fail.
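For the status file option, the wiring might look roughly like this (a sketch using the commands from the stages above):

```
# stage 1: import, with a status file as a second output
dvc run -d big-file.zip -d import.py -o big-file-status.txt \
    'python import.py big-file.zip && check-big-file-status > big-file-status.txt'
# stage 2: depends on the status file, so it always runs after the import
dvc run -d process-big-data.sql -d big-file-status.txt \
    'psql -f process-big-data.sql'
```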
@mdekstrand thanks for such an exhaustive explanation of your use case, that definitely sheds some more light on this issue!

> Another way would be to add a key to the cmd outputs and deps, that is a user-defined unique name that is used to match dependencies and outputs.

I would like to avoid this idea; explaining how to use `key` in the docs would be very extensive, and it would not be very convenient to make the user come up with an ID each time they need a `cmd-dep`.

I like the status file idea. As you mentioned, blowing away the database would not warn the user, but maybe we could overcome that with an `always_execute` flag. It would enforce generation of the status file; that way the user knows that something's fishy.
@mdekstrand @Suor noticed that you have `cmd` in your `outs`, which wasn't discussed before and complicates everything by a lot :)

I agree with @pared : if the only problem with `always_execute` (or `--always-reproduce`, as we've called it before) is that it will create redundant stages, then we should probably start with that approach, instead of jumping into dep/out cmds, which is much more complex and is merely a convenience over `--always-reproduce`.

@mdekstrand What do you think? 🙂
I would like to see the `cmd` solution, even if it requires status files for ordering, but am fine with trying a solution based on `always_reproduce` to get started.

In my mind, though, `cmd` deps/outs are a low-effort pluggable extension to external dependencies/outputs: rather than fetching from HTTP or S3, dvc would run a command.

The other idea I have for the functionality would require a lot more work, I think: a plugin mechanism so that in-repo code can add additional external dependency/output handlers, and then URL-based dependencies so that `pq://bookdata/big-thing-status` would dispatch through plugin code to get the status of the big thing.
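In a stage file that might read something like this (purely illustrative; no such scheme or plugin mechanism exists):

```yaml
deps:
- path: pq://bookdata/big-thing-status   # scheme dispatched to in-repo plugin code
```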
@mdekstrand `--always-changed` was released in 0.59.2, please give it a try and be sure to let us know how that works for you :slightly_smiling_face: Thanks for all the feedback!
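E.g., for the status script discussed above, something like:

```
dvc run --always-changed -o status.txt \
    'python step-status.py --my-step > status.txt'
```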
@efiop Will do as soon as I can! It's been a busy 2-3 weeks.