`pull: false` subfield in `outs`

Open johan-sightic opened this issue 2 years ago • 8 comments

Request

In .dvc files, entries under outs have a push field. I think a pull field would be useful too. Data marked with pull: false would not be downloaded by a general dvc pull command.
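To make the request concrete, here is a rough sketch of what such a .dvc file might look like. The `pull` key is the proposed field and is not something DVC reads today; the file name, path, and hash placeholder are made up for illustration.

```yaml
# raw_data.dvc (hypothetical file name)
outs:
- md5: <hash>      # placeholder, not a real checksum
  path: raw_data   # hypothetical path to the large raw dataset
  pull: false      # PROPOSED field: a general `dvc pull` would skip this output
```

Under this proposal, the existing `push: false` would sit at the same level as `pull`, so the two keys would mirror each other.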

Motivation

The raw data is very large and processing it requires a lot of compute, so it is done on a separate server. Later stages in the pipeline only depend on the smaller processed data. Being able to run dvc pull without accidentally downloading the raw data would be nice.

johan-sightic avatar Apr 27 '23 14:04 johan-sightic

  1. What is the intended use of push: false?
  2. If you agree that this is a good feature I can try to implement it. Could you point me to where the push: field is defined?

johan-sightic avatar Apr 27 '23 14:04 johan-sightic

push: false is mainly intended for internal use to denote that certain files from dvc import or dvc import-url should not be pushed to DVC remotes (and should only ever be retrieved from their original source location).

pmrowla avatar Apr 28 '23 05:04 pmrowla

Thanks @johan-sightic! It makes sense to me, but you might want to know that @daavoo is already working on #9375. Do you think that would solve the issue for you?

dberenbaum avatar Apr 28 '23 15:04 dberenbaum

Hi @dberenbaum, I'm not sure I understand the change. Is the idea that after that PR I never have to run dvc pull and can just run dvc repro?

johan-sightic avatar May 09 '23 06:05 johan-sightic

> Hi @dberenbaum, I'm not sure I understand the change. Is the idea that after that PR I never have to run dvc pull and can just run dvc repro?

Yes, and it will only pull dependencies as needed, so if that first stage never changes, the raw data will never get pulled.

dberenbaum avatar May 09 '23 16:05 dberenbaum

@efiop If we prioritize adding pull: false and setting it on dvc import-url --no-download, it should mostly solve the use case of wanting to track external data (although admittedly it's clunky). WDYT about prioritizing it?

dberenbaum avatar Jul 13 '23 13:07 dberenbaum

Hi @dberenbaum, any update or workaround for this?

johan-sightic avatar Feb 14 '25 08:02 johan-sightic

There is a workaround: you can now skip dvc pull and instead run dvc repro --pull --allow-missing. This pulls the data needed during repro, and it allows skipping pulling or running stages where the only change is that the data is missing.

dberenbaum avatar Feb 17 '25 14:02 dberenbaum