`pull: false` subfield in `outs`
Request
In .dvc files there is a push field under outs, and I think a pull field would be useful too. Data marked with pull: false would not be downloaded by a general dvc pull command.
Motivation
The raw data is very large and processing it requires a lot of compute, so it is done on a separate server. Later stages in the pipeline only depend on the smaller processed data. Being able to run dvc pull without accidentally downloading the raw data would be nice.
- What is the intended use of push: false?
- If you agree that this is a good feature I can try to implement it. Could you point me to where the push field is defined?
push: false is mainly intended for internal use to denote that certain files from dvc import or dvc import-url should not be pushed to DVC remotes (and should only ever be retrieved from their original source location).
Thanks @johan-sightic! It makes sense to me, but you might want to know that @daavoo is already working on #9375. Do you think that would solve the issue for you?
Hi @dberenbaum, I'm not sure I understand the change. Is the idea that after that PR I never have to run dvc pull and can just run dvc repro?
Yes, and it will only pull dependencies as needed, so if that first stage never changes, the raw data will never get pulled.
@efiop If we prioritize adding pull: false and setting it on dvc import-url --no-download, it should mostly solve the use case of wanting to track external data (although admittedly it's clunky). WDYT about prioritizing it?
Hi @dberenbaum, any update or workaround for this?
There is a workaround: you can now skip dvc pull and instead run dvc repro --pull --allow-missing. This pulls only the data needed during repro, and it skips pulling or running stages where the only change is that the data is missing.