
Keep persistent outputs protected by default?

Open menegais1 opened this issue 4 years ago • 7 comments

Due to disk-space concerns, some of my pipeline stages have persistent outputs that I handle in Python scripts. The stage receives a JSON file as input and outputs a folder of image files. Since there is no need to rewrite those files on every execution, I check whether a file already exists in the folder (it will be read-only) to avoid reprocessing it. As the dataset can get quite large, space consumption becomes worrisome: every `dvc repro` needs to unprotect all the files in the folder, copying them from the cache into the workspace. If an output could be marked as safe, so that it only undergoes append/remove operations, the unprotect step could be avoided, reducing space usage.
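The skip logic described above can be sketched like this (a minimal illustration with hypothetical names, assuming protected files have no write bits set):

```python
import os
import stat


def needs_processing(path: str) -> bool:
    """Return True if the output file still needs to be (re)generated.

    A file that already exists and is read-only is assumed to be a
    protected, cache-linked output from a previous run and is skipped.
    """
    if not os.path.exists(path):
        return True
    mode = os.stat(path).st_mode
    # Any write bit set means the file is not protected, so reprocess it.
    return bool(mode & (stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
```

With this check, the stage only touches files it actually needs to create, which is what makes the blanket unprotect on every `dvc repro` feel wasteful.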

menegais1 avatar Sep 08 '21 13:09 menegais1

Could we avoid running unprotect here at all? Or unprotect only the directory itself (make it writable)? It seems like there should be a config option for this.

shcheklein avatar Sep 08 '21 15:09 shcheklein

Thinking even more about this, I've renamed the ticket a bit.

To be honest, it's not clear why we do this extra step (unprotect) in the first place. In a lot of cases (the majority that come to mind, honestly), what we need is append semantics: the script doesn't overwrite existing files. It can be very expensive to unprotect everything and protect it all back every time.
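For context, the expensive round trip boils down to something like this (a simplified stdlib sketch of the concept, not DVC's actual implementation):

```python
import os
import shutil
import stat


def protect(path: str) -> None:
    # Make a cache-linked output read-only so a script cannot corrupt
    # the shared cache entry through the link.
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)


def unprotect(path: str) -> None:
    # Replace the read-only link with a private, writable copy.
    # This copy step is what makes unprotecting a large output
    # directory so expensive in time and disk space.
    tmp = path + ".unprotect-tmp"
    shutil.copyfile(path, tmp)
    os.replace(tmp, path)
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
```

If a stage only ever appends new files, neither step is needed for the files that already exist, which is the argument for skipping the round trip by default.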

I'm not sure about checkpoints though, but we also use a different flag for them for other reasons?

cc @dberenbaum @pmrowla

shcheklein avatar Sep 09 '21 04:09 shcheklein

For checkpoints, I think it will be necessary to unprotect because it's about loading state from the existing model file and writing an updated model file at the end. However, configuring a flag for dvc repro and dvc exp run generally to avoid needing to unprotect is probably fine.

dberenbaum avatar Sep 09 '21 18:09 dberenbaum

@dberenbaum Just to confirm, you meant that we would need a dvc.yaml option for each such output, right? Similar to what is implemented in https://github.com/iterative/dvc/pull/7072 by @rgferrari ?
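Presumably the option would sit next to the existing per-output flags, something like this (the `unprotect: false` field name is purely illustrative, not what the PR actually implements; `persist: true` is the existing flag for persistent outputs):

```yaml
stages:
  prepare:
    cmd: python prep_data.py data.json data/images
    deps:
      - data.json
    outs:
      - data/images:
          persist: true
          unprotect: false   # hypothetical: skip the unprotect step for this output
```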

efiop avatar Dec 06 '21 07:12 efiop

Yes, this was a mistake on my part. I was thinking of a dvc.yaml option, since it seems more useful to specify the behavior permanently than to have to include the option every time you reproduce the pipeline.

dberenbaum avatar Dec 06 '21 16:12 dberenbaum

Hi all, I am here to re-inject some interest in this issue (relevant discord thread).

I am working with a pretty common workflow where there is a "prep_data.py" script that creates cleaned data for training and other downstream stages.

The problem is that this scenario involves a shared cache (multiple users on the same machine with a shared data drive). Any pipeline operation (e.g. `dvc repro`) requires every file created by the stage to be unprotected, which is extremely expensive. The `prep_data.py` script always deletes and recreates files when it detects changes, so I am not worried about corruption (that much).

I think an option to not unprotect, and/or a way to programmatically control protection (via a Python SDK), would be useful.
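In the meantime, the permission toggling itself can be scripted with the stdlib. The helper below is hypothetical (there is no such call in `dvc.api`), and note the caveat: on hardlinked outputs, a chmod reaches the shared cache entry through the link, which is exactly the corruption risk DVC's copy-based unprotect guards against.

```python
import os
import stat


def set_protected(root: str, protected: bool) -> None:
    """Toggle the write bits on every file under an output directory.

    Unlike `dvc unprotect`, this does not copy files out of the cache;
    on hardlinked outputs the permission change also affects the shared
    cache entry, so use it only when the stage never rewrites files.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mode = os.stat(path).st_mode
            if protected:
                mode &= ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH)
            else:
                mode |= stat.S_IWUSR
            os.chmod(path, mode)
```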

uditrana avatar May 30 '23 19:05 uditrana

> To be honest, it's not clear why we do this extra step (unprotect) in the first place. In a lot of cases (the majority that come to mind, honestly), what we need is append semantics: the script doesn't overwrite existing files. It can be very expensive to unprotect everything and protect it all back every time.
>
> I'm not sure about checkpoints though, but we also use a different flag for them for other reasons?

Now that we don't have checkpoints, should we consider dropping unprotect as default behavior completely? I guess we aren't in a good position to make this kind of breaking change right now, but at least if we introduce the flag soon, we could make it the default in the next major release.

dberenbaum avatar Jun 15 '23 16:06 dberenbaum