dvc icon indicating copy to clipboard operation
dvc copied to clipboard

dvc checkout will leave empty folders

Open glema-gpsw opened this issue 6 years ago • 13 comments

dvc checkout doesn't restore the exact snapshot as it was in a previous state. For example if you create a new folder and add files, and want to go back to a previous commit, where this folder and the files didn't exist, it will still show the folder, but empty.

Below you will find a reproducer of the problem.

#!/bin/bash                            
                                       
set -e                                 
set -x                                 
                                       
rm -rf myrepo                          
mkdir myrepo                           
cd myrepo                              
                                       
git init                               
dvc init                               
                                       
mkdir dir                              
dvc run -o dir/foo 'echo foo > dir/foo'
                                       
rm foo.dvc                             
                                       
dvc checkout                           
                                       
tree

glema-gpsw avatar Aug 14 '19 21:08 glema-gpsw

Discord context https://discordapp.com/channels/485586884165107732/485596304961962003/611308909470416898

efiop avatar Aug 14 '19 22:08 efiop

We are most likely using utils.remove method, so we probably need to make it(or some wrapper or whatever) remove empty parent directories up to repo root.

efiop avatar Sep 10 '19 15:09 efiop

Ok, as we've discovered with @iterative/engineering , this

mkdir dir                              
dvc run -o dir/foo 'echo foo > dir/foo'

is actually wrong and dvc repro will break on it, because the command itself should be able to create all the needed subdirectories.

But even if we include mkdir sbudir, when you clone this repo elsewhere and you dvc checkout, dvc will automatically create that directory, so it should also remove it when it checks out elsewhere.

efiop avatar Sep 10 '19 16:09 efiop

@efiop you explained that the first case is not correct. Could you please clarify with the second one - does DVC works properly and should we close this issue?

dmpetrov avatar Feb 03 '20 09:02 dmpetrov

@dmpetrov No longer able to reproduce. Everything works as expected. Closing. Thanks for the heads up!

efiop avatar Feb 03 '20 23:02 efiop

I am still able to reproduce this issue:

➜ dvc --version
1.8.2

➜ dvc init --no-scm
...

➜ mkdir foo

➜ dvc add foo
WARNING: 'foo' is empty.

➜ ls
foo/  foo.dvc

➜ cat foo.dvc
outs:
- md5: d751713988987e9331980363e24189ce.dir
  path: foo

➜ mkdir foo/bar

➜ dvc checkout --verbose
2020-10-12 14:27:46,792 DEBUG: Check for update is enabled.
2020-10-12 14:27:46,797 DEBUG: fetched: [(3,)]
2020-10-12 14:27:46,809 DEBUG: checking if 'foo'('HashInfo(name='md5', value='d751713988987e9331980363e24189ce.dir', dir_info=None)') has changed.
2020-10-12 14:27:46,809 DEBUG: Assuming '/Users/pokey/src/dvc-empty-test/.dvc/cache/d7/51713988987e9331980363e24189ce.dir' is unchanged since it is read-only
2020-10-12 14:27:46,811 DEBUG: Path '/Users/pokey/src/dvc-empty-test/foo' inode '87720699'
2020-10-12 14:27:46,811 DEBUG: fetched: [('99914b932bd37a50b983c5e7c90ae93b', '0', 'd751713988987e9331980363e24189ce.dir', '1602500101943105024')]
2020-10-12 14:27:46,812 DEBUG: 'foo' hasn't changed.
2020-10-12 14:27:46,813 DEBUG: Data 'foo' didn't change.
2020-10-12 14:27:46,814 DEBUG: fetched: [(4,)]
2020-10-12 14:27:46,817 DEBUG: Analytics is enabled.
2020-10-12 14:27:46,891 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/var/folders/yb/k48gsy1s3qg1gdd72v9y6rdr0000gn/T/tmplekybbra']'
2020-10-12 14:27:46,893 DEBUG: Spawned '['daemon', '-q', 'analytics', '/var/folders/yb/k48gsy1s3qg1gdd72v9y6rdr0000gn/T/tmplekybbra']'

➜ ls foo
bar/

At the end, I'd expect foo/bar to have been removed

pokey avatar Oct 12 '20 11:10 pokey

@pokey is right, its still an issue:

def test_checkout_empty_dir(tmp_dir, dvc):
    tmp_dir.dvc_gen({"empty_dir":{}})
    tmp_dir.gen({"empty_dir": {"subdir": {}}})

    dvc.checkout()

    assert not os.path.exists(tmp_dir / "empty_dir" / "subdir")

This test fails. Seems like our check falsely reports that the directory is unchanged. This is probably due to the way we calculate cache for dir: introducing empty directories does not influence dir hash.

Obvious fix: introduce directories (even empty) information to dir hash is probably a no-go - that will make all dir hashes calculated prior to change invalid.

pared avatar Oct 13 '20 10:10 pared

This is probably due to the way we calculate cache for dir: introducing empty directories does not influence dir hash.

Obvious fix: introduce directories (even empty) information to dir hash is probably a no-go - that will make all dir hashes calculated prior to change invalid.

Great points! Just for the record: that behaviour might be revisited in https://github.com/iterative/dvc/issues/829#issuecomment-711137723

But yeah, for now we could potentially just add some logic to clean up such cases.

efiop avatar Oct 18 '20 09:10 efiop

I'll just chime in and say that I'm still experiencing this problem on DVC 2.4.1.

mvonpohle avatar Jan 28 '22 01:01 mvonpohle

I'll just chime in and say that I'm still experiencing this problem on DVC 2.4.1.

Same on 2.9.3. Deleted some directories on one machine, pushed to remote, pulled this to my other machine and the directories remained, but were empty.

alanbuxton avatar Jan 28 '22 21:01 alanbuxton

@alanbuxton Sounds like you are tracking an empty directory, right? If so, that is correct dvc behaviour (it saved that the dir is empty and re-creates it empty on pull/checkout). This ticket is about leaving empty directories when switching from one version to another.

efiop avatar Jan 28 '22 22:01 efiop

@efiop yes makes sense. Apols for getting wrong end of the stick.

alanbuxton avatar Mar 13 '22 14:03 alanbuxton

I still face this issue in version 2.13.0 In my case I created two versions of the dataset. Let's say the dataset folder is called "numbers". This has folders with all odd numbered named folders until 10 with, for e.g, "one", "three", "five"... and so on. I tag this on git as "odd" Now I make changes to the dataset and remove all the folders inside "numbers" and replace then with a new set of folders whose folder names are all even numbered names now, for e.g, "two", "four", "six"... until 10. I tag this on git as "even" Note: I perform dvc add <folder> and those necessary steps to push the datasets.

I delete the dataset folder now. If I type dvc checkout it rebuilds the dataset with "even" folders since it is the latest dataset in cache. Now I wish to revert to "odd" dataset. I do git checkout odd dvc checkout

My numbers folder now has odd numbered folders and empty folders from "even" dataset.
Now I want to go the latest head on git for latest version of dataset. I do git checkout - dvc checkout This now has empty folders from "odd" dataset along with populated "even" folders.

narbhar avatar Jul 12 '22 21:07 narbhar

Checkout now deletes empty folders. Closing.

efiop avatar Dec 08 '23 11:12 efiop