add: Repeated `dvc add` clears executable flag from files (`cache.type = copy`)
Bug Report
Description
With cache.type = copy, executing dvc add -R twice clears the executable flags from the added files. The first time, isexec is correctly added to the corresponding .dvc file; the second time, the files lose their executable flag and isexec is removed from the .dvc file.
Reproduce
#!/bin/bash
# create a repo with two files
mkdir dvc-reprex
cd dvc-reprex
mkdir dvcfiles
cd dvcfiles
echo 123 > f1
echo 234 > f2
cd ..
# Set the two files to executable
chmod a+x dvcfiles/*
# Verify
ls -lap dvcfiles
# Init git and dvc repo with cache.type copy
git init
dvc init
dvc config cache.type copy
git add .dvc/config
git commit -m "setup dvc"
# add the dvcfiles dir
dvc add -R dvcfiles
git add dvcfiles
git commit -m "committed dvcfiles"
# Check that files are still executable:
ls -lap dvcfiles
# All is fine:
# total 20
# drwxr-xr-x 1 zzz zzz 52 Apr 22 15:11 ./
# drwxr-xr-x 1 zzz zzz 52 Apr 22 15:01 ../
# -rwxr--r-- 1 zzz zzz 4 Apr 22 14:59 f1
# -rw-r--r-- 1 zzz zzz 82 Apr 22 15:07 f1.dvc
# -rwxr--r-- 1 zzz zzz 4 Apr 22 15:00 f2
# -rw-r--r-- 1 zzz zzz 82 Apr 22 15:07 f2.dvc
# -rw-r--r-- 1 zzz zzz 8 Apr 22 15:07 .gitignore
# DVC is clean:
dvc status
# Data and pipelines are up to date.
# dvc add again, which should do nothing IMO:
dvc add -R dvcfiles
# However files are not executable now:
ls -lap dvcfiles
# total 20
# drwxr-xr-x 1 zzz zzz 52 Apr 22 15:11 ./
# drwxr-xr-x 1 zzz zzz 52 Apr 22 15:01 ../
# -rw-r--r-- 1 zzz zzz 4 Apr 22 14:59 f1
# -rw-r--r-- 1 zzz zzz 67 Apr 22 15:11 f1.dvc
# -rw-r--r-- 1 zzz zzz 4 Apr 22 15:00 f2
# -rw-r--r-- 1 zzz zzz 67 Apr 22 15:11 f2.dvc
# -rw-r--r-- 1 zzz zzz 8 Apr 22 15:07 .gitignore
git status
# On branch master
# Changes not staged for commit:
# (use "git add <file>..." to update what will be committed)
# (use "git restore <file>..." to discard changes in working directory)
# modified: dvcfiles/f1.dvc
# modified: dvcfiles/f2.dvc
git diff
# diff --git a/dvcfiles/f1.dvc b/dvcfiles/f1.dvc
# index 169c330..2d76ec6 100644
# --- a/dvcfiles/f1.dvc
# +++ b/dvcfiles/f1.dvc
# @@ -1,5 +1,4 @@
# outs:
# - md5: ba1f2511fc30423bdbb183fe33f3dd0f
# size: 4
# - isexec: true
# path: f1
# diff --git a/dvcfiles/f2.dvc b/dvcfiles/f2.dvc
# index 22c3317..7c42519 100644
# --- a/dvcfiles/f2.dvc
# +++ b/dvcfiles/f2.dvc
# @@ -1,5 +1,4 @@
# outs:
# - md5: e42bb897d0afcdb1f1c46fb5e0c1ad22
# size: 4
# - isexec: true
# path: f2
dvc version
# DVC version: 2.10.1 (pip)
# ---------------------------------
# Platform: Python 3.9.7 on Linux-5.10.0-11-amd64-x86_64-with-glibc2.31
# Supports:
# webhdfs (fsspec = 2022.3.0),
# http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
# https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6)
# Cache types: reflink, hardlink, symlink
# Cache directory: btrfs on /dev/sda1
# Caches: local
# Remotes: None
# Workspace directory: btrfs on /dev/sda1
# Repo: dvc, git
git --version
# git version 2.30.2
Expected
One of two things:
- either don't touch the exec flag
- or, if there's a good reason for it, clear it on the first
dvc add
Environment information
Debian 11, both btrfs and ext4
Output of dvc doctor:
$ dvc doctor
DVC version: 2.10.1 (pip)
---------------------------------
Platform: Python 3.9.7 on Linux-5.10.0-11-amd64-x86_64-with-glibc2.31
Supports:
webhdfs (fsspec = 2022.3.0),
http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6)
Cache types: reflink, hardlink, symlink
Cache directory: btrfs on /dev/sda1
Caches: local
Remotes: None
Workspace directory: btrfs on /dev/sda1
Repo: dvc, git
Additional Information (if any):
- Tested on btrfs and on ext4
- I tried to see if this was a consequence of using
dvc add -R. Note that when adding the entire directory (not using-R), the exec flag is cleared on the first `dvc add!
@skshetry @efiop In the dvc add -R dvcfiles scenario, I'm not understanding why DVC doesn't preserve isexec since each individual file has its own .dvc file. In the case of adding the entire directory with dvc add dvcfiles, it might be trickier since there's no way currently to save the isexec info. WDYT?
It seems like it would also be nice to preserve the executable flag at the group/other levels, but let's leave that for a separate issue.
I can reproduce it on the latest DVC version.
It looks like this is a regression introduced in #7313.
Prior to that change, after running the script above, I get:
$ ll dvc-reprex/dvcfiles
total 40
drwxr-xr-x 7 dave wheel 224 May 3 20:12 .
drwxr-xr-x 6 dave wheel 192 May 3 20:12 ..
-rw-r--r-- 1 dave wheel 8 May 3 20:12 .gitignore
-rwxr--r-- 1 dave wheel 4 May 3 20:12 f1
-rw-r--r-- 1 dave wheel 82 May 3 20:12 f1.dvc
-rwxr--r-- 1 dave wheel 4 May 3 20:12 f2
-rw-r--r-- 1 dave wheel 82 May 3 20:12 f2.dvc
Running the same script after that change, I get:
$ ll dvc-reprex/dvcfiles
total 40
drwxr-xr-x 7 dave wheel 224 May 3 20:12 .
drwxr-xr-x 6 dave wheel 192 May 3 20:12 ..
-rw-r--r-- 1 dave wheel 8 May 3 20:12 .gitignore
-rw-r--r-- 1 dave wheel 4 May 3 20:12 f1
-rw-r--r-- 1 dave wheel 67 May 3 20:12 f1.dvc
-rw-r--r-- 1 dave wheel 4 May 3 20:12 f2
-rw-r--r-- 1 dave wheel 67 May 3 20:12 f2.dvc
cc @efiop
I'm curious how we handle this kind of metadata for now. An example is the tar in Unix which is good at handling Unix permissions, hard links, softlinks, different times associated with files. To achieve this, it stores this metadata as a header to the original file.