
data status: show untracked files in normal mode

Open dberenbaum opened this issue 3 years ago • 5 comments

From https://github.com/iterative/dvc.org/pull/3812#discussion_r931002653:

> > Is it a performance optimization?
>
> It no longer is, or soon will no longer be, a performance concern. We didn't have --untracked-files=normal support in dulwich, so we had to use --untracked-files=all, which is noisy and slow if you have too many untracked files.
>
> But thanks to @dtrifiro's great work in pygit2, git.status() is about 30x faster. Plus, we now have --untracked-files=normal support in pygit2, which will be much faster even for a very large repository. So there's no performance issue now if we decide to revisit. See iterative/scmrepo#118.
>
> (I have been testing a repo with 150,000 untracked files from a dataset, and --untracked-files=normal takes ~20ms now)
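
For anyone unsure how the two modes differ: `all` lists every untracked file, while `normal` reports a directory once when nothing under it is tracked. Below is a rough sketch of that collapsing rule in plain Python. It is not how git, pygit2, or scmrepo implement it; `collapse_untracked` is a hypothetical helper, and it ignores `.gitignore` handling and empty directories.

```python
from pathlib import PurePosixPath


def collapse_untracked(untracked, tracked):
    """Approximate --untracked-files=normal from an --untracked-files=all listing.

    Hypothetical helper for illustration only. Paths are repo-relative
    and use forward slashes.
    """
    # Every directory that contains at least one tracked file.
    tracked_dirs = set()
    for path in tracked:
        for parent in PurePosixPath(path).parents:
            tracked_dirs.add(str(parent))

    result = set()
    for path in untracked:
        # Ancestors ordered from the repo root down to the file's directory.
        ancestors = list(PurePosixPath(path).parents)[::-1]
        for ancestor in ancestors:
            name = str(ancestor)
            # The first ancestor holding no tracked files is reported once.
            if name != "." and name not in tracked_dirs:
                result.add(name + "/")
                break
        else:
            # Every ancestor holds tracked files, so list the file itself.
            result.add(path)
    return sorted(result)


tracked = ["src/main.py", "README.md"]
untracked = ["data/raw/1.csv", "data/raw/2.csv", "src/notes.txt", "todo.txt"]
print(collapse_untracked(untracked, tracked))
# ['data/', 'src/notes.txt', 'todo.txt']
```

With a 150,000-file dataset directory, `all` would produce 150,000 entries where `normal` produces one, which is where the noise difference comes from.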

dberenbaum, Jul 27 '22 20:07

Note that we now have support for this in upstream scmrepo, but we are waiting for a bugfix from dulwich before releasing a new version.

skshetry, Aug 08 '22 12:08

@dberenbaum, we have the support for this in upstream now.

skshetry, Sep 15 '22 06:09

Also, there have been questions about --untracked-files being inconsistent with the rest of the flags (https://github.com/iterative/dvc/pull/7943#discussion_r963325138), so if we do make it normal by default, we may want to rename this to just --untracked.

skshetry, Sep 15 '22 06:09

Maybe we can even get rid of the flag, enable --untracked-files=normal by default and change to --untracked-files=all when --granular is used.

skshetry, Sep 15 '22 06:09

Going back to the original rationale in https://github.com/iterative/dvc/pull/7943#issuecomment-1172604913:

  * Users don't expect to track the entire repo with DVC.
  * If we suggest doing `dvc data status` and `git status` as a pair, they become redundant for untracked files.

This still applies. @mattseddon also mentioned that a VS Code user already commented that it was confusing to see untracked files show up under both Git and DVC, so my preference is not to show them by default. Do you think it makes sense to always show untracked files?

> change to --untracked-files=all when --granular is used.

It's a good idea, but I worry it would still get too busy if there is something like a virtualenv dir in the repo, especially since there's no target path support in dvc data status.

dberenbaum, Sep 15 '22 18:09

Not planned for now

dberenbaum, May 20 '23 20:05