data status: show untracked files in normal mode
From https://github.com/iterative/dvc.org/pull/3812#discussion_r931002653:
Is it a performance optimization?
It no longer is, or will no longer be of performance concern. We didn't have `--untracked-files=normal` support in dulwich so we had to use `--untracked-files=all`, so it'd be noisy and slow if you have too many untracked files.

But thanks to @dtrifiro's great work in pygit2, the `git.status()` is about 30x faster. Plus, we now have `--untracked-files=normal` support in pygit2, which will be much faster even for a very large repository. So there's no performance issue now if we decide to revisit. See iterative/scmrepo#118.

(I have been testing a repo with 150,000 untracked files from a dataset, and `--untracked-files=normal` takes ~20ms now)
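For anyone unfamiliar with the two modes, here is a rough sketch of why `normal` is less noisy than `all`: a directory that contains no tracked files is reported as a single entry instead of one entry per file. This is a simplified model, not the dulwich or pygit2 implementation. The function name `list_untracked` and the sample paths are made up for illustration, and ignore rules are left out entirely.

```python
from pathlib import PurePosixPath


def list_untracked(paths, tracked, mode="normal"):
    """Return untracked entries for a worktree, given all file paths and the tracked ones.

    Hypothetical helper: models only the directory-collapsing behavior, no .gitignore handling.
    """
    tracked = set(tracked)
    untracked_files = sorted(p for p in paths if p not in tracked)
    if mode == "all":
        # Every untracked file is listed individually.
        return untracked_files
    if mode != "normal":
        raise ValueError(f"unsupported mode: {mode}")

    # Directories that contain at least one tracked file cannot be collapsed.
    tracked_dirs = set()
    for t in tracked:
        for parent in PurePosixPath(t).parents:
            tracked_dirs.add(str(parent))

    result, seen = [], set()
    for f in untracked_files:
        parts = PurePosixPath(f).parts
        entry = f
        # Find the topmost ancestor directory with no tracked files in it.
        for i in range(1, len(parts)):
            d = "/".join(parts[:i])
            if d not in tracked_dirs:
                entry = d + "/"
                break
        if entry not in seen:
            seen.add(entry)
            result.append(entry)
    return result


paths = [
    "README.md",
    "data/train.dvc",
    "data/raw/1.csv",
    "data/raw/2.csv",
    ".venv/lib/a.py",
    "notes.txt",
]
tracked = ["README.md", "data/train.dvc"]

print(list_untracked(paths, tracked, mode="all"))
# ['.venv/lib/a.py', 'data/raw/1.csv', 'data/raw/2.csv', 'notes.txt']
print(list_untracked(paths, tracked, mode="normal"))
# ['.venv/', 'data/raw/', 'notes.txt']
```

With a large untracked dataset directory, `all` grows with the number of files while `normal` stays at one entry per untracked directory.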
Note that we now have the support in upstream scmrepo, but we are waiting for a bugfix from dulwich to release a new version.
@dberenbaum, we have the support for this in upstream now.
Also there have been questions about --untracked-files being inconsistent with the rest of the flags (https://github.com/iterative/dvc/pull/7943#discussion_r963325138), so if we do make it normal by default, we may want to rename this to just --untracked.
Maybe we can even get rid of the flag, enable --untracked-files=normal by default and change to --untracked-files=all when --granular is used.
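To make the proposal concrete, the mode selection could look roughly like this. It is only a sketch of the idea in this comment, not existing DVC code, and `pick_untracked_mode` is a hypothetical name.

```python
def pick_untracked_mode(granular: bool) -> str:
    """Choose the untracked-files mode from the --granular setting (proposal sketch only)."""
    # Proposed default: collapse untracked directories ("normal").
    # With --granular, list every untracked file ("all").
    return "all" if granular else "normal"


print(pick_untracked_mode(granular=False))  # normal
print(pick_untracked_mode(granular=True))   # all
```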
Going back to the original rationale in https://github.com/iterative/dvc/pull/7943#issuecomment-1172604913:
* Users don't expect to track the entire repo with DVC.
* If we suggest doing `dvc data status` and `git status` as a pair, they become redundant for untracked files.
This still applies, and @mattseddon mentioned that a VS Code user already commented that it was confusing to see untracked files show up under both Git and DVC, so my preference is not to show them by default. Do you think it makes sense to always show untracked files?
change to `--untracked-files=all` when `--granular` is used.
It's a good idea, but I worry it would still get too busy if there is something like a virtualenv dir in the repo, especially since there's no target path support in dvc data status.
Not planned for now