pull --glob: never matches
Bug Report
Description
Despite having matching stages, dvc pull with the --glob option never finds any matches.
Reproduce
I can not share my code, and I don't think preparing a toy example is needed here.
Expected
Pull all the outputs of stages matching the expression.
Environment information
Output of dvc doctor:
DVC version: 2.20.0 (pip)
---------------------------------
Platform: Python 3.9.5 on Linux-5.4.0-124-generic-x86_64-with-glibc2.31
Supports:
azure (adlfs = 2022.4.0, knack = 0.9.0, azure-identity = 1.10.0),
http (aiohttp = 3.8.1, aiohttp-retry = 2.5.1),
https (aiohttp = 3.8.1, aiohttp-retry = 2.5.1),
s3 (s3fs = 2022.5.0, boto3 = 1.21.21)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/ubuntu--vg-home--vg
Caches: local
Remotes: s3
Workspace directory: ext4 on /dev/mapper/ubuntu--vg-home--vg
Repo: dvc, git
Additional Information (if any):
I am not sure if dvc doctor outputs correct information about cache types. In my config I have "reflink,copy", while above we can see Cache types: hardlink, symlink. This looks to me like another bug.
Kind regards, macio232
I guess this is the same problem mentioned in https://github.com/iterative/dvc/issues/6671#issuecomment-925468495
The issue is just that the existing implementation of --glob is very naive - it can only apply glob patterns to files which already exist in the local workspace. It does not support globbing against outputs within the repo tree (that do not already exist in the workspace).
Basically --glob is currently only useful for updating some subset of previously checked out or pulled data, or for pushing some subset of the existing data in your workspace.
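To illustrate the limitation described above, here is a minimal sketch in plain Python. It is not DVC's actual implementation: the file lists and the helper names glob_workspace and glob_tracked are hypothetical and only model the difference between matching a pattern against paths that already exist locally and matching it against all outputs the repo knows about.

```python
from fnmatch import fnmatch

# Hypothetical data (not taken from the report): outputs declared in the
# repo versus files that are actually present in the local workspace.
tracked_outputs = ["data/train.csv", "data/test.csv", "models/model.pkl"]
workspace_files = ["models/model.pkl"]  # data/ has not been pulled yet


def glob_workspace(pattern, workspace):
    """Match the pattern only against paths that already exist locally."""
    return [path for path in workspace if fnmatch(path, pattern)]


def glob_tracked(pattern, tracked):
    """Match the pattern against every declared output, present or not."""
    return [path for path in tracked if fnmatch(path, pattern)]


workspace_matches = glob_workspace("data/*.csv", workspace_files)
tracked_matches = glob_tracked("data/*.csv", tracked_outputs)

print(workspace_matches)  # [] -> nothing to pull, although outputs exist
print(tracked_matches)    # ['data/train.csv', 'data/test.csv']
```

Under these assumptions, the workspace-only match comes back empty on a fresh checkout, which is consistent with the "never matches" behaviour reported here.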
@karajan1001 It looks like there is a dedicated issue (https://github.com/iterative/dvc/issues/5864) that I didn't find before opening mine. I think this one can be closed as a duplicate.
What makes me worried is that this problem hasn't been addressed since April 2021 :(
Yeah, but I think maybe we need to make it clearer in the documentation, as we have already received this kind of report several times.
Can we get a summary of the restrictions of --glob for all commands that have it? add, pull, push, repro. ~~Feel free to transfer this to dvc.org repo.~~
Thanks
Specifically for pull this seems more like a bug than something to document. What good is it to "pull" something that's already present?
From https://github.com/iterative/dvc.org/pull/3933#pullrequestreview-1103001201
Answer from https://github.com/iterative/dvc.org/pull/3933#issuecomment-1247903567:
For the other three commands, their targets are all local. I don't think dvc add --glob (for example) would miss any file that didn't exist in the workspace.