pull --glob: never matches

Open macio232 opened this issue 2 years ago • 6 comments

Bug Report

Description

Despite having matching stages, dvc pull with the --glob option never finds any matches.

Reproduce

I cannot share my code, and I don't think preparing a toy example is needed here.

Expected

Pull all the outputs of stages matching the expression.

Environment information

Output of dvc doctor:

DVC version: 2.20.0 (pip)
---------------------------------
Platform: Python 3.9.5 on Linux-5.4.0-124-generic-x86_64-with-glibc2.31
Supports:
	azure (adlfs = 2022.4.0, knack = 0.9.0, azure-identity = 1.10.0),
	http (aiohttp = 3.8.1, aiohttp-retry = 2.5.1),
	https (aiohttp = 3.8.1, aiohttp-retry = 2.5.1),
	s3 (s3fs = 2022.5.0, boto3 = 1.21.21)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/ubuntu--vg-home--vg
Caches: local
Remotes: s3
Workspace directory: ext4 on /dev/mapper/ubuntu--vg-home--vg
Repo: dvc, git

Additional Information (if any): I am not sure if dvc doctor outputs correct information about cache types. In my config I have "reflink,copy", while above we can see Cache types: hardlink, symlink. This looks to me like another bug.

Kind regards, macio232

macio232 avatar Aug 25 '22 15:08 macio232

I guess this is the same problem mentioned in https://github.com/iterative/dvc/issues/6671#issuecomment-925468495

The issue is just that the existing implementation of --glob is very naive - it can only apply glob patterns to files which already exist in the local workspace. It does not support globbing against outputs within the repo tree (that do not already exist in the workspace).

Basically --glob is currently only useful for updating some subset of previously checked out or pulled data, or for pushing some subset of the existing data in your workspace.
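
To make the limitation concrete, here is a minimal Python sketch of the behaviour described above. It is not DVC's implementation: the file lists and the `match_targets` helper are hypothetical, and `fnmatch` from the standard library merely stands in for whatever matching DVC does internally.

```python
from fnmatch import fnmatch

# Hypothetical data (not from the issue): outputs tracked in the repo,
# and the subset of them that currently exists in the local workspace.
tracked_outputs = ["data/train.csv", "data/test.csv", "models/model.pkl"]
workspace_files = ["models/model.pkl"]


def match_targets(pattern, candidates):
    """Return the candidate paths that match the glob pattern."""
    return [path for path in candidates if fnmatch(path, pattern)]


# Globbing only against files already on disk finds nothing to pull.
workspace_matches = match_targets("data/*.csv", workspace_files)

# Globbing against the tracked outputs would find the missing files.
tracked_matches = match_targets("data/*.csv", tracked_outputs)

print(workspace_matches)
print(tracked_matches)
```

In this sketch the pattern only produces matches when it is applied to the list of tracked outputs, which is the case that the comment above says --glob does not currently cover.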

karajan1001 avatar Aug 26 '22 08:08 karajan1001

@karajan1001 It looks like there is a dedicated issue (https://github.com/iterative/dvc/issues/5864) that I didn't find before opening mine. I think this one can be closed as a duplicate.

What makes me worried is that this problem hasn't been addressed since April 2021 :(

macio232 avatar Aug 26 '22 09:08 macio232

Yeah, but I think we need to make this clearer in the documentation, as we have already received this kind of report several times.

karajan1001 avatar Aug 27 '22 07:08 karajan1001

Can we get a summary of the restrictions of --glob for all commands that have it? add, pull, push, repro. ~~Feel free to transfer this to dvc.org repo.~~

Thanks

jorgeorpinel avatar Sep 12 '22 20:09 jorgeorpinel

Specifically for pull this seems more like a bug than something to document. What good is it to "pull" something that's already present?

From https://github.com/iterative/dvc.org/pull/3933#pullrequestreview-1103001201

jorgeorpinel avatar Sep 12 '22 20:09 jorgeorpinel

Answer from https://github.com/iterative/dvc.org/pull/3933#issuecomment-1247903567:

For the other three commands, the targets are all local. I don't think dvc add --glob (for example) would miss any file that didn't exist in the workspace.

jorgeorpinel avatar Sep 20 '22 06:09 jorgeorpinel