push/pull: Possible bug with individual file push and pull to Google Drive remote storage
Bug Report
Thank you for the fantastic work with DVC! I noticed an odd behavior while working with individual files and trying to push + pull these in a reproducible way. I felt this might be strange enough to warrant a bug-type issue, but I could also see how I might be "doing it wrong" and need guidance about this. If I'm doing things entirely incorrectly, I feel that documentation updates may be warranted to help others who do or think about things in a similar way avoid this issue in the future.
Description
I usually use the data versioning guide as a reference when getting going with DVC projects. As part of that guide it covers adding individual files and pushing them to a remote.
When I add multiple individual files using DVC, push them to a remote, then attempt to pull those same pushed files, I notice that I don't get back what I think I should. I created the following example repo in an effort to reproduce the issue, thinking that it might also have something to do with DVC package or dependency versions (I don't recall this behavior in earlier versions, but I'm not certain when it might have begun or if that's truly related here). It seemed like .gitignore files sometimes changed the behavior here in terms of what was "seen" or not, but it didn't seem to affect what was pushed or pulled.
As a workaround to the individual file addition inconsistencies, I've found that adding the files as a directory seemed to work well. Adding a directory didn't appear in the data versioning guide as a preferred way of doing things, so it took some time to figure out what was happening. If the docs are updated, I'd suggest providing strongly worded suggestions about DVC preferences on singular file vs directory additions (what's the better pattern to follow, or, if there's no pattern / preference, stating that openly).
Please see the following link for the example code: https://github.com/d33bs/demo-dvc-possible-push-bug Note: use any Google Drive folder ID you have access to within the config (I don't wish to share my own in this case to avoid security challenges which may be associated with this).
Reproduce
- Clone repo at: https://github.com/d33bs/demo-dvc-possible-push-bug
- Update the DVC config with a Google Drive folder ID accessible by a Google account you own
- Run poetry install
- Run poetry run poe dvc_possible_bug
- View results
Expected
I'd expect that individual files or directories act similarly when added, pushed, and pulled using DVC.
Environment information
Output of dvc doctor:
$ dvc doctor
DVC version: 3.43.1 (pip)
-------------------------
Platform: Python 3.11.2 on macOS
Subprojects:
dvc_data = 3.9.0
dvc_objects = 3.0.6
dvc_render = 1.0.1
dvc_task = 0.3.0
scmrepo = 2.1.1
Supports:
gdrive (pydrive2 = 1.19.0),
http (aiohttp = 3.9.3, aiohttp-retry = 2.8.3),
https (aiohttp = 3.9.3, aiohttp-retry = 2.8.3)
Config:
Global: /Users/username/Library/Application Support/dvc
System: /Library/Application Support/dvc
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: gdrive
Additional Information (if any):
Thanks a ton for any help you may be able to provide, including suggestions towards best practices or errors in my approach!
@d33bs I don't see the dvc push command in the script at all; is that expected? Also, dvc add data/data_sub_dir/zen.zip is duplicated.
In the remove data script you are also removing the .dvc file, which means dvc pull can't bring it back. The data dir is not controlled by DVC, so what is the expected behavior in this case for you?
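For readers following along, here is a minimal sketch of the round trip discussed in this comment: add individual files, push, delete only the workspace copies while keeping the .dvc pointer files, then pull. The function name `round_trip` and its injectable `run` parameter are hypothetical and are not taken from the example repo; only the `dvc add`, `dvc push`, and `dvc pull` commands come from the thread.

```python
import subprocess
from pathlib import Path


def round_trip(data_files, run=subprocess.run):
    """Add, push, delete, and pull individual files (illustrative only).

    `run` defaults to subprocess.run and can be swapped for a stub.
    """
    for path in data_files:
        # `dvc add <file>` writes a pointer file named <file>.dvc next to it.
        run(["dvc", "add", path], check=True)

    run(["dvc", "push"], check=True)

    for path in data_files:
        # Delete only the data file. The matching .dvc pointer file is left
        # in place, because `dvc pull` needs it to know what to restore.
        Path(path).unlink(missing_ok=True)

    run(["dvc", "pull"], check=True)
```

Calling `round_trip(["data/data_sub_dir/zen.zip"])` inside a DVC project with a configured remote would exercise the same sequence; deleting `data/data_sub_dir/zen.zip.dvc` as well would leave `dvc pull` with nothing to restore for that file.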
Thank you @shcheklein for the kind feedback and apologies for the earlier bugs. I've updated the repo just now based on your questions + comments. Despite this, I still seem unable to pull the files when they are added individually. Is there something else I'm possibly doing wrong? Thanks again for any guidance you can offer.
@d33bs good change.
I think, now you could also remove:
-/data
-!/data/*.dvc
-!/data/*/*.dvc
from the .gitignore. DVC takes care of that automatically, and it seems these tricky conditions are causing some trouble (not sure why tbh, but it becomes less important to solve).
Could you give it a try, please?
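One possible explanation for why the rules above misbehave, which the thread does not confirm, is Git's documented rule that a file cannot be re-included if one of its parent directories is excluded. Under that rule, `/data` excludes the directory itself, so the later `!/data/*.dvc` negation never takes effect. The sketch below is a deliberately simplified model of that one rule, not Git's real matcher: it handles only root-anchored patterns with `*`, and the names `is_ignored`, `_matches`, and `data/file.csv.dvc` are hypothetical.

```python
from fnmatch import fnmatchcase


def _matches(pattern, path):
    """Match a root-anchored pattern against a repo-relative path.

    Both are split on '/', so '*' never crosses a directory boundary.
    """
    pat_parts = pattern.strip("/").split("/")
    path_parts = path.split("/")
    if len(pat_parts) != len(path_parts):
        return False
    return all(fnmatchcase(part, pat) for part, pat in zip(path_parts, pat_parts))


def is_ignored(path, rules):
    """Simplified model: walk from the top directory down to the file.

    At each level the last matching rule wins. Once a level is ignored,
    nothing below it is examined, so deeper negations cannot re-include.
    """
    parts = path.split("/")
    for depth in range(1, len(parts) + 1):
        prefix = "/".join(parts[:depth])
        verdict = None
        for rule in rules:
            negated = rule.startswith("!")
            pattern = rule[1:] if negated else rule
            if _matches(pattern, prefix):
                verdict = not negated
        if verdict:
            return True
    return False


original_rules = ["/data", "!/data/*.dvc", "!/data/*/*.dvc"]
contents_only_rules = ["/data/*", "!/data/*.dvc"]  # assumed alternative, top level only

print(is_ignored("data/file.csv.dvc", original_rules))
print(is_ignored("data/file.csv.dvc", contents_only_rules))
```

In this model the first rule set ignores the pointer file because the `data` directory is excluded as a whole, while the second excludes the directory's contents and then re-includes the top-level pointer files. On a real repository, `git check-ignore -v <path>` reports which rule, if any, applies to a path.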
Thanks @shcheklein ! This seems to have allowed DVC to perform the correct actions! I've updated the repo accordingly.
Some follow-up questions / thoughts:
- Does DVC impose the requirement of using nested .gitignore files? This is generally a pattern I personally avoid to help reduce the number of files for a project and provide a single place to look for .gitignore rules (generally the root of the project).
- If there is a requirement that DVC uses nested .gitignore files, could I suggest this be made more prominent in the guidance documentation (for example, in the data versioning guide I linked with the issue)?
- If this isn't a requirement, is it possible that there's a bug in the way DVC reads the rules you mentioned I should remove?
Thank you again for your continued help with this!
hey, sure.
I would check this response by @pmrowla to see how to use a single .gitignore and what it should look like in your case.
Closing as this does not look like a bug in dvc, but is more of a documentation issue.
Hi @skshetry , thanks for the update here. Were the documentation related aspects mentioned here already addressed in another issue or related PR? If not, would it be possible to do so as part of addressing this issue (perhaps retitling/refocusing the issue towards that effect)?