dvc status --json can output non-json
Bug Report
Description
When there are large, uncached files to hash, dvc status --json still prints a log message alongside the JSON, so the output is not valid JSON. I believe the use case for dvc status --json is piping the output to a file and reading it with another program, so extra messages make this inconvenient.
I accidentally erased the output I had, but I think this is the message that is printed: https://github.com/iterative/dvc-data/blob/300a3e072e5baba50f7ac5f91240891c0e30d030/src/dvc_data/hashfile/hash.py#L174
Reproduce
- large data file stage dependency
- dvc status --json for the first time
Expected
dvc status --json only outputs valid json
Environment information
Output of dvc doctor:
DVC version: 3.33.4 (choco)
---------------------------
Platform: Python 3.11.6 on Windows-10-10.0.19045-SP0
Subprojects:
dvc_data = 2.24.0
dvc_objects = 2.0.1
dvc_render = 1.0.0
dvc_task = 0.3.0
scmrepo = 1.6.0
Supports:
azure (adlfs = 2023.12.0, knack = 0.11.0, azure-identity = 1.15.0),
gdrive (pydrive2 = 1.19.0),
gs (gcsfs = 2023.12.2.post1),
http (aiohttp = 3.9.1, aiohttp-retry = 2.8.3),
https (aiohttp = 3.9.1, aiohttp-retry = 2.8.3),
oss (ossfs = 2023.12.0),
s3 (s3fs = 2023.12.2, boto3 = 1.33.13),
ssh (sshfs = 2023.10.0)
Config:
Global: C:\Users\starrgw1\AppData\Local\iterative\dvc
System: C:\ProgramData\iterative\dvc
I think dvc status -q --json should be returning what you expect (json only) instead of only a return code. We have similar behavior in other commands like dvc data status -q --json. @skshetry What do you think?
I saw that in the documentation, but it seemed like that would result in no output at all:
do not write anything to standard output. Exit with 0 if data and pipelines are up to date, otherwise 1.
https://dvc.org/doc/command-reference/status#-q
You are right, @gregstarr, it outputs nothing today. I mean that we should change the behavior to work this way, since there's no reason to do dvc status -q --json unless you want JSON output. We generally will still show JSON even with -q in these scenarios, like in dvc data status -q --json for example.
Personally, I think just --json should basically imply "-q --json", because I can't think of any use case for using --json but wanting extra output.
Discussed with the team. There are two issues here:
- We should not be writing those additional non-json logging messages to stdout
- We should be showing json regardless of -q when the --json flag is used
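The first point can be illustrated with a minimal Python sketch, not DVC's actual code: diagnostics go through a logging handler bound to stderr, and only the JSON payload is written to stdout. The function name emit_status, the logger name, the log message, and the sample payload are all hypothetical and chosen only for illustration.

```python
import io
import json
import logging
import sys


def emit_status(status, stdout=None, stderr=None):
    """Write diagnostics to stderr and only the JSON payload to stdout."""
    stdout = stdout if stdout is not None else sys.stdout
    stderr = stderr if stderr is not None else sys.stderr

    logger = logging.getLogger("status_sketch")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the message away from other handlers
    handler = logging.StreamHandler(stderr)
    logger.addHandler(handler)
    try:
        # Hypothetical progress message; it never reaches stdout.
        logger.info("hashing a large file, this may take a while")
        stdout.write(json.dumps(status) + "\n")
    finally:
        logger.removeHandler(handler)


# Capture both streams to show that stdout stays parseable.
out, err = io.StringIO(), io.StringIO()
emit_status({"example_stage": ["changed"]}, out, err)
parsed = json.loads(out.getvalue())
```

With this split, redirecting stdout to a file captures only the JSON, while the progress message still shows up in the terminal.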
@gregstarr How are you using the command?
I have been using it like this:
dvc status --json <target> > status.json
I am using it to check the status of my pipelines and pass the json data to a minimal flask server for viewing.
Basically came as a result of this discussion post: https://discuss.dvc.org/t/ignore-files-in-stage-external-dependency-output/1889/2
I have a periodic task which clears out all the DS_Store files from my remote and checks the status of my pipelines.
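For a consumer like this, a possible stopgap on affected versions is to skip any leading non-JSON text before parsing. The sketch below is an assumption-laden workaround, not part of DVC: extract_json is a hypothetical helper, and the noisy sample string is made up rather than real dvc status output.

```python
import json


def extract_json(text):
    """Return the first JSON object or array found in text, skipping leading noise."""
    decoder = json.JSONDecoder()
    for index, char in enumerate(text):
        if char in "{[":
            try:
                # raw_decode parses one JSON value starting at the given index.
                value, _end = decoder.raw_decode(text, index)
            except json.JSONDecodeError:
                continue  # a stray bracket inside a log message; keep scanning
            return value
    raise ValueError("no JSON object or array found")


# Made-up example: one log line followed by a JSON document.
noisy = "some progress message about 'data.bin'\n{\"train\": [\"changed\"]}\n"
status = extract_json(noisy)
```

One limitation: a log message that itself contains valid JSON, such as "[1]", would be returned instead of the real payload, so this is only a workaround until the command emits clean JSON.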
@gregstarr dvc status -q --json should now provide only the JSON output.
@skshetry Do you want to keep it open as a reminder to stream logs to stderr here?
Thanks for looking into this!