dvc icon indicating copy to clipboard operation
dvc copied to clipboard

dvc status --json can output non-json

Open gregstarr opened this issue 2 years ago • 10 comments

Bug Report

Description

When there are large files to hash which are not cached, dvc status --json will still print out the message, which makes the output not valid json. I believe the use case of dvc status --json is to be able to pipe the output to a file and easily read it with another program, so extra messages make this inconvenient.

I accidentally erased the output I had but I think this is the message that is printed out: https://github.com/iterative/dvc-data/blob/300a3e072e5baba50f7ac5f91240891c0e30d030/src/dvc_data/hashfile/hash.py#L174

Reproduce

  1. large data file stage dependency
  2. dvc status --json for the first time

Expected

dvc status --json only outputs valid json

Environment information

Output of dvc doctor:

DVC version: 3.33.4 (choco)
---------------------------
Platform: Python 3.11.6 on Windows-10-10.0.19045-SP0
Subprojects:
        dvc_data = 2.24.0
        dvc_objects = 2.0.1
        dvc_render = 1.0.0
        dvc_task = 0.3.0
        scmrepo = 1.6.0
Supports:
        azure (adlfs = 2023.12.0, knack = 0.11.0, azure-identity = 1.15.0),
        gdrive (pydrive2 = 1.19.0),
        gs (gcsfs = 2023.12.2.post1),
        http (aiohttp = 3.9.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.1, aiohttp-retry = 2.8.3),
        oss (ossfs = 2023.12.0),
        s3 (s3fs = 2023.12.2, boto3 = 1.33.13),
        ssh (sshfs = 2023.10.0)
Config:
        Global: C:\Users\starrgw1\AppData\Local\iterative\dvc
        System: C:\ProgramData\iterative\dvc

gregstarr avatar Jan 17 '24 15:01 gregstarr

I think dvc status -q --json should be returning what you expect (json only) instead of only a return code. We have similar behavior in other commands like dvc data status -q --json. @skshetry What do you think?

dberenbaum avatar Jan 17 '24 18:01 dberenbaum

I saw that in the documentation but it seemed like that would result in no output at all

do not write anything to standard output. Exit with 0 if data and pipelines are up to date, otherwise 1.

https://dvc.org/doc/command-reference/status#-q

gregstarr avatar Jan 17 '24 18:01 gregstarr

You are right, @gregstarr, it outputs nothing today. I mean that we should change the behavior to work this way, since there's no reason to do dvc status -q --json unless you want JSON output. We generally will still show JSON even with -q in these scenarios, like in dvc data status -q --json for example.

dberenbaum avatar Jan 17 '24 19:01 dberenbaum

Personally, I think just --json should basically imply "-q --json" because I can't think of any use case for using --json but wanting extra output

gregstarr avatar Jan 17 '24 23:01 gregstarr

Discussed with the team. There are two issues here:

  1. We should not be writing those additional non-json logging messages to stdout
  2. We should be showing json regardless of -q when --json flag is used

@gregstarr How are you using the command?

dberenbaum avatar Jan 23 '24 13:01 dberenbaum

I have been using it like this:

dvc status --json <target> > status.json

I am using it to check the status of my pipelines and pass the json data to a minimal flask server for viewing.

Basically came as a result of this discussion post: https://discuss.dvc.org/t/ignore-files-in-stage-external-dependency-output/1889/2

I have a periodic task which clears out all the DS_Store files from my remote and checks the status of my pipelines.

gregstarr avatar Jan 23 '24 14:01 gregstarr

@gregstarr dvc status -q --json should now provide only the JSON output.

@skshetry Do you want to keep it open as a reminder to stream logs to stderr here?

dberenbaum avatar Jan 24 '24 13:01 dberenbaum

Thanks for looking into this!

gregstarr avatar Jan 25 '24 14:01 gregstarr