dvc diff or some dataset management tooling

Open villasv opened this issue 6 years ago • 37 comments

dvc plays really well with git, but one thing that I still miss in a data version control system that I really value in source version control systems is the tooling to inspect patches. Because data should be a deterministic reproducible output from the source code, almost all important changes are found in the code history. But frequently I also want to inspect what changed in the data.

Concrete scenario: let's say that I have an input csv and a transformation script that outputs a sanitized version of that input file. Then, I make a very small change in the sanitizing strategy, and run the command again. I can see that the output file's hash changed, so I know that my code indeed changed behavior. But 99.999% of the output data points stayed the same; only a very minor portion of the file changed.

How can I inspect the data points that changed? Or the files in a very large output directory that changed? I can't git diff those files anymore because they're ignored. I believe that this is achievable with dvc.

villasv avatar Jun 13 '18 17:06 villasv

Hi @villasv !

Great idea! We will be sure to get to implementing it after 0.9.8.

Thanks, Ruslan

efiop avatar Jun 13 '18 19:06 efiop

Thank you @villasv. The idea is great!

But we should understand that in many cases dvc works at GB scale, where diff has no meaning. The command might be abused and many users could conclude that dvc diff is a buggy/slow command.

Let's check the file size before dvc diff and exit with an error if the size exceeds the limit. The limit should be defined in the dvc config file. Default value .... What do you think about 100Mb?

What do you guys think about introducing the limit?
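
A minimal sketch of the proposed check, assuming the limit would be read from the config elsewhere. `DEFAULT_DIFF_LIMIT`, `DiffTooLargeError` and `check_diff_size` are hypothetical names for illustration, not part of DVC:

```python
import os

# Hypothetical default mirroring the 100 MB suggested above.
DEFAULT_DIFF_LIMIT = 100 * 1024 * 1024


class DiffTooLargeError(Exception):
    """Raised when a file exceeds the configured diff size limit."""


def check_diff_size(path, limit=DEFAULT_DIFF_LIMIT):
    """Return the file size in bytes, or raise if it exceeds `limit`."""
    size = os.path.getsize(path)
    if size > limit:
        raise DiffTooLargeError(
            f"{path} is {size} bytes, above the diff limit of {limit} bytes"
        )
    return size
```

The same function could just as well print a warning instead of raising, if forbidding large files turns out to be too strict.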

dmpetrov avatar Jun 13 '18 23:06 dmpetrov

@dmpetrov I thought about introducing a limit, but there is actually no good reason to do that, since we can simply leave it to the user to decide whether to kill dvc diff if it takes too long or to wait.

efiop avatar Jun 13 '18 23:06 efiop

btw... there are no diffs in HDFS. It is one more thing to consider.

just throwing out some idea - hide the FS specific and SCM\Git specific commands under a special command like dvc whatever diff

dmpetrov avatar Jun 13 '18 23:06 dmpetrov

I think that performance is a valid concern, but since dvc knows the file sizes, it could issue some aggressive warnings beforehand instead of simply forbidding files above a certain size. Some people will still want to diff even if it takes hours and outputs GBs of data.

One extra caveat is that the previous revision, from git's point of view, might not have the output file in the local cache, in which case it's necessary to either do a dvc pull or, in the worst case, a retroactive dvc repro.
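
A rough sketch of how that check could look, assuming the cache layout seen in paths such as ./.dvc/cache/e0/6bd8b3a5945f043687c7a4256b0291 (first two hex characters of the md5 as the directory). `cache_path` and `is_cached` are hypothetical helpers, not DVC functions:

```python
import os


def cache_path(dvc_root, md5):
    """Build the local cache path for a checksum.

    Assumed layout: <root>/.dvc/cache/<first 2 hex chars>/<remaining chars>.
    """
    return os.path.join(dvc_root, ".dvc", "cache", md5[:2], md5[2:])


def is_cached(dvc_root, md5):
    """Return True if the object for `md5` exists in the local cache."""
    return os.path.isfile(cache_path(dvc_root, md5))
```

If `is_cached` returns False for the old revision's checksum, the file would have to be pulled or reproduced before any diff can run.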

villasv avatar Jun 13 '18 23:06 villasv

Relevant new development: https://youtu.be/fw6P6VFPo24

villasv avatar Jun 14 '18 23:06 villasv

@villasv I agree with you. File size limitation is definitely not the way to go. We might add a warning, but considering that other tools (e.g. plain diff and git diff) don't do that, I think that we are pretty safe not printing anything as well. If the operation takes too long, the user could just CTRL+C it as usual :)

Missing local cache is also a great point and we will be sure to account for that (e.g. maybe something like the proposed --fetch for dvc metrics).

Thanks for the link!

efiop avatar Jun 14 '18 23:06 efiop

Just a heads up. I investigated the library I mentioned (tdda) but it wasn't of much help. I made a small script that achieves what I want for CSV and JSONLines. I'm pretty sure that if this was incorporated in DVC it would be a bit cleaner, because I wouldn't need to invoke so many subprocesses. Still, I decided to implement it standalone instead of inside a PR, just as a Proof of Concept, and maybe we don't really want to put something so file-type-specific and formatting-opinionated into DVC (yet).

Perhaps a dvc plugin? The first of its kind?

My goal was to inspect each line separately and aggregate additions (pprint the whole item), deletions (print the id of the deleted item) and editions (pprint a jsondiff).

Examples using my own project:

$ python code/reftest.py diff --data-file data/orgs.txt --index-path wikidata.@id
Comparing data/orgs.txt with ./.dvc/cache/e0/6bd8b3a5945f043687c7a4256b0291
Removed:
http://www.wikidata.org/entity/Q0
http://www.wikidata.org/entity/Q1
---
Edited:
http://www.wikidata.org/entity/Q414163 {update: {'wikidata': {update: {'i18n': {update: {'en': {update: {'label': 'Academy '
                                                                           'of '
                                                                           'Sciences '
                                                                           'and '
                                                                           'Literature '
                                                                           'Mainz'}}}}}}}}
http://www.wikidata.org/entity/Q503473 {update: {'wikidata': {update: {'i18n': {update: {'en': {update: {'names': {insert: [(2,
                                                                                      'UNIGE')]}}}}}}}}}
http://www.wikidata.org/entity/Q797585 {update: {'wikidata': {update: {'i18n': {update: {'en': {update: {'names': ['Babasaheb '
                                                                            'Ambedkar '
                                                                            'University',
                                                                            'Bhimrao '
                                                                            'Ambedkar '
                                                                            'University']}}}}}}}}
http://www.wikidata.org/entity/Q1619487 {update: {'wikidata': {update: {'i18n': {update: {'en': {update: {'label': 'Willy-Brandt-school'}}}}}}}}
---
Added:
{'wikidata': {'@id': 'http://www.wikidata.org/entity/Q28704642',
              'i18n': {'en': {'label': 'École Française Internationale de Kiev',
                              'names': ['EFIK',
                                        'École française internationale de '
                                        'Kiev']}}}}
{'wikidata': {'@id': 'http://www.wikidata.org/entity/Q57420477',
              'i18n': {'en': {'label': 'Kerch Polytechnic College',
                              'names': []}}}}
{'wikidata': {'@id': 'http://www.wikidata.org/entity/Q57428768',
              'i18n': {'en': {'label': None, 'names': []}}}}
---
Total: R(2) / E(1) / A(1)
$ python code/reftest.py diff --data-file data/orgs.csv --index-col 0
Comparing data/orgs.csv with ./.dvc/cache/99/769e4afd29269400056d9b17cc39a8
Removed:
---
Edited:
Q414163 {delete: [9], insert: [(9, 'Academy of Sciences and Literature Mainz')]}
Q1619487 {delete: [9], insert: [(9, 'Willy-Brandt-school')]}
Q1698809 {delete: [9], insert: [(9, 'Johanneum Breslau')]}
Q3438435 {delete: [9], insert: [(9, 'San Marcos University')]}
---
Added:
['Q28704642',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'École Française Internationale de Kiev',
 '',
 '',
 '',
 '',
 '']
['Q57420477',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'Kerch Polytechnic College',
 '',
 '',
 '',
 '',
 '']
['Q57428768', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
---
Total: R(0) / E(4) / A(3)

The script:

import csv
import difflib
import fire
import json
import jsondiff
import pprint
import yaml
import subprocess
import sys


class ReftestManager:
    def __init__(self, data_file):
        self._new = data_file
        self._old = data_file + '.dvc'

        # now we translate _old to its true cache path

        self._dvc_root = subprocess.Popen(
            ['dvc', 'root'],
            stdout=subprocess.PIPE,
        ).communicate()[0].rstrip().decode('utf-8')

        dvc_file = subprocess.Popen(
            ['git', 'show', f'HEAD:{self._old}'],
            stdout=subprocess.PIPE,
        ).communicate()[0].rstrip().decode('utf-8')
        # safe_load: yaml.load() without an explicit Loader is unsafe and fails on PyYAML >= 6
        dvc_spec = yaml.safe_load(dvc_file)

        cache_md5 = next(
            out['md5']
            for out in dvc_spec['outs']
            if out['path'].split('/')[-1] == data_file.split('/')[-1]
        )

        self._old = f'{self._dvc_root}/.dvc/cache/{cache_md5[:2]}/{cache_md5[2:]}'

        print(f"Comparing {self._new} with {self._old}")

    def diff(self, index_col=None, index_path=None):
        if not ((index_col is not None) ^ (index_path is not None)):
            raise ValueError("Inform either index col or index path")

        with open(self._new) as fnew:
            new_lines = fnew.readlines()
        with open(self._old) as fold:
            old_lines = fold.readlines()

        diff = [
            line for line in
            difflib.unified_diff(
                old_lines, new_lines,
                fromfile=self._old, tofile=self._new, n=0,
            )
            if not line.startswith('---')
                and not line.startswith('+++')
                and not line.startswith('@@')
        ]

        rmved = [line[1:] for line in diff if line[0] == '-']
        added = [line[1:] for line in diff if line[0] == '+']

        if index_col is not None:
            rmved = {
                line[index_col]: line
                for line in csv.reader(rmved)
            }
            added = {
                line[index_col]: line
                for line in csv.reader(added)
            }

        if index_path is not None:
            def follow_path(json, key_chain):
                for key in key_chain:
                    json = json[key]
                return json
            key_chain = index_path.split('.')
            rmved = {
                follow_path(json.loads(line), key_chain): json.loads(line)
                for line in rmved
            }
            added = {
                follow_path(json.loads(line), key_chain): json.loads(line)
                for line in added
            }

        edits = {k:(v,added[k]) for k,v in rmved.items() if k in added}
        rmved = {k:v for k,v in rmved.items() if k not in edits }
        added = {k:v for k,v in added.items() if k not in edits }

        print("Removed:")
        for k in rmved:
            print(k)
        print("---")

        print("Edited:")
        for k,(old,new) in edits.items():
            print(k, pprint.pformat(jsondiff.diff(old, new, syntax='explicit')))
        print("---")

        print("Added:")
        for k in added:
            pprint.pprint(added[k])
        print("---")

        print(f"Total: R({len(rmved)}) / E({len(edits)}) / A({len(added)})")


if __name__ == "__main__":
    fire.Fire(ReftestManager)

villasv avatar Oct 18 '18 20:10 villasv

Unfortunately, because I invoke dvc as a subprocess, sometimes the output of dvc root comes mixed with the "Checking for updates..." message:

FileNotFoundError: [Errno 2] No such file or directory: 'Checking for updates...\n./.dvc/cache/e0/6bd8b3a5945f043687c7a4256b0291'

:-) but that's really minor
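
One way to sidestep this, sketched under the assumption that the actual answer is always the last non-empty line of stdout (`last_line` is a hypothetical helper, not part of the script above):

```python
def last_line(output):
    """Return the last non-empty line of a command's decoded stdout."""
    lines = [line for line in output.splitlines() if line.strip()]
    return lines[-1] if lines else ""
```

Applying it to the decoded output of dvc root before building the cache path would drop the update notice.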

villasv avatar Oct 18 '18 20:10 villasv

Hi @villasv !

This looks amazing! I see no reason to make it a plugin, since it looks well suited to be a part of core dvc functionality. Would you like to file a PR with dvc diff? Please feel free to ping us if you need any help.

Thanks, Ruslan

efiop avatar Oct 18 '18 21:10 efiop

@villasv Btw, we have introduced a community chat recently at https://dvc.org/chat and we would be very honored to have you there :slightly_smiling_face:

efiop avatar Oct 18 '18 21:10 efiop

Big thanks to @django-kz for contributing dvc diff command in https://github.com/iterative/dvc/pull/1778 ! :tada: :rocket: Currently it shows a difference in a number of files and their sizes between git revisions. We could add actual data diff functionality on top of it :slightly_smiling_face: dvc diff is going to be released in 0.35.1 today, please feel free to give it a try!

efiop avatar Apr 07 '19 23:04 efiop

Hi, I just discovered this issue mentioned on Discord. Here's an idea: Not ideal but we could provide a short guide in the docs for now on how to use git checkout, dvc checkout, copy the tracked text file(s) in question (from 2 different versions) to /tmp/a/ and /tmp/b/ , and run diff on them manually.

jorgeorpinel avatar May 24 '19 00:05 jorgeorpinel

That would be amazing @jorgeorpinel ! Great idea!

efiop avatar May 24 '19 00:05 efiop

I don't like putting this hack into the official docs, to be honest, especially if we are thinking about implementing this in the future. I would document this workaround in this ticket and maybe link to it from the docs.

shcheklein avatar May 24 '19 02:05 shcheklein

Or a trick I’ve seen around: a question and semi-official-but-ok-to-become-dated answer on Stack Overflow

villasv avatar May 24 '19 02:05 villasv

OK here's a quick example on how to do it based on https://github.com/iterative/example-get-started (assumes the project has been cloned and user moved into the local repo). Also, we'll focus on files model.pkl and auc.metric:

EDIT: The behavior of dvc diff has changed slightly over time, but the general idea should still be valid.

$ dvc pull -aT
Preparing to download data from 'https://remote.dvc.org/get-started'
...
$ dvc diff HEAD^
dvc diff from 30a96ce to 6c73875
...
diff for 'model.pkl'
-model.pkl with md5 a66489653d1b6a8ba989799367b32c43
+model.pkl with md5 3863d0e317dee0a55c4e59d2ec0eef33
...
diff for 'auc.metric'
-auc.metric with md5 f58e5ccd66bf1195b53f458e7f619ab8
...
$ mkdir /tmp/a
$ cp model.pkl /tmp/a/model.pkl
$ cp auc.metric /tmp/a/auc.metric
$ git checkout HEAD^
...
$ dvc checkout
[##############################] 100% Checkout finished!
$ mkdir /tmp/b
$ cp model.pkl /tmp/b/model.pkl
$ cp auc.metric /tmp/b/auc.metric
$ diff /tmp/a/model.pkl /tmp/b/model.pkl 
Binary files /tmp/a/model.pkl and /tmp/b/model.pkl differ

Finally, use diff once all the changed files of interest are ready to be compared:

$ diff /tmp/a/auc.metric /tmp/b/auc.metric
1c1
< 0.602818
---
> 0.588426
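
For text files, the final comparison can also be done from Python with the standard difflib module. This is only a sketch; `text_diff` is a hypothetical helper and it reads both files fully into memory:

```python
import difflib


def text_diff(old_path, new_path):
    """Return a unified diff of two text files as a list of lines."""
    with open(old_path) as f_old:
        old_lines = f_old.readlines()
    with open(new_path) as f_new:
        new_lines = f_new.readlines()
    return list(
        difflib.unified_diff(
            old_lines, new_lines, fromfile=old_path, tofile=new_path
        )
    )
```

With the files prepared as above, `print("".join(text_diff("/tmp/a/auc.metric", "/tmp/b/auc.metric")))` would show the changed metric value.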

UPDATE: Link to this comment added to https://dvc.org/doc/command-reference/diff in https://github.com/iterative/dvc.org/pull/490/commits/9ce552a426387731c1a62f81e57b0a4fc4c7353e

  • If someone asks the question on SO, I'll be glad to copy this example as answer.

jorgeorpinel avatar Jul 18 '19 07:07 jorgeorpinel

  • [ ] Remember to update the dvc diff cmd ref (possibly removing the note that links to this issue) and tab completion scripts (if needed) when/if this is addressed.

jorgeorpinel avatar Dec 10 '19 22:12 jorgeorpinel

Hi, I have read the docs and this thread but I still have problems with the dvc diff command. Probably I'm missing something, and I would be glad if someone could correct me. I have a problematic use case which I describe below.

I have a directory with several binary files under control of dvc. Whenever I execute dvc diff I get statistics about changed files. For example:

$dvc diff -t data/myfiles 2a19a244a218088c8f1313f5c528d9cf878bb7af
diff for 'data/myfiles'
-data/myfiles with md5 51cf02d89b91e204a0876563823f2c90.dir
+data/myfiles with md5 3e5c51dbc302efbd9e9c821483a19e8f.dir

116 files untouched, 1 file modified, 1 file added, 0 files deleted, size was increased by 12.3 kB

The output is not really helpful because it doesn't provide any information what exactly has changed in the directory. Of course it shows me that some files have changed but I knew it anyway. Since I work with binary files I don't think I need line-by-line diff but rather list of changed binary files. The possible output for dvc diff may look like this:

$dvc diff -t data/myfiles 2a19a244a218088c8f1313f5c528d9cf878bb7af
diff for 'data/myfiles'
-data/myfiles with md5 51cf02d89b91e204a0876563823f2c90.dir
+data/myfiles with md5 3e5c51dbc302efbd9e9c821483a19e8f.dir

116 files untouched, 1 file modified, 1 file added, 0 files deleted, size was increased by 12.3 kB

Modified:
-data/myfiles/117.jpg

Added:
-data/myfiles/118.jpg

P.S.: Of course I can check out different versions of data/myfiles, save them to temporary directories and then calculate hash sums for each pair of files in those temporary directories, but it does seem like too much work for such a simple task. I really hope there is a better way.
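
For reference, the manual approach from the P.S. could be sketched like this, assuming the two versions have already been copied into two temporary directories. `file_md5`, `dir_checksums` and `compare_dirs` are hypothetical helpers, not DVC functions:

```python
import hashlib
import os


def file_md5(path, chunk_size=1024 * 1024):
    """Compute the MD5 of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dir_checksums(root):
    """Map each file path (relative to `root`) to its MD5."""
    sums = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            sums[os.path.relpath(full, root)] = file_md5(full)
    return sums


def compare_dirs(old_root, new_root):
    """List added, deleted and modified files between two directories."""
    old, new = dir_checksums(old_root), dir_checksums(new_root)
    common = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "deleted": sorted(set(old) - set(new)),
        "modified": sorted(p for p in common if old[p] != new[p]),
    }
```

This works on binary files too, since it only compares checksums and never looks at line content.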

nik123 avatar Dec 24 '19 03:12 nik123

Thank you @nik123 ! I've just created a detailed issue about the dvc diff output #2982. Please vote for it! And please add comments if I missed anything.

dmpetrov avatar Dec 24 '19 10:12 dmpetrov

The output is not really helpful because it doesn't provide any information what exactly has changed in the directory.

@nik123 If the directory was dvc added as a whole, currently DVC doesn't examine what's inside so you only get those general stats. We are working on providing this kind of granularity for all of our commands though 🙂 Please see Dmitry's https://github.com/iterative/dvc/issues/2982#issuecomment-568715481 like he mentioned.

jorgeorpinel avatar Dec 24 '19 16:12 jorgeorpinel

@nik123 , @dmpetrov - thank you for your response! The issue #2982 is exactly what I need

nik123 avatar Dec 25 '19 04:12 nik123

Big thanks to @django-kz for contributing dvc diff command in #1778 ! 🎉 🚀 Currently it shows a difference in a number of files and their sizes between git revisions. We could add actual data diff functionality on top of it 🙂 dvc diff is going to be released in 0.35.1 today, please feel free to give it a try! @efiop

Copied from Discord, qna channel:

I think what the comment there meant to me, and what my use case needs, is a data diff. Let's say I changed the dataset, then committed the change. How do I know what changed? Right now dvc diff only shows what files changed. It would've been so beneficial if it shows something like git diff with a line by line diff

ammarasmro avatar May 19 '20 20:05 ammarasmro

Hi @ammarasmro

It would've been so beneficial if it shows something like git diff with a line by line diff

Right, this is discussed above. There are probably thousands of data formats, so this is not as easy as with regular diff, which only works on plain text files. Do you have a more specific use case? E.g. a certain data format or a set of formats? Thanks

jorgeorpinel avatar May 20 '20 02:05 jorgeorpinel

So the particular use case that raised this issue was a text dataset. The process we, the ML team, have used is that after the data team gets data in any format, they process it and we get it as CSV, TSV, TXT, JSON, or .md files. So a lot of our experiments at least start with a text data file. We also use other formats like pickle, but I'm guessing diffing those would be more complicated than text.

ammarasmro avatar May 20 '20 13:05 ammarasmro

Would be great to have line-by-line comparison at least for "text-based" files (TXT, CSV, JSON, MD, YAML, TOML, XML, ...). By the way, great job with dvc, it's making my life easier!

stefanocoretta avatar Aug 22 '20 15:08 stefanocoretta

@ammarasmro and @stefanocoretta for now you can refer to the https://github.com/iterative/dvc/issues/770#issuecomment-512693256 above for a procedure to do that. It's not clear that we want such a feature for really large files, even when they're plaintext. But yes, maybe!

jorgeorpinel avatar Aug 22 '20 18:08 jorgeorpinel

A request from a high-priority client:

  • They need the ability to diff text files (CSV, JSON, YAML, etc) tracked by DVC
  • The suggested workaround (below) is a little too verbose for them
dvc get --rev commit_sha -o myfile.csv.temp . myfile.csv
diff myfile.csv.temp myfile.csv
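
A less verbose variant could wrap those two steps in a small script. This is only a sketch: `dvc_get_cmd` and `diff_against_rev` are hypothetical helpers, the dvc get invocation is the one from the workaround above, and it assumes the script runs from the repository root with dvc on the PATH:

```python
import difflib
import os
import subprocess
import tempfile


def dvc_get_cmd(rev, path, out):
    """Build the `dvc get` invocation used in the workaround above."""
    return ["dvc", "get", "--rev", rev, "-o", out, ".", path]


def diff_against_rev(rev, path):
    """Fetch `path` as of `rev` and return a unified diff against the workspace copy."""
    with tempfile.TemporaryDirectory() as tmp:
        old_path = os.path.join(tmp, os.path.basename(path))
        # Raises CalledProcessError if dvc exits with a non-zero status.
        subprocess.run(dvc_get_cmd(rev, path, old_path), check=True)
        with open(old_path) as f_old, open(path) as f_new:
            return list(
                difflib.unified_diff(
                    f_old.readlines(),
                    f_new.readlines(),
                    fromfile=f"{path}@{rev}",
                    tofile=path,
                )
            )
```

Calling `diff_against_rev("commit_sha", "myfile.csv")` would then return the diff lines without leaving a temporary file in the workspace.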

alex000kim avatar Sep 20 '22 15:09 alex000kim

Should it fall within DVC's feature set? Maybe we can find better diff tools to recommend instead.

jorgeorpinel avatar Sep 20 '22 21:09 jorgeorpinel

The request was mostly about an easy way to compare any text files versioned by DVC and not so much about showing data format-specific diff. I presume many people would expect dvc diff to work similarly to git diff, but for files versioned by DVC.

alex000kim avatar Sep 21 '22 02:09 alex000kim