dictdiffer icon indicating copy to clipboard operation
dictdiffer copied to clipboard

possible to diff lists of dicts?

Open netllama opened this issue 5 years ago • 16 comments

I'm trying to use dictdiffer to get the diff of two lists of dictionaries, but its not working well. It seems to get confused whenever the ordering of the dicts inside the list is not identical between the two lists, or when the number of dictionaries in one list is (slightly) different from the other list.

Is this functionality something that is currently possible and expected to work, or am I trying to accomplish something unsupported?

In case it matters, this is an example of a list of dictionaries that I'm trying to work with:

[{'prefix': '162.245.48.104/29',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.112/29',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.120/29',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.128/27',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.16/28',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.160/29',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.168/29',
  'effective_as_path_length': 1,
  'med': 15,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.176/28',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.192/30',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.200/29',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.208/30',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.216/29',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.224/29',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.232/29',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.240/29',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.248/29',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.32/29',
  'effective_as_path_length': 1,
  'med': 15,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.40/29',
  'effective_as_path_length': 1,
  'med': 15,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.48/29',
  'effective_as_path_length': 1,
  'med': 15,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.56/29',
  'effective_as_path_length': 1,
  'med': 15,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.64/29',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.72/29',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.96/29',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.50.0/24',
  'effective_as_path_length': 1,
  'med': 6,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.51.112/29',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.51.120/29',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.51.128/28',
  'effective_as_path_length': 1,
  'med': 15,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.51.144/29',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']}]

netllama avatar Nov 26 '19 21:11 netllama

Can you provide a smaller example with current and expected results?

There are some limitations to list comparison:

  • it can not detect changes in the order of elements
  • if an element was added in the middle of a list, then they will differ from that possition (remove + add)

jirikuncar avatar Nov 27 '19 11:11 jirikuncar

Thanks. The limitations that you noted are definitely present in the data that I'm working with. It that's expected behavior right now, that's fine. I'll need to use something other than dictdiffer for the project.

netllama avatar Nov 27 '19 16:11 netllama

I am confused by the statement: ”it can not detect changes in the order of elements”.

Since: >>> list(dictdiffer.diff([1,2],[2,1])) [('change', [0], (1, 2)), ('change', [1], (2, 1))]

Can you expand a bit?

mikaelho avatar Nov 27 '19 17:11 mikaelho

Now I explained it to myself. The ideal result, at least for some use cases, would be some kind of a ”swap” operation, instead of the two ”changes”.

mikaelho avatar Nov 27 '19 17:11 mikaelho

Yes, exactly. Sometimes, the order of dicts inside the list doesn't matter.

You should be able to reproduce my original issue by creating a second list of dicts (based on the one that I provided), and making a few minor changes:

  • Add another dict to the middle of the list with the same structure as the others dicts, with a unique value for the prefix key. Then diff the original and new lists. It should only report the addition of 1 new dict, but in my results, it reports multiple other changes erroneously.
  • Re-arrange the order of the dicts in the list, and diff the original and new lists. It shouldn't report any differences, but seems to get confused, and starts comparing each list dict element by index.

netllama avatar Nov 27 '19 17:11 netllama

I think your first issue is covered by @jirikuncar’s explanation of list diffs - everything is ”different” from the addition onward. Your second issue I think stems from the fact that dictdiffer does not compare dicts by identity, but by content - if you have shuffled the list, changing the first element, dictdiffer detects the difference in specific content within the dict, not in its identity.

If you find a better solution for your needs, please share your findings here as well.

mikaelho avatar Nov 27 '19 19:11 mikaelho

I have been experimenting with storing JSON structures in Git to do diffs on big nested objects.

You could do something like:

$ pip install git_json_tree
$ ipython
import subprocess
import git_json_tree

repo = git_json_tree.Repo('/tmp/git_json_tree')
sha_original = git_json_tree.encode(repo, ["your example goes here"])
sha_modified = git_json_tree.encode(repo, ["your modified example goes here"])

subprocess.call(['git', 'diff', sha_original, sha_modified])

jirikuncar avatar Nov 28 '19 12:11 jirikuncar

@jirikuncar I'm running into a similar issue with comparing both lists and sets... if I have two lists A,B

list_a = ["changeA", "changeB"]
list_b = ["changeA", "changeC", "changeB"]

Dictdiff correctly finds the ADD "changeC", but is also reporting a change in index 1 from changeB to changeC. Is there a way to ignore this?

I tried converting the list into sets (since they are unordered and unindexed):

set1 = set(["changeA", "changeB"])
set2 = set(["changeA", "changeC", "changeB"])

But dictdiff is still reporting a CHANGE as

<class 'list'>: [
('add', '', [(0, {'changeC'})]), 
('change', '', ({'changeA', 'changeB'}, {'changeA', 'changeC', 'changeB'}))
]

danielduhh avatar Dec 11 '19 01:12 danielduhh

@danielduhh I have tried with lastest dictdiffer version.

In [1]: import dictdiffer

In [2]: list_a = ["changeA", "changeB"]
   ...: list_b = ["changeA", "changeC", "changeB"]

In [3]: set1 = set(["changeA", "changeB"])
   ...: set2 = set(["changeA", "changeC", "changeB"])

In [4]: list(dictdiffer.diff(list_a, list_b))
Out[4]: [('change', [1], ('changeB', 'changeC')), ('add', '', [(2, 'changeB')])]

In [5]: list(dictdiffer.diff(set1, set2))
Out[5]:
[('add', '', [(0, {'changeC'})]),
 ('change', '', ({'changeA', 'changeB'}, {'changeA', 'changeB', 'changeC'}))]

I can confirm that the result for sets looks wrong. Can you submit a PR with test case and expected results?

jirikuncar avatar Dec 11 '19 10:12 jirikuncar

@danielduhh can you check if https://github.com/inveniosoftware/dictdiffer/pull/133 is solving your problem?

jirikuncar avatar Dec 11 '19 10:12 jirikuncar

FYI, this will be useful for those of us wishing to implement Strategic Three Way Merge Patch for the Kubernetes API :) for things like deployments, where you can have big lists of environment variables that look like so:

{
  "name": "MEDIA_URL",
  "valueFrom": {
    "configMapKeyRef": {
      "key": "MEDIA_URL",
      "name": "my-secret"
    }
  }
}

AFAIK the kubectl cli tool uses a schema that has pre-specified "patch merge keys" to essentially treat each element of the list like a value that is keyed by a specified property within (in the case above, it'd be "name")

as of dictdiffer 0.8.1, I'm still getting the wrong results for lists of dicts like this. I don't think you could solve this in a truly generalized way, but allowing schema-based diff could be a good approach.

Datamance avatar Dec 20 '19 22:12 Datamance

@Datamance basically you would like to provide a custom diff function for certain keys?

jirikuncar avatar Jan 06 '20 12:01 jirikuncar

That could work, and I think you could use that as a building block for the common case of schema-based (particularly JSONSchema) diffing/patching.

Datamance avatar Jan 08 '20 17:01 Datamance

So basically one shoud provide mapping from dotted_node to a function with a same signature as diff:

def diff(..., key=None):
    # after dotted_node ~ :158
    if dotted_node in key:
        for item in key[dotted_node](first, second, node=node, ignore=ignore, path_limit=path_limit, expand=expand, tolerance=tolerance, dot_notation=dot_notation):
            yield item
        return

jirikuncar avatar Jan 09 '20 10:01 jirikuncar

Any updates on this?

aorumbayev avatar Mar 26 '20 22:03 aorumbayev

@aorumbayev feel free to send a PR with a fix discussed above.

jirikuncar avatar Mar 26 '20 23:03 jirikuncar