diff-match-patch
Diff'ing bytes
Is it possible to diff bytes with the Python 3 library? I haven't been able to.
jhogan@bastion:/usr/lib/python3/dist-packages/diff_match_patch$ python3
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from diff_match_patch import diff_match_patch
>>> from uuid import uuid4
>>> dmp = diff_match_patch()
>>> dmp.diff_main(uuid4().bytes, uuid4().bytes)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/diff_match_patch/diff_match_patch.py", line 136, in diff_main
    self.diff_cleanupMerge(diffs)
  File "/usr/lib/python3/dist-packages/diff_match_patch/diff_match_patch.py", line 936, in diff_cleanupMerge
    text_delete += diffs[pointer][1]
TypeError: Can't convert 'bytes' object to str implicitly
Note: I'm using the Ubuntu version of the software (the python3-diff-match-patch package). Also note that the stack trace can differ between runs, but the exception is always raised on the same line. For example:
>>> dmp.diff_main(uuid4().bytes, uuid4().bytes)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/diff_match_patch/diff_match_patch.py", line 129, in diff_main
    diffs = self.diff_compute(text1, text2, checklines, deadline)
  File "/usr/lib/python3/dist-packages/diff_match_patch/diff_match_patch.py", line 196, in diff_compute
    return self.diff_bisect(text1, text2, deadline)
  File "/usr/lib/python3/dist-packages/diff_match_patch/diff_match_patch.py", line 351, in diff_bisect
    return self.diff_bisectSplit(text1, text2, x1, y1, deadline)
  File "/usr/lib/python3/dist-packages/diff_match_patch/diff_match_patch.py", line 377, in diff_bisectSplit
    diffs = self.diff_main(text1a, text2a, False, deadline)
  File "/usr/lib/python3/dist-packages/diff_match_patch/diff_match_patch.py", line 136, in diff_main
    self.diff_cleanupMerge(diffs)
  File "/usr/lib/python3/dist-packages/diff_match_patch/diff_match_patch.py", line 936, in diff_cleanupMerge
    text_delete += diffs[pointer][1]
I assume the stack traces differ because each call uses random input (UUIDs).
Is the solution to convert the binary data to strings first?
dmp.diff_main(str(uuid4().bytes), str(uuid4().bytes))
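One caveat with the call above: in Python 3, str() on a bytes object returns its printable repr (the "b'...'" form, with escapes), not the data itself, so the diff would be computed over that repr. If a one-character-per-byte string is acceptable, decoding as latin-1 is one lossless option, since it maps each byte value 0-255 to the code point of the same number. The snippet below is only a sketch of that conversion and its round trip; the variable names are illustrative, and it does not call the library.

```python
from uuid import uuid4

raw = uuid4().bytes                  # 16 random bytes

as_repr = str(raw)                   # the repr, e.g. "b'\\x8f...'", not the raw data
as_text = raw.decode("latin-1")      # one character per byte, never fails

# The latin-1 string has exactly one character per input byte...
same_length = len(as_text) == len(raw)
# ...and encoding it again recovers the original bytes.
round_trip = as_text.encode("latin-1") == raw
```

Strings built this way could be passed to diff_main, and each diff block encoded back with .encode("latin-1"). Whether the resulting control characters cause trouble in other parts of the library is not something this sketch checks.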
I think this project is for text only.
For those who can't convert raw bytes to strings due to encoding issues, a workaround is to convert the bytes to hex strings and compute the diff on those, since hex digits are plain ASCII.
Note that each byte corresponds to two characters, so the diff may produce blocks with an odd number of characters. In those cases, I found that taking the last character (equivalent to a nibble) of an unchanged block and prepending it to the changed blocks that follow works fine.
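The realignment described above can be sketched roughly as follows. This is not the poster's code: align_hex_diffs is a hypothetical helper, and it assumes its input is a list of (op, text) tuples in the shape diff_main returns (-1 delete, 0 equal, 1 insert), computed over two even-length hex strings. Besides moving a trailing nibble forward, it also handles two cases the description does not mention: an unchanged block that starts mid-byte in both inputs gives its first nibble to the preceding change, and one that is misaligned in only one input is treated as changed.

```python
def align_hex_diffs(diffs):
    """Re-split a hex-string diff so that every block covers whole bytes.

    Returns a list of (op, bytes) tuples.
    """
    out = []
    deleted = inserted = ""          # pending hex digits of the current change

    def flush():
        nonlocal deleted, inserted
        if deleted:
            out.append((-1, bytes.fromhex(deleted)))
        if inserted:
            out.append((1, bytes.fromhex(inserted)))
        deleted = inserted = ""

    for op, text in diffs:
        if op == -1:
            deleted += text
        elif op == 1:
            inserted += text
        elif len(deleted) % 2 != len(inserted) % 2:
            # The equal run starts mid-byte in only one input, so it cannot
            # be byte-aligned in both: fold it into the change.
            deleted += text
            inserted += text
        else:
            if len(deleted) % 2:
                # Starts mid-byte in both inputs: the leading nibble
                # completes the last byte of the pending change.
                deleted += text[0]
                inserted += text[0]
                text = text[1:]
            flush()
            if len(text) % 2:
                # Ends mid-byte: carry the trailing nibble into the next change.
                deleted = inserted = text[-1]
                text = text[:-1]
            if text:
                out.append((0, bytes.fromhex(text)))
    flush()
    return out


# Hand-written diff of "12345678" vs "12375678" in the (op, text) shape;
# real output from diff_main may split the strings differently.
hex_diffs = [(0, "123"), (-1, "4"), (1, "7"), (0, "5678")]
aligned = align_hex_diffs(hex_diffs)
```

With the library installed, the input would come from something like dmp.diff_main(a.hex(), b.hex()), where a and b are the two bytes objects (bytes.hex() is available from Python 3.5).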
See the following example, which deals with odd blocks and outputs differences in unified format.