icdiff icon indicating copy to clipboard operation
icdiff copied to clipboard

icdiff hanging on large files

Open ChamiLamelas opened this issue 2 years ago • 5 comments

Hello,

My colleague and I were interested in using icdiff inside of a university autograding software (linked here). However, we have encountered an issue where calling icdiff on somewhat large files seems to hang.

I have attached two ~1.5 MB UTF-8 files here, these files do differ. Some notes from running diff and icdiff on these files manually:

  • Standard diff runs in less than a second.
  • However, icdiff as well as python3 -m icdiff both seem to hang after displaying file1.txt and file2.txt.

Any input on this issue would be appreciated.

file1.txt file2.txt

ChamiLamelas avatar Jul 01 '23 01:07 ChamiLamelas

This is an issue in python's difflib. You can get diff.py from https://docs.python.org/3/library/difflib.html and you'll see python3 diff.py file1.txt file2.txt hangs (or just is very slow).

I think this is "expected behavior" for difflib: "The basic Ratcliff-Obershelp algorithm is cubic time in the worst case and quadratic time in the expected case. SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common; best case time is linear." --https://docs.python.org/3.5/library/difflib.html#difflib.SequenceMatcher

You could consider filing a bug on difflib, or contributing something there that improves worst-case behavior?

jeffkaufman avatar Jul 04 '23 20:07 jeffkaufman

Is adding an option to use cdifflib something you'd consider?

https://stackoverflow.com/a/66694154/8742968

https://pypi.org/project/cdifflib/

mattrussell2 avatar Jul 04 '23 21:07 mattrussell2

@mattrussell2 I think that wouldn't solve the problem? Skimming that package. It looks like a re-implementation in C, but still the same algorithm?

(It's also missing some parts of difflib that icdiff depends on, but that might not be insurmountable?)

jeffkaufman avatar Jul 04 '23 22:07 jeffkaufman

Fair enough. Yeah 4x is certainly not going to fix polynomial-time issues.

If anyone's interested, the relevant issue re: difflib is here: https://github.com/python/cpython/issues/51180

mattrussell2 avatar Jul 04 '23 22:07 mattrussell2

What about using google's diff-match-patch?

https://github.com/google/diff-match-patch

This also takes an optional timeout parameter.

From my very quick reading of the source code here, seems like this library should be interoperable....but I'll dig into it a bit more.

mattrussell2 avatar Jul 04 '23 22:07 mattrussell2