large inputs cause hang and never finish

Open fanleung opened this issue 6 years ago • 9 comments

I have installed multidiff and run the command “multidiff test1.bin test2.bin” on Windows 10, but it doesn't work...

fanleung avatar Jun 29 '18 01:06 fanleung

Are you getting any output at all, for example an error or exception?

Does the program exit successfully or does it hang?

What’s your Python version? I haven’t tested with anything lower than 3.5

How big are your files? The required memory will be much more than the combined file sizes, so test with files that are a few kilobytes maximum.

I haven’t tested on Windows at all, and it may be that some implementation details have been missed. I’ll get a VM on Monday to test this, but if you could provide the above info it would help me :)

juhakivekas avatar Jul 05 '18 13:07 juhakivekas

Hi, here is more detail as follows.

  1. Windows 10, 64-bit, Python 3.6.4
  2. I have installed multidiff. When I execute the command multidiff -h, it outputs:

         usage: multidiff [-h] [-p PORT] [-s] [-m MODE] [-i INFORMAT] [-o OUTFORMAT] [--html] [file [file ...]]

         N E O N S E N S E
         augmentations inc
         ┌───────────────┐
         │   M U L T I   │
         │    D I F F    │
         │ sensor module │
         └───────────────┘

         positional arguments:
           file                  file or directory to include in multidiff

         optional arguments:
           -h, --help            show this help message and exit
           -p PORT, --port PORT  start a local socket server on the given port
           -s, --stdin           read data from stdin, objects split by newlines
           -m MODE, --mode MODE  mode of operation, either "baseline" or "sequence"
           -i INFORMAT, --informat INFORMAT
                                 input data format:
                                   utf8 (stdin default)
                                   raw (file and server default)
                                   hex
                                   json
           -o OUTFORMAT, --outformat OUTFORMAT
                                 output data format:
                                   utf8
                                   hex
                                   hexdump (default)
           --html                use html for colors instead of ansi codes

  3. Now I want to diff two bin files, so I input the command multidiff 1.bin 2.bin. It hangs and outputs nothing. Only Ctrl+C can exit, and the output is:

         C:\Users\Fanleung\Desktop\multidiff-master\multidiff-master>multidiff 1.bin 2.bin
         Traceback (most recent call last):
           File "C:\Users\Fanleung\AppData\Local\Programs\Python\Python36\Scripts\multidiff-script.py", line 6, in <module>
             from pkg_resources import load_entry_point
           File "C:\Users\Fanleung\AppData\Local\Programs\Python\Python36\lib\site-packages\pkg_resources\__init__.py", line 70, in <module>
             from pkg_resources.extern import appdirs
           File "<frozen importlib._bootstrap>", line 971, in _find_and_load
           File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
           File "<frozen importlib._bootstrap>", line 656, in _load_unlocked
           File "<frozen importlib._bootstrap>", line 626, in _load_backward_compatible
           File "C:\Users\Fanleung\AppData\Local\Programs\Python\Python36\lib\site-packages\pkg_resources\extern\__init__.py", line 43, in load_module
             __import__(extant)
           File "C:\Users\Fanleung\AppData\Local\Programs\Python\Python36\lib\site-packages\pkg_resources\_vendor\appdirs.py", line 510, in <module>
             import win32com.shell
           File "C:\Users\Fanleung\AppData\Local\Programs\Python\Python36\lib\site-packages\win32com\__init__.py", line 6, in <module>
             import pythoncom
           File "C:\Users\Fanleung\AppData\Local\Programs\Python\Python36\lib\site-packages\pythoncom.py", line 3, in <module>
             pywintypes.__import_pywin32_system_module__("pythoncom", globals())
           File "C:\Users\Fanleung\AppData\Local\Programs\Python\Python36\lib\site-packages\win32\lib\pywintypes.py", line 123, in __import_pywin32_system_module__
             mod = imp.load_dynamic(modname, found)
           File "C:\Users\Fanleung\AppData\Local\Programs\Python\Python36\lib\imp.py", line 343, in load_dynamic
             return _load(spec)
         KeyboardInterrupt

fanleung avatar Jul 09 '18 10:07 fanleung

I set up a virtual machine and tested this on Windows 10 with Python 3.7, and everything seems to work fine. Did you install multidiff by running python setup.py install? Are you sure the files are not too large?

juhakivekas avatar Jul 10 '18 08:07 juhakivekas

Ahh... When I created two test bin files of about 4KB each, it finished. What is the file size limit?

fanleung avatar Jul 10 '18 09:07 fanleung

The limit is constrained by the available RAM on your computer, so I’m not sure, but I’d say it’s in the tens or hundreds of megabytes. The whole file will be printed too, so a large file might be a little impractical to work with.

What kind of data are you looking at and what are you trying to find? Small differences in large files?

juhakivekas avatar Jul 10 '18 13:07 juhakivekas

Yes, I am trying to find small differences in large files. The last time the program hung, the test files were about 4MB each. This time I tried to diff two 1MB files, and it runs but takes a long time (about 1 min). Maybe 4MB files would also work, but I didn't wait.
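
For this use case (small differences in large files of the same length), a plain position-by-position comparison runs in linear time. The sketch below is not how multidiff works, and `differing_offsets` is a hypothetical helper name. It assumes no bytes were inserted or deleted, since a single inserted byte would make every later offset look changed.

```python
def differing_offsets(a: bytes, b: bytes) -> list:
    """Return the offsets at which two byte strings differ.

    Bytes are compared position by position, with no alignment step,
    so the cost grows linearly with the input size.
    """
    common = min(len(a), len(b))
    diffs = [i for i in range(common) if a[i] != b[i]]
    # Offsets past the end of the shorter input are reported as differing.
    diffs.extend(range(common, max(len(a), len(b))))
    return diffs


if __name__ == "__main__":
    old = b"\x00\x01\x02\x03\x04\x05"
    new = b"\x00\x01\xff\x03\x04\xee"
    for offset in differing_offsets(old, new):
        print(f"0x{offset:08x}: {old[offset]:02x} -> {new[offset]:02x}")
```

For real files, the two inputs could be read with open(path, "rb").read(), or compared chunk by chunk if they do not fit comfortably in memory.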

fanleung avatar Jul 11 '18 02:07 fanleung

Yes, that's due to Python's difflib always needing to diff the whole sequence. The difflib documentation says:

Timing: The basic Ratcliff-Obershelp algorithm is cubic time in the worst case and quadratic time in the expected case. SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common; best case time is linear.

So in the cubic worst case, 4M takes 4^3=64 times as long as 1M (and 4^2=16 times as long if the behavior is quadratic), which is clearly too long for your use. I also assume you wouldn't really want to see all the matching parts, but only the differing ones? I think making the tool faster would need some work on the underlying diffing algorithms, which is something I'm unlikely to have time for in the near future.
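
The scaling arithmetic can be checked with a few lines of Python. This is only a back-of-the-envelope model that ignores constant factors, and `slowdown` is a hypothetical helper name: if the running time grows as n^k, an input that is r times larger takes roughly r^k times as long.

```python
def slowdown(size_ratio: float, exponent: int) -> float:
    """Estimated slowdown when the input grows by size_ratio and time is O(n**exponent)."""
    return size_ratio ** exponent


if __name__ == "__main__":
    # Going from a 1 MB input to a 4 MB input is a size ratio of 4.
    for label, exponent in (("linear", 1), ("quadratic", 2), ("cubic", 3)):
        print(f"{label}: {slowdown(4, exponent):g}x")
```

Applied to the roughly 1 minute reported for 1MB files, this model suggests about 16 minutes for 4MB files under quadratic behavior and about an hour under cubic behavior.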

juhakivekas avatar Jul 11 '18 09:07 juhakivekas

I modified the multidiff library so that the generated diff only shows the addresses where the bytes have changed. This only works for the hexdump outformat.
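
As a rough illustration of the idea (not the actual patch, whose code is not shown in this thread), the standard library's difflib.SequenceMatcher can report only the non-matching regions by skipping the "equal" opcodes. The function name `changed_regions` is hypothetical.

```python
import difflib


def changed_regions(a: bytes, b: bytes) -> list:
    """Return the SequenceMatcher opcodes that describe changes, skipping 'equal' runs.

    Each entry is (tag, i1, i2, j1, j2): a[i1:i2] is replaced, deleted,
    or inserted relative to b[j1:j2].
    """
    # autojunk=False disables the popularity heuristic that otherwise kicks in
    # for sequences of 200 or more items.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


if __name__ == "__main__":
    old = b"hello world"
    new = b"hello wOrld"
    for tag, i1, i2, j1, j2 in changed_regions(old, new):
        print(f"{tag:8} 0x{i1:08x}: {old[i1:i2].hex()} -> {new[j1:j2].hex()}")
```

This only trims the output. The matcher still diffs the whole sequence, so the running time on large inputs is not improved.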

The output looks something like this - https://pastebin.com/csT3dpRK

I tested it on a 2MB file and it took approximately 10 minutes.

Utkarsh1308 avatar May 30 '19 00:05 Utkarsh1308

Great, if you want me to merge those changes, just make a pull request, but add a command-line flag for the feature. I think the hang issue is related to calculating the diff rather than outputting the result, so I won't be closing this issue :)

juhakivekas avatar Jun 05 '19 11:06 juhakivekas