ydiff
Hang on large files
Nice work! I have been using this tool for a while and it works great. However, it seems to hang when displaying differences in large files. For example:
diff -u largelog1.log largelog2.log | ydiff # Hangs
Displaying stored differences also hangs:
diff -u largelog1.log largelog2.log > largediff.dif
cat largediff.dif | ydiff # Hangs
Note:
cat largediff.dif | less # Does not hang
Hi!
Would you mind telling me:
- Your OS info
- Python version
- Do you have any options enabled via environment variable? (env | grep DIFF_OPTIONS)
- How large is the largediff.dif file?
- Does the diff file contain very long lines?
It might be poor performance of the built-in Python difflib. I will take a look as soon as possible.
Thank you.
In reply to the original report by Guillermo García Bunster (Thu, Sep 1, 2022, 7:27 PM), https://github.com/ymattw/ydiff/issues/109
Hi,
- OS info: CentOS 7
- Python version: 3.9.2
- Do you have any options enabled via environment variable?: No
- How large is the largediff.dif file?: 2MB and ~50k lines
- Does the diff file contain very long lines?: 500 chars max width
Thank you. A 2 MB diff with lines of at most 500 characters should not cause any performance issue. The code does not load the whole file into memory before handling it. Internally it mostly reads line by line and yields a chunk when one is ready, so a large file should not be a problem by itself.
I originally thought it might be a regression in Python 3's difflib, but I reproduced the problem with a 5 MB diff file on both Python 2.7 and 3.10. Interestingly, the problem seems to be in mdiff(), so it is still in the Python library.
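To illustrate the "read line by line, yield a chunk when ready" idea, here is a minimal sketch. It is not ydiff's actual parser: `iter_hunks` and `HUNK_RE` are hypothetical names, and the sketch assumes a well-formed unified diff whose hunk headers carry the usual line counts.

```python
import re
from typing import Iterable, Iterator, List

# Hunk header of a unified diff, e.g. "@@ -2,1380 +6,1387 @@".
# A missing count means 1, as in "@@ -3 +3 @@".
HUNK_RE = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


def iter_hunks(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each hunk (header plus body) as soon as it is complete."""
    hunk: List[str] = []
    old_left = new_left = 0  # body lines still expected on each side
    for line in lines:
        if old_left <= 0 and new_left <= 0:
            # Between hunks: skip file headers and look for the next hunk.
            m = HUNK_RE.match(line)
            if m:
                hunk = [line]
                old_left = int(m.group(1) or 1)
                new_left = int(m.group(2) or 1)
            continue
        hunk.append(line)
        if line.startswith("-"):
            old_left -= 1
        elif line.startswith("+"):
            new_left -= 1
        elif not line.startswith("\\"):
            # Context line (leading space, or empty if whitespace was stripped).
            old_left -= 1
            new_left -= 1
        if old_left <= 0 and new_left <= 0:
            yield hunk
            hunk = []
    if hunk:
        # Input ended in the middle of a hunk; hand over what we have.
        yield hunk
```

Only one hunk is held in memory at a time, so the size of the whole diff matters much less than the size of its largest hunk. Passing `sys.stdin` as `lines` would consume piped diff output incrementally.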
I do not have enough time to troubleshoot soon but I will try my best...
For your use case, maybe use vimdiff largelog1.log largelog2.log instead for now.
@ggarcia-ee Could you confirm that your test diff contains a large diff chunk (hunk)? Look for the @@ metadata lines. The example below is a comparably large chunk: it covers 1380 lines of the old file starting at line 2, and 1387 lines of the new file starting at line 6.
@@ -2,1380 +6,1387 @@
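For checking a stored diff for large hunks, a small sketch like the following may help. `HunkRange`, `parse_hunk_header` and `largest_hunks` are hypothetical helper names made up for this example, not part of ydiff.

```python
import re
from typing import Iterable, List, NamedTuple


class HunkRange(NamedTuple):
    old_start: int
    old_count: int
    new_start: int
    new_count: int


# "@@ -<old_start>,<old_count> +<new_start>,<new_count> @@"; a count may be omitted (means 1).
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_header(line: str) -> HunkRange:
    """Parse one '@@ ... @@' line into start/count pairs."""
    m = HUNK_HEADER.match(line)
    if m is None:
        raise ValueError("not a unified diff hunk header: %r" % line)
    old_start, old_count, new_start, new_count = m.groups()
    return HunkRange(int(old_start), int(old_count or 1),
                     int(new_start), int(new_count or 1))


def largest_hunks(lines: Iterable[str], top: int = 3) -> List[HunkRange]:
    """Return the `top` hunks with the most lines (old side plus new side)."""
    ranges = [parse_hunk_header(l) for l in lines if HUNK_HEADER.match(l)]
    ranges.sort(key=lambda r: r.old_count + r.new_count, reverse=True)
    return ranges[:top]
```

For the header above, `parse_hunk_header("@@ -2,1380 +6,1387 @@")` gives `old_start=2, old_count=1380, new_start=6, new_count=1387`. Running `largest_hunks(open("largediff.dif"))` would list the biggest hunks in the stored diff.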
I have proof that the Python difflib performs poorly for large hunks. I just wanted to confirm that this is also the case on your side.
If you leave the ydiff.py command running, it should eventually finish. Try using time to measure.
Here is the proof that Python's difflib performs very poorly on a large "hunk".
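Independently of ydiff, the effect can be measured on difflib alone. The sketch below times `difflib.ndiff` (the public function that `_mdiff` in the profile is built on) over synthetic hunks of growing size. The log-like lines are made up for illustration, and absolute timings will vary by machine and Python version.

```python
import difflib
import time
from typing import List, Tuple


def make_hunk(n: int) -> Tuple[List[str], List[str]]:
    """Build two n-line blocks in which almost every line differs slightly."""
    old = ["%06d request handled in %d ms status=200\n" % (i, i * 7 % 1000)
           for i in range(n)]
    new = ["%06d request handled in %d ms status=500\n" % (i, i * 11 % 1000)
           for i in range(n)]
    return old, new


def time_ndiff(n: int) -> float:
    """Return the seconds difflib.ndiff needs for one n-line replace block."""
    old, new = make_hunk(n)
    start = time.perf_counter()
    for _ in difflib.ndiff(old, new):
        pass  # consume the generator; the comparison work happens lazily
    return time.perf_counter() - start


if __name__ == "__main__":
    for n in (20, 40, 80):
        print("%3d changed lines: %.3f s" % (n, time_ndiff(n)))
```

If the time grows much faster than the number of lines, the cost is coming from the pairwise line matching inside difflib rather than from reading the file.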
⮕ % make profile-difflib
tests/profile.sh tests/large-hunk/tao.diff
Wed Jun 19 21:17:55 2024 stats.3624842.tmp
63246210 function calls (63145498 primitive calls) in 26.623 seconds
Ordered by: internal time
List reduced from 506 to 79 due to restriction <'ydiff|difflib'>
ncalls tottime percall cumtime percall filename:lineno(function)
491819 8.660 0.000 11.380 0.000 difflib.py:622(quick_ratio)
5166013 5.031 0.000 8.228 0.000 difflib.py:651(real_quick_ratio)
53538/3916 3.679 0.000 26.256 0.007 difflib.py:893(_fancy_replace)
5168599 1.583 0.000 1.583 0.000 difflib.py:196(set_seq1)
5663054 1.211 0.000 1.211 0.000 difflib.py:39(_calculate_ratio)
20477 0.470 0.000 0.652 0.000 difflib.py:266(__chain_b)
21404 0.410 0.000 0.502 0.000 difflib.py:305(find_longest_match)
2464 0.094 0.000 0.159 0.000 ydiff.py:433(_fit_with_marker_mix)
2787 0.071 0.000 0.090 0.000 ydiff.py:106(strsplit)
6435 0.071 0.000 0.600 0.000 difflib.py:421(get_matching_blocks)
283636 0.039 0.000 0.039 0.000 difflib.py:1061(IS_CHARACTER_JUNK)
85178 0.034 0.000 0.041 0.000 difflib.py:717(<genexpr>)
21089 0.021 0.000 0.674 0.000 difflib.py:222(set_seq2)
50978/3914 0.017 0.000 17.518 0.004 difflib.py:987(_fancy_helper)
1391 0.016 0.000 26.581 0.019 ydiff.py:416(_markup_side_by_side)
5222 0.015 0.000 0.493 0.000 difflib.py:597(ratio)
1408 0.011 0.000 26.296 0.019 difflib.py:1438(_line_iterator)
2774 0.009 0.000 0.017 0.000 difflib.py:1382(_make_line)
2 0.008 0.004 0.026 0.013 ydiff.py:284(parse)
1213 0.005 0.000 0.140 0.000 difflib.py:492(get_opcodes)
1388 0.005 0.000 26.304 0.019 difflib.py:1526(_line_pair_iterator)
21663 0.005 0.000 0.005 0.000 difflib.py:619(<genexpr>)
2774 0.004 0.000 0.094 0.000 ydiff.py:149(strtrim)
4935 0.004 0.000 0.062 0.000 difflib.py:999(_qformat)
2774 0.003 0.000 0.006 0.000 ydiff.py:419(_normalize)
2424 0.003 0.000 0.057 0.000 difflib.py:715(_keep_original_ws)
2586 0.002 0.000 0.040 0.000 difflib.py:184(set_seqs)
1334 0.002 0.000 0.003 0.000 difflib.py:1415(record_sub_info)
1 0.002 0.002 26.613 26.613 ydiff.py:590(markup_to_pager)
2771 0.002 0.000 0.006 0.000 ydiff.py:254(is_old)
2774 0.002 0.000 0.003 0.000 ydiff.py:626(decode)
1374 0.002 0.000 0.011 0.000 difflib.py:120(__init__)
3919 0.001 0.000 26.261 0.007 difflib.py:833(compare)
2772 0.001 0.000 0.002 0.000 ydiff.py:225(is_hunk_meta)
1408 0.001 0.000 0.001 0.000 difflib.py:1460(<listcomp>)
4158 0.001 0.000 0.002 0.000 ydiff.py:219(is_old_path)
4157 0.001 0.000 0.002 0.000 ydiff.py:222(is_new_path)
2771 0.001 0.000 0.001 0.000 ydiff.py:176(append)
2771 0.001 0.000 0.001 0.000 ydiff.py:251(parse_hunk_line)
1387 0.001 0.000 0.002 0.000 ydiff.py:261(is_new)
1391 0.001 0.000 26.582 0.019 ydiff.py:374(markup)
1702 0.001 0.000 0.001 0.000 ydiff.py:102(colorize)
1388 0.001 0.000 26.305 0.019 difflib.py:1340(_mdiff)
1 0.000 0.000 0.000 0.000 ydiff.py:200(<listcomp>)
1 0.000 0.000 0.000 0.000 ydiff.py:203(<listcomp>)
310 0.000 0.000 0.000 0.000 ydiff.py:370(<lambda>)
1 0.000 0.000 26.615 26.615 ydiff.py:668(main)
1 0.000 0.000 26.623 26.623 ydiff.py:1(<module>)
1 0.000 0.000 0.001 0.001 difflib.py:1(<module>)
49 0.000 0.000 0.000 0.000 difflib.py:879(_plain_replace)
62 0.000 0.000 0.000 0.000 difflib.py:874(_dump)
1 0.000 0.000 26.615 26.615 ydiff.py:652(entry_wrapper)
16 0.000 0.000 0.000 0.000 ydiff.py:54(<genexpr>)
1 0.000 0.000 0.000 0.000 difflib.py:44(SequenceMatcher)
1 0.000 0.000 0.000 0.000 ydiff.py:359(__init__)
1 0.000 0.000 0.001 0.001 ydiff.py:182(mdiff)
1 0.000 0.000 0.000 0.000 ydiff.py:233(parse_hunk_meta)
1 0.000 0.000 0.000 0.000 ydiff.py:681(_process_args)
1 0.000 0.000 0.000 0.000 difflib.py:1666(HtmlDiff)
1 0.000 0.000 0.000 0.000 ydiff.py:357(DiffMarker)
1 0.000 0.000 0.000 0.000 ydiff.py:211(UnifiedDiff)
1 0.000 0.000 0.000 0.000 difflib.py:1303(ndiff)
1 0.000 0.000 0.000 0.000 ydiff.py:199(_get_old_text)
1 0.000 0.000 0.000 0.000 ydiff.py:167(Hunk)
1 0.000 0.000 0.000 0.000 difflib.py:724(Differ)
1 0.000 0.000 0.000 0.000 ydiff.py:35(Color)
1 0.000 0.000 0.000 0.000 ydiff.py:202(_get_new_text)
1 0.000 0.000 0.000 0.000 ydiff.py:169(__init__)
1 0.000 0.000 0.000 0.000 ydiff.py:369(<lambda>)
2 0.000 0.000 0.000 0.000 ydiff.py:213(__init__)
1 0.000 0.000 0.000 0.000 difflib.py:810(__init__)
3 0.000 0.000 0.000 0.000 ydiff.py:264(is_common)
1 0.000 0.000 0.000 0.000 ydiff.py:651(trap_interrupts)
1 0.000 0.000 0.000 0.000 ydiff.py:366(<lambda>)
1 0.000 0.000 0.000 0.000 ydiff.py:673(PassThroughOptionParser)
1 0.000 0.000 0.000 0.000 ydiff.py:279(DiffParser)
1 0.000 0.000 0.000 0.000 ydiff.py:367(<lambda>)
1 0.000 0.000 0.000 0.000 ydiff.py:281(__init__)
1 0.000 0.000 0.000 0.000 ydiff.py:732(<listcomp>)
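A listing in this format can be produced with the standard cProfile and pstats modules. The sketch below is not the project's tests/profile.sh (its contents are not shown here). It profiles `difflib.ndiff` on made-up input and prints the entries sorted by internal time, restricted to difflib; `profile_ndiff` is a hypothetical helper name.

```python
import cProfile
import difflib
import io
import pstats


def profile_ndiff(n: int = 60) -> str:
    """Profile difflib.ndiff on two synthetic n-line blocks, return the report."""
    old = ["line %05d value=%d\n" % (i, i * 3 % 97) for i in range(n)]
    new = ["line %05d value=%d\n" % (i, i * 5 % 89) for i in range(n)]

    profiler = cProfile.Profile()
    profiler.enable()
    for _ in difflib.ndiff(old, new):
        pass  # iterate inside the profiled region so the lazy work is recorded
    profiler.disable()

    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # "tottime" is the "internal time" ordering; the string argument is a
    # regular expression that restricts the listing, as in the output above.
    stats.sort_stats("tottime").print_stats("difflib")
    return out.getvalue()


if __name__ == "__main__":
    print(profile_ndiff())
```

Using `"ydiff|difflib"` as the restriction while profiling a real ydiff run should give the same kind of table as above.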