Hang on large files

Open ggarcia-ee opened this issue 2 years ago • 4 comments

Nice work, I have been using this tool for a while and it works great. However, it seems to hang when displaying differences between large files. For example:

diff -u largelog1.log largelog2.log | ydiff  # Hangs

Displaying stored differences also hangs:

diff -u largelog1.log largelog2.log > largediff.dif
cat largediff.dif | ydiff  # Hangs

Note:

cat largediff.dif | less  # Does not hang

ggarcia-ee avatar Sep 01 '22 17:09 ggarcia-ee

Hi!

Would you mind telling me:

  • Your OS info
  • Python version
  • Do you have any options enabled via environment variable? (env | grep DIFF_OPTIONS)
  • How large is the largediff.dif file?
  • Does the diff file contain very long lines?

It might be caused by poor performance of Python's built-in difflib. I will take a look as soon as possible.

Thank you.

ymattw avatar Sep 01 '22 19:09 ymattw

Hi,

  • OS info: CentOS 7
  • Python version: 3.9.2
  • Do you have any options enabled via environment variable?: No
  • How large is the largediff.dif file?: 2MB and ~50k lines
  • Does the diff file contain very long lines?: 500 chars max width

ggarcia-ee avatar Sep 01 '22 19:09 ggarcia-ee

Thank you. A 2 MB diff with lines of at most 500 characters should not cause any performance issue. The code does not load the whole file into memory before handling it; internally it mostly reads line by line and yields a chunk when ready, so a large file should not be a problem at all.

I originally thought it might be a regression in difflib in Python 3, but I reproduced the problem with a 5 MB diff file on both Python 2.7 and 3.10. Interestingly, the problem seems to be in mdiff(), so it is still in the Python library.
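
A minimal sketch for reproducing the slowdown outside ydiff, assuming that mdiff() ends up in difflib's private, undocumented _mdiff() helper. The make_hunk() and time_mdiff() names and the synthetic data are made up for illustration: every line differs slightly, so the whole input forms one large replace block. Timings depend on the machine and the Python version, so none are shown.

```python
import difflib
import random
import time


def make_hunk(n_lines, seed=0):
    """Build two synthetic line lists in which every line differs slightly."""
    rng = random.Random(seed)
    old = ["%06d %s\n" % (i, "".join(rng.choice("abcdef ") for _ in range(60)))
           for i in range(n_lines)]
    # Overwrite three characters per line so no line is identical on both sides.
    new = [line[:10] + "XYZ" + line[13:] for line in old]
    return old, new


def time_mdiff(n_lines):
    """Return (number of line pairs produced, seconds spent in _mdiff)."""
    old, new = make_hunk(n_lines)
    start = time.perf_counter()
    # difflib._mdiff is a private helper; it is used here only for measurement.
    pairs = list(difflib._mdiff(old, new))
    return len(pairs), time.perf_counter() - start


if __name__ == "__main__":
    # Sizes are kept small on purpose: the cost grows much faster than linearly.
    for n in (20, 40, 80):
        count, elapsed = time_mdiff(n)
        print("%3d lines -> %d pairs in %.2fs" % (n, count, elapsed))
```

If the elapsed time grows much faster than the line count, the time is being spent inside difflib rather than in reading the file.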

I do not have enough time to troubleshoot soon but I will try my best...

For your use case, maybe use vimdiff largelog1.log largelog2.log instead for now.

ymattw avatar Sep 02 '22 16:09 ymattw

@ggarcia-ee Could you confirm that your test diff contains a large diff hunk? Look for the @@ metadata. The example below is a comparably large hunk: it means that 1380 lines starting at line 2 of the old file differ from 1387 lines starting at line 6 of the new file.

@@ -2,1380 +6,1387 @@

I have proof that Python's difflib performs poorly on large hunks. I just wanted to confirm this is also the case on your side.
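
One way to check is to scan the diff for its @@ headers and list the largest hunks. The sketch below is not part of ydiff; the function names and the script name in the usage comment are hypothetical. It relies only on the standard unified diff header form, in which a missing count means one line, and the counts include context lines.

```python
import re
import sys

# Unified diff hunk header: @@ -old_start[,old_count] +new_start[,new_count] @@
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_header(line):
    """Return (old_start, old_count, new_start, new_count), or None."""
    m = HUNK_RE.match(line)
    if m is None:
        return None
    old_start, old_count, new_start, new_count = m.groups()
    # An omitted count means the hunk covers exactly one line on that side.
    return (int(old_start), int(old_count or 1),
            int(new_start), int(new_count or 1))


def largest_hunks(lines, top=5):
    """Return up to `top` hunks ranked by old_count + new_count, largest first."""
    hunks = [h for h in map(parse_hunk_header, lines) if h is not None]
    return sorted(hunks, key=lambda h: h[1] + h[3], reverse=True)[:top]


if __name__ == "__main__":
    # Usage (script name is hypothetical): python largest_hunks.py < largediff.dif
    for old_start, old_count, new_start, new_count in largest_hunks(sys.stdin):
        print("@@ -%d,%d +%d,%d @@" % (old_start, old_count, new_start, new_count))
```

Hunks with counts in the thousands are the ones to look at.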

If you leave the ydiff.py command running, it should eventually finish. Try using time to measure it.

ymattw avatar Sep 03 '22 18:09 ymattw

Here is the proof that Python's difflib performs very poorly on a set of large hunks.

⮕ % make profile-difflib
tests/profile.sh tests/large-hunk/tao.diff
Wed Jun 19 21:17:55 2024    stats.3624842.tmp

         63246210 function calls (63145498 primitive calls) in 26.623 seconds

   Ordered by: internal time
   List reduced from 506 to 79 due to restriction <'ydiff|difflib'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   491819    8.660    0.000   11.380    0.000 difflib.py:622(quick_ratio)
  5166013    5.031    0.000    8.228    0.000 difflib.py:651(real_quick_ratio)
53538/3916    3.679    0.000   26.256    0.007 difflib.py:893(_fancy_replace)
  5168599    1.583    0.000    1.583    0.000 difflib.py:196(set_seq1)
  5663054    1.211    0.000    1.211    0.000 difflib.py:39(_calculate_ratio)
    20477    0.470    0.000    0.652    0.000 difflib.py:266(__chain_b)
    21404    0.410    0.000    0.502    0.000 difflib.py:305(find_longest_match)
     2464    0.094    0.000    0.159    0.000 ydiff.py:433(_fit_with_marker_mix)
     2787    0.071    0.000    0.090    0.000 ydiff.py:106(strsplit)
     6435    0.071    0.000    0.600    0.000 difflib.py:421(get_matching_blocks)
   283636    0.039    0.000    0.039    0.000 difflib.py:1061(IS_CHARACTER_JUNK)
    85178    0.034    0.000    0.041    0.000 difflib.py:717(<genexpr>)
    21089    0.021    0.000    0.674    0.000 difflib.py:222(set_seq2)
50978/3914    0.017    0.000   17.518    0.004 difflib.py:987(_fancy_helper)
     1391    0.016    0.000   26.581    0.019 ydiff.py:416(_markup_side_by_side)
     5222    0.015    0.000    0.493    0.000 difflib.py:597(ratio)
     1408    0.011    0.000   26.296    0.019 difflib.py:1438(_line_iterator)
     2774    0.009    0.000    0.017    0.000 difflib.py:1382(_make_line)
        2    0.008    0.004    0.026    0.013 ydiff.py:284(parse)
     1213    0.005    0.000    0.140    0.000 difflib.py:492(get_opcodes)
     1388    0.005    0.000   26.304    0.019 difflib.py:1526(_line_pair_iterator)
    21663    0.005    0.000    0.005    0.000 difflib.py:619(<genexpr>)
     2774    0.004    0.000    0.094    0.000 ydiff.py:149(strtrim)
     4935    0.004    0.000    0.062    0.000 difflib.py:999(_qformat)
     2774    0.003    0.000    0.006    0.000 ydiff.py:419(_normalize)
     2424    0.003    0.000    0.057    0.000 difflib.py:715(_keep_original_ws)
     2586    0.002    0.000    0.040    0.000 difflib.py:184(set_seqs)
     1334    0.002    0.000    0.003    0.000 difflib.py:1415(record_sub_info)
        1    0.002    0.002   26.613   26.613 ydiff.py:590(markup_to_pager)
     2771    0.002    0.000    0.006    0.000 ydiff.py:254(is_old)
     2774    0.002    0.000    0.003    0.000 ydiff.py:626(decode)
     1374    0.002    0.000    0.011    0.000 difflib.py:120(__init__)
     3919    0.001    0.000   26.261    0.007 difflib.py:833(compare)
     2772    0.001    0.000    0.002    0.000 ydiff.py:225(is_hunk_meta)
     1408    0.001    0.000    0.001    0.000 difflib.py:1460(<listcomp>)
     4158    0.001    0.000    0.002    0.000 ydiff.py:219(is_old_path)
     4157    0.001    0.000    0.002    0.000 ydiff.py:222(is_new_path)
     2771    0.001    0.000    0.001    0.000 ydiff.py:176(append)
     2771    0.001    0.000    0.001    0.000 ydiff.py:251(parse_hunk_line)
     1387    0.001    0.000    0.002    0.000 ydiff.py:261(is_new)
     1391    0.001    0.000   26.582    0.019 ydiff.py:374(markup)
     1702    0.001    0.000    0.001    0.000 ydiff.py:102(colorize)
     1388    0.001    0.000   26.305    0.019 difflib.py:1340(_mdiff)
        1    0.000    0.000    0.000    0.000 ydiff.py:200(<listcomp>)
        1    0.000    0.000    0.000    0.000 ydiff.py:203(<listcomp>)
      310    0.000    0.000    0.000    0.000 ydiff.py:370(<lambda>)
        1    0.000    0.000   26.615   26.615 ydiff.py:668(main)
        1    0.000    0.000   26.623   26.623 ydiff.py:1(<module>)
        1    0.000    0.000    0.001    0.001 difflib.py:1(<module>)
       49    0.000    0.000    0.000    0.000 difflib.py:879(_plain_replace)
       62    0.000    0.000    0.000    0.000 difflib.py:874(_dump)
        1    0.000    0.000   26.615   26.615 ydiff.py:652(entry_wrapper)
       16    0.000    0.000    0.000    0.000 ydiff.py:54(<genexpr>)
        1    0.000    0.000    0.000    0.000 difflib.py:44(SequenceMatcher)
        1    0.000    0.000    0.000    0.000 ydiff.py:359(__init__)
        1    0.000    0.000    0.001    0.001 ydiff.py:182(mdiff)
        1    0.000    0.000    0.000    0.000 ydiff.py:233(parse_hunk_meta)
        1    0.000    0.000    0.000    0.000 ydiff.py:681(_process_args)
        1    0.000    0.000    0.000    0.000 difflib.py:1666(HtmlDiff)
        1    0.000    0.000    0.000    0.000 ydiff.py:357(DiffMarker)
        1    0.000    0.000    0.000    0.000 ydiff.py:211(UnifiedDiff)
        1    0.000    0.000    0.000    0.000 difflib.py:1303(ndiff)
        1    0.000    0.000    0.000    0.000 ydiff.py:199(_get_old_text)
        1    0.000    0.000    0.000    0.000 ydiff.py:167(Hunk)
        1    0.000    0.000    0.000    0.000 difflib.py:724(Differ)
        1    0.000    0.000    0.000    0.000 ydiff.py:35(Color)
        1    0.000    0.000    0.000    0.000 ydiff.py:202(_get_new_text)
        1    0.000    0.000    0.000    0.000 ydiff.py:169(__init__)
        1    0.000    0.000    0.000    0.000 ydiff.py:369(<lambda>)
        2    0.000    0.000    0.000    0.000 ydiff.py:213(__init__)
        1    0.000    0.000    0.000    0.000 difflib.py:810(__init__)
        3    0.000    0.000    0.000    0.000 ydiff.py:264(is_common)
        1    0.000    0.000    0.000    0.000 ydiff.py:651(trap_interrupts)
        1    0.000    0.000    0.000    0.000 ydiff.py:366(<lambda>)
        1    0.000    0.000    0.000    0.000 ydiff.py:673(PassThroughOptionParser)
        1    0.000    0.000    0.000    0.000 ydiff.py:279(DiffParser)
        1    0.000    0.000    0.000    0.000 ydiff.py:367(<lambda>)
        1    0.000    0.000    0.000    0.000 ydiff.py:281(__init__)
        1    0.000    0.000    0.000    0.000 ydiff.py:732(<listcomp>)
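
A report in the same layout can be produced with the standard cProfile and pstats modules. This is a rough sketch only, not the project's tests/profile.sh: the profile_mdiff() helper and the sample lines are invented for illustration, and it profiles difflib's private _mdiff() directly instead of running ydiff.

```python
import cProfile
import difflib
import io
import pstats


def profile_mdiff(old_lines, new_lines, restriction="difflib"):
    """Profile difflib._mdiff and return the pstats report as a string."""
    profiler = cProfile.Profile()
    profiler.enable()
    # Consume the generator so that the comparison work actually runs.
    list(difflib._mdiff(old_lines, new_lines))
    profiler.disable()

    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # "tottime" sorts by internal time; the restriction is a regular
    # expression matched against the filename:lineno(function) column.
    stats.sort_stats("tottime").print_stats(restriction)
    return out.getvalue()


if __name__ == "__main__":
    # Made-up input: every line changes, so the input is one replace block.
    old = ["line %03d: the quick brown fox\n" % i for i in range(40)]
    new = ["line %03d: the quick brown cat\n" % i for i in range(40)]
    print(profile_mdiff(old, new))
```

Passing "ydiff|difflib" as the restriction keeps the same rows as the listing above when the profiled code goes through ydiff.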

ymattw avatar Jun 19 '24 19:06 ymattw