delta icon indicating copy to clipboard operation
delta copied to clipboard

🐛 High resource usage on long lines when line truncation is disabled

Open jpalus opened this issue 2 years ago • 14 comments

I've noticed the issue on my Raspberry Pi 2 which is resource constrained and therefore it's easier to spot performance issues. Not really sure if that's expected but a diff of single file with change on single line (12k long admittedly though) takes 23s with 100% CPU (@900MHz) and allocated more than 400MB of RAM.

Reproducer:

# populate test file with very long line key = [ 'value','value','value'.... ];
$ echo -n 'key = [' >> test; for i in `seq 1500`; do echo -n "'value'," >> test; done; echo '];' >> test
# change something on the very long line and save as test2
$ sed "s/];/'value'];/" test > test2
$ /usr/bin/time -v delta --max-line-length=0 test test2
...
        User time (seconds): 20.48
        System time (seconds): 1.71
        Percent of CPU this job got: 95%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:23.34
...
        Maximum resident set size (kbytes): 442616
...

On a much faster Pinebook Pro:

$ /usr/bin/time -v delta --max-line-length=0 test test2
...
        User time (seconds): 13.53
        System time (seconds): 0.81
        Percent of CPU this job got: 95%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:15.00
...
        Maximum resident set size (kbytes): 871824
...

version: 0.8.3

jpalus avatar Jul 14 '21 22:07 jpalus

diff-so-fancy appears to be way more efficient, with same input on Raspberry Pi 2:

..
        User time (seconds): 0.52
        System time (seconds): 0.06
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.59
...
        Maximum resident set size (kbytes): 7696

jpalus avatar Jul 24 '21 22:07 jpalus

Hi @jpalus, thanks for opening this. Syntax-highlighting long lines is slow (as documented in the warning accompanying --max-line-length=0), and delta accordingly truncates such lines by default. But seeing as your test file doesn't involve syntax highlighting I think you're right that the performance should be better. In your real use cases, when you want to view a 12k line without truncation, would you want syntax-highlighting?

--max-line-length <max-line-length>
    Truncate lines longer than this. To prevent any truncation, set to zero. Note that syntax-highlighting very
    long lines (e.g. minified .js) will be very slow if they are not truncated [default: 512]

dandavison avatar Jul 25 '21 08:07 dandavison

Nope no need for syntax highlighting in that case, actually I don't care about syntax highlighting in diffs all that much in general.

I'm aware of note regarding performance with unlimited line length but personally I prefer to have unlimited line lengths and slower diff performance however these numbers seem excessively slow. Though I just might be unaware of complexity involved here.

jpalus avatar Jul 25 '21 16:07 jpalus

In fact, it's not the syntax highlighter that is slow here; it's the within-line diff computation. (If you're interested in the details, this runs a dynamic programming algorithm similar to the standard Levenshtein / edit distance algorithm, which involves a square matrix of dimensions #tokens x #tokens. So, I think it's understandable that this is slow on very long lines.)

Delta doesn't currently have a way of disabling the within-line edit inference. There would certainly be ways forwards here -- alternative diff algorithms, or perhaps adjustments to the current algorithm to handle very long lines.

I've updated the help text to stop it saying that the problem is the syntax highlighter. https://github.com/dandavison/delta/pull/698 ensures that no syntax highlighting call at all will be made when none is needed (e.g. user did not request syntax or language unknown).

dandavison avatar Aug 22 '21 01:08 dandavison

diff-so-fancy has better performance on long lines because it uses the diff-highlight perl script, which has a deliberately simplistic within-line diff algorithm:

It will find the common prefix and suffix of two lines, and consider everything in the middle to be "different".

dandavison avatar Aug 22 '21 01:08 dandavison

Just an idea but perhaps as intermediate solution new option could be introduced with threshold for within-line diff computation? For lines with length lower than threshold delta would use exact computation while above either no within-line diff would be calculated at all or, if possible, the simplistic solution similar to diff-highlight could be implemented?

jpalus avatar Aug 24 '21 16:08 jpalus

Some other idea for an optimization though I not really sure whether it makes any sense or whether it is doable :). It comes from an observation that a lot of line modifications change a small portion of this line, leaving it mostly the same, so what about quickly finding common prefix and common suffix between two lines and running exact algorithm only against parts with common prefix/suffix stripped? Finding common prefix/suffix should be very quick so additional time spent on this should be negligible, however it should greatly improve common case. But that's only with an assumption that it's doable :).

jpalus avatar Dec 04 '21 09:12 jpalus

what about quickly finding common prefix and common suffix between two lines and running exact algorithm only against parts with common prefix/suffix stripped?

Hi @jpalus, that's an interesting idea. I wonder however whether your earlier suggestion would be just as good and easier to implement:

For lines with length lower than threshold delta would use exact computation while above [...] the simplistic solution similar to diff-highlight could be implemented?

Do you happen to have any interest in implementing it? We have an ARCHITECTURE.md doc now that gives an overview of the relevant codepaths and basically points to how/where a second diff algorithm would be added.

dandavison avatar Dec 04 '21 20:12 dandavison

@dandavison I'm still trying to navigate the new documentation site, but I can't figure out how to enable more than 512 chars for when delta is being used for git diff/git log. Appreciate the explanation for its n^2 runtime behavior. I think the default of 512 is pretty limiting especially when differences actually appear in the truncated zone.

I tried max-line-length = 2048 inside the [delta] section of my git config, but this did not have any effect.

unphased avatar Mar 13 '22 04:03 unphased

Hi @unphased, that's right:

[delta]
   max-line-length = 2048

I just tested and that does work, so something must be wrong with your config file. You could use delta --show-config to sanity check that entries from your git config are being picked up.

I'm still trying to navigate the new documentation site

delta -h and delta --help are guaranteed to document all options. The online manual doesn't document every option (except in that it contains a copy of the --help output). Have you tried the new short help available with delta -h?

I think the default of 512 is pretty limiting especially when differences actually appear in the truncated zone.

Are these long lines you're encountering minified css/js files?

dandavison avatar Mar 13 '22 16:03 dandavison

I think the default of 512 is pretty limiting especially when differences actually appear in the truncated zone.

Are these long lines you're encountering minified css/js files?

Nah. Plenty of situations in which we encounter things like this. markdown like 603b807 37dd3391 from this repo or various config stuff. Often you have config formats which are annoying to, or don't support multiline entries.

unphased avatar Mar 13 '22 18:03 unphased

@dandavison I figured out that the max-line-length config set in my gitconfig does work but only as long as side-by-side is not enabled. Is this a bug that you can confirm?

unphased avatar Mar 13 '22 22:03 unphased

Actually, it's not a bug, since side-by-side is working as documented: in side-by-side mode, you use wrap-max-lines to control how many wrapped lines you want before truncation.

--wrap-max-lines <N>
    How often a line should be wrapped if it does not fit.

    Zero means to never wrap. Any content which does not fit  after wrapping will be truncated. 
    A value of "unlimited" means a line will be wrapped as many times as required.

    [default: 2]

Internally, the side-by-side code sets the max line length to a high enough value to cover the requested number of wrapped lines. The thing that is potentially confusing is that increasing max-line-length doesn't automatically increase wrap-max-lines. But I think that max-line-length is something we're hoping to get rid of entirely once we've made the diff algorithm more efficient on long lines.

dandavison avatar Mar 14 '22 00:03 dandavison

Oh, perfect! Thanks for the explanation :)

unphased avatar Mar 14 '22 01:03 unphased