jsondiffpatch icon indicating copy to clipboard operation
jsondiffpatch copied to clipboard

String Difference Performance

Open jdscolam opened this issue 8 years ago • 13 comments

So I wanted to ask about the expected performance on string diffing. I love your library, but I have run into some really strange behavior and I am wondering if it is "normal."

I am running a performance test, and I have a set of seven randomly generated 50-word "Lorem Ipsum" texts on objects that only have a single property containing the text. I then create 10,000 different random "diffs" by picking a left and right side at random from my set of seven. Right now I am seeing performance of just over 400 diffs/second in a simple node.js mocha test. This result surprised me since even complex objects with arrays are performing at over 50,000 diffs/second. Is what should be expected, or am I missing something?

jdscolam avatar Feb 27 '16 20:02 jdscolam

string diffing is performed by an external lib Neil Fraser's diff-match-patch, it does a rather standard text diff, but char by char (instead of line by line likea typical git diff does). I'm not familiar with its implementation but I suppose it uses LCS, the same algorithm this lib does for arrays. that would mean diffing a string of N chars would be approximately as expensive as diffing an array of N items (if items in the array are atomic values). so, that might be very expensive compared to diffing some objects, it depends. I don't have a benchmark suite, there has been others asking for one and I agree it would be nice. I haven't seen myself in big need of one, because the main usage I considered involves diffing information editable by a user on a browser, which sets a limit on the json size. if you have any recommendations I would love to start thinking in a benchmark suite, for monitoring performance changes across versions.

benjamine avatar Feb 27 '16 22:02 benjamine

it might worth exploring making the text diff over words instead of chars (reducing diff granularity, but winning speed). you could achieve that by splitting the string into an array of words.

benjamine avatar Feb 27 '16 22:02 benjamine

Oooh, splitting it into words might work, or might be a nice option in jsondiffpatch's settings ;). Actually, the more I think about it though, you'd still have to deal with the array issues, which might kill all your speed gains.

As for benchmarks I'm just using the 'performance-now' package to determine the amount of time the single .diff line takes. A full-on benchmark suite would actually be super useful for my particular use-case, which is why I am manually writing out performance tests. I need to know what I can "get away with" per second before determining when to scale. Right now I'm just focused on diff, patch, and diff+patch performance.

jdscolam avatar Feb 28 '16 00:02 jdscolam

@benjamine So I have an update here. After some work to ensure that the strings given to the algorithm were more realistic (10k comparisons with texts that have somewhat minor differences), I'm running about 3.5k diffs per second on a high-end PC (my dev machine). I'm not sure if this is expected or not, but it certainly is an interesting number.

jdscolam avatar Mar 12 '16 21:03 jdscolam

thanks! looks good, I'll keep this open as a reminder for future reference.

benjamine avatar Mar 17 '16 13:03 benjamine

Hey @jdscolam interesting comments on string performance. I built a basic benchmarking suite to test jsondiffpatch against 2 other libs. This lib does really well (and has more features).

https://github.com/justsml/json-diff-performance

Currently I'm only testing the speed of diff'ing static JSON objs - and including the size of the outputs.

If you have a minute I'd love to see a PR to enhance the test suite. :wink:

@benjamine - Great job on a feature-rich AND performant library! Do you have any feedback for my benchmark lib? I thought I saw a comment a while ago about adding standardized benchmarks, but I couldn't find it now..

Thanks!

justsml avatar Mar 19 '16 21:03 justsml

@justsml Very neat! My only comment would be to add some meaningful string difference to the performance suite. I've added an issue to that effect. Maybe we can chat about how I did benchmarking on the strings to get it "close" to a real-world test.

jdscolam avatar Mar 24 '16 04:03 jdscolam

@justsml that's pretty neat, haven't seen it before, a fair comparison of multiple libs is probably the best for users. as @jdscolam I also have other test cases to suggest, I'll open it as an issue in that repo. I admit in the design of this lib I have always put mantainability and flexibility over performance (so that's good news!), I never put time at finding perf bottlenecks to improve, but I'd love to, as soon as those first 2 attributes aren't degraded. My initial thought has been that if you need to diff big graphs, very fast:

  • it's unlikely you need that on a browser / client app
  • if you're on a server, and your needs justify it, you might do better using a more CPU-performant language, like C, C#, Rust, even... Go?, which you can invoke from node (and any other stack) using native bindings. If you do that perhaps you can rely on asm.js/PNaCl to get that working on some browsers.
  • if lot of your time is spent on the text diffs, on the server, we can replace google diff_match_patch by another really fast native lib and add it to jsondiffpatch as a plugin (I forgot that's possible)

benjamine avatar Mar 27 '16 23:03 benjamine

@benjamine what's that really fast native lib? I'd love to know. Or even, being able to inject a compliant diff_match_patch into jsondiffpatch (i.e. a call to a service to do the text diff extremely fast, then returning the result to the lib). Thoughts?

jdscolam avatar Mar 29 '16 02:03 jdscolam

oh I don't know about one, I'm just assuming there must be a few. but the simplest test I can think of is piping the text thru the diff tool available on your OS (OSX or linux) using node.js child_process (not as fast as a native binding, but maybe good if strings are big)

benjamine avatar Mar 29 '16 02:03 benjamine

nice, here's a C++ port of the same Neil Fraser's lib https://github.com/QuentinFiard/diff_match_patch you should be able to call that from node using https://github.com/node-ffi/node-ffi but I have never done that.

benjamine avatar Mar 29 '16 03:03 benjamine

Hey @benjamine @jdscolam I just wanted to post an update over here.

I upgraded my benchmark lib: https://github.com/justsml/json-diff-performance Check the README, I added some screenshots of the new results. I'll try get more tests whipped up by next week.

Thanks again for the feedback!!!

justsml avatar Mar 29 '16 06:03 justsml

Neil Fraser's library is using the Myers diff algorithm, which has running time O(N*D) (where D is the size of the difference), rather than O(N^2) like the traditional dynamic programming approach.

So if you make many changes to a text, D will grow large, which makes the diff slow.

The unix diff utility operates on whole lines, rather than characters, which lowers D and N by a constant factor.

toomim avatar Dec 17 '17 01:12 toomim