html-differ memory leaks

when i try it with > 100kb files, it almost always gets to this message:

FATAL ERROR: JS Allocation failed - process out of memory

it also runs long time, before showing this error (few minutes). Would it be possible to process diff with multiple child processes. Every child would process part of html, use less memory and we would utilize multiple CPUs (since all servers nowadays have few processors).

Or some other memory handling, to prevent leaks? Otherwise, great code, very usefull!

Sep 09 '14 22:09 phrazer

Actually, I've tried it with > 100kb files and it works.

But I suppose that your problem concerns the recursive algorithm which html differ uses. The recursion depth is too big, so the tool fails.

In the version 1.0.0 I do not use any recursion, so I think that html-differ will work much faster and do not break at your case.

It would be great if you give me an example of a file which now fails.

Sep 10 '14 04:09 eGavr

Or you can even help me and test the prototype of version 1.0.0 )

Checkout the branch support/1.0.0 (be careful! I've broken the API) and test it on your files.

Sep 10 '14 04:09 eGavr

I tried to install version 1.0.0 but it always says its running 0.5 when you check with -v (version)

Also it runs long and then dies. What is the procedure to install version 1.0.0. correctly? I downloaded branch support/1.0.0 and then went to that folder and executed "npm install ."

I can send you files, can you give me your email, or something to comunicate? Best,

Sep 10 '14 08:09 phrazer

Yep, version 1.0.0 is in developing mode.

You should clone the branch support/1.0.0 from github, go to the cloned folder and run

npm install

bin/html-differ --help

my email: [email protected]

Sep 10 '14 08:09 eGavr

Hmm strange, it still doesnt work. And you email doesnt work either, it says no such user. best,

Sep 10 '14 09:09 phrazer

Send here. I'll think about this problem.

[email protected]

Sep 10 '14 09:09 eGavr

I sended. I think it gots problems with bigger htmls because like you said recursive algorithm. But with 1.0.0 it doesnt seem to be any diffrent. Maybe we could split html to few slices, and process them separatly.

Sep 10 '14 09:09 phrazer

Thank you! I'll figure it out as soon as possible!

Sep 10 '14 10:09 eGavr

OK, thanks.

If I can help in anyway, please tell me. Best regards,

Sep 10 '14 13:09 phrazer

I've understood why this happens.

It is so many diffs in the given HTMLs, so the work of the module jsdiff (I use it as dependence in html-differ) becomes really slow! Yes! It works, but the process of the comparison is so long that the Operating System thinks that the program fails and stop it in order not to break the work of OS in general.

I've opened an issue in jsdiff == > issue34

P.S. If you run html-differ on files > 100kb and they will have not such many diffs - the tool will not fail. (even if you use version 0.5.0)

Sep 10 '14 13:09 eGavr

Hmmm,

The problem is that a lot of smaller files (>40+50 kb) also failes. If jsdiff adds maximum loop, we didnt solve the problem, because we dont show all diffs. The files I sended are real life webpages, so they should make problems.

Maybe the solution would be to split html to chunks and parallel process this chunks. Just an idea. Im more into php, so dont have very good knowledge about node.js.

Best regards,

Sep 10 '14 13:09 phrazer

The problem is not in sizes of HTMLs, but in the amount of diffs between them.

How can you split the HTML? What criteria to use? If you split it , you can break it and the comparison will be not fair.

The idea of the differ was to compare not absolutely different HTMLs.

Is it necessary to compare HTMLs if they are absolutely different and you can see the diffs without any tools?

Sep 10 '14 13:09 eGavr

As the variant. jsdiff adds maximum loop.

You see not all diffs, fix them or do something else what you want and run html-differ again.

Sep 10 '14 13:09 eGavr

I've found the solution! Seems, it is the best!

In current implementation of html-differ the module jsdiff compares texts by words (there are several mods: charDiff, wordDiff, lineDiff ...). If we call jsdiff in mode lineDiff, the amount of diffs becomes less, so jsDiff does not fails.

Solution:

This mode of jsdiff will be configurable in html-differ through an option, so you can choose how to compare HTMLs (by chars, byWords, byLines).

Does this variant suit you?

Sep 10 '14 14:09 eGavr

Hey,

Great :+1: Is the result still the same if you compare line by line instead of word by word? Because of classes, and other things you meantioned in README?

How can I apply this patch, what file should I change to enable this, so I can test some more files? :)

Sep 10 '14 14:09 phrazer

I am going to do this in version 1.0.0. There are some moments concerning this option which should be handled.

Sep 10 '14 16:09 eGavr

If I will compare texts by lines, the option ignoreWhitespaces will work in another way - it will ignore all whitespaces except \ns. If more than two \ns go successive, they will turn into one \n.

For example, if you set the option which compares files by lines, ignoreWhitespaces will work in the following way:

<span>


</span>

turns to

<span>
</span>

Does this variant suit you?

Sep 10 '14 18:09 eGavr

Yep, it should do it, as long as both are changed same way. Commit changes asap, and I will test with few diffrent htmls.

I had another idea, what if we first check line by line, and then if lines are the same go on, if not we also check word by word that line. Maybe then you get best from both worlds - speed and great diff tool. Just an idea, dont know if its possible to make it.

Sep 10 '14 19:09 phrazer

Hmmm...

Not bad, not bad...

I'll think about it and fix it ASAP!

Sep 10 '14 19:09 eGavr

Great,

Thanks for fast replay. I will prepare diffrent (sizes, tags, etc...) html examples files and try and debug.

Best regards,

Sep 10 '14 20:09 phrazer

I will do the implementation of a new option (my solution), it should solve your problem too.

And after that in minor version I 'll think about the enhancement that you have proposed, because I also need ASAP something working ) and after release of 1.0.0, I'will think about speed and quality enhancements.

Created an issue concerning your idea == > change the comparison algorithm

Sep 10 '14 20:09 eGavr

Yeah, I agree with you :+1: :+1:

Sep 10 '14 20:09 phrazer

I can not understand, why do you need differ in tasks when you can find diffs without any tools? I tried to implement the idea of comparison by lines, but there are still so many diffs that jsdiff fails.

Also see the comments about you ideas concerning the new algoritm of comparison ==> #95

Sep 11 '14 10:09 eGavr

Because I really like your tool, and it was working great. I dont understand, why you dont want it to work on bigger files too. Can you just add option to choose between lines, words as option? I tried to do it myself but could find what function is calling diff.js. Thanks,

Sep 11 '14 10:09 phrazer

It is not a problem to compare texts by lines!

I've implemented this option, but it will not help you.

Because as I've already said you that even in this mode! there are so many diffs that jsdiff fails.

When we compare by words, we receive more than 50000 diffs - it is too big for jsdiff. And even when we compare be line, we recieve more than 6000 diffs - it is very big result for jsdiff too.

Sep 11 '14 10:09 eGavr

Hmm, I see. In my tests no problem to 2000 - 3000, then it stops. Could we optimise something else?

Sep 11 '14 10:09 phrazer

It seems No! It is absolutely the task of jsdiff(

Sep 11 '14 10:09 eGavr

html-differ html-differ copied to clipboard

memory leaks

html-differ
html-differ copied to clipboard