diff-match-patch-c Still active?

Is this project still active?

May 07 '13 12:05 sebfischer83

Good question. I suppose no and yes, but probably more no than yes.

When I started this, my goal was to get it to a point where it could be a libxdiff replacement for the main project I work on (i.e. libgit2), but I never really got it to that point and it has become a lower priority on my todo list.

That being said, I'm still interested and if there was someone else also interested, I'd be happy to work together to add functionality and get it in a better state.

It looks like there is one fork where some further work has been done (https://github.com/ggreer/diff-match-patch-c) and I'd be happy to merge pull requests if @ggreer wants to submit them!

May 07 '13 12:05 arrbee

Hi,

thanks for your quick reply. I'm interested in helping a bit, but I have to say my C knowledge is not very extensive; I come more from the .NET and Objective-C world. If I can help a bit, though, I'd be glad to try some things.

May 07 '13 14:05 sebfischer83

Instead of using this library, I ended up embedding Lua into my C program and using the Lua version of diff-match-patch.

My changes to diff-match-patch-c were just minor things like fixing the build on Ubuntu 12.04. I didn't add any features or fix any tricky bugs.

May 07 '13 15:05 ggreer

@ggreer Thanks for your fast answer. The problem is that I need a plain C solution, so Lua isn't an option for me.

May 07 '13 15:05 sebfischer83

I wouldn't mind helping complete this library. Though my experience is mainly centered around Python and Java, I do know how to read C, and I figure that working on a project such as this would be a good learning experience.

I mainly need this library for integration into a Python/PyPy application. Currently I'm making heavy use of the pure Python version of diff-match-patch for a text collaboration server, and the overhead of the library leaves something to be desired: with CPython the library is too slow, and with PyPy it is too memory intensive.

I would greatly appreciate continued work on this project. If nothing else, could you please comment/explain:

Which parts of DMP are currently implemented, and which parts aren't.
What currently does and doesn't work.
What dmp_pool and the other secondary structures are/do.

Edit: Oh, and how Unicode-compatible is this library currently?

Oct 02 '13 20:10 Varriount

@Varriount Sure thing! I'm glad you're interested.

Of the diff-match-patch code, so far I've only implemented the basic string-to-string diff, and even for that I've omitted little bits here and there. The core Myers diff is implemented, as are many of the optimizations, but things like deadlines are not. None of the match and patch code is implemented at all.

Based on the very limited tests that I've written, I think the basic diff works. Nothing else is implemented so that's pretty simple. The dmp_options struct is largely copied from the upstream code base and almost none of the actual option values are hooked up.

The core diff code generates a dmp_diff object. Internally, that stores a singly linked list of dmp_node structures which represent the spans in the data (i.e. a range of shared bytes, a range of inserted bytes, a range of deleted bytes). The nodes are a little funky compared to a traditional linked list implementation because instead of storing a next pointer, they store the index of the next element. This is possible because the nodes are actually allocated as one big block of data instead of using individual allocations - the big block of data is the dmp_pool. It is mainly there to provide efficient allocation of the small dmp_node structures.

In retrospect, the dmp_pool stuff is probably not written the way I would implement it if I were starting now, but at this point, cleaning that up is not really the highest priority issue in this code. It does serve to keep the number of actual allocations quite low and the memory usage fairly efficient.
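To make the index-linked pool idea concrete, here is a minimal sketch. It is not the actual dmp_pool or dmp_node code: every name in it (sketch_node, sketch_pool, sketch_append and so on) and the field layout are assumptions made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical names throughout; NOT the library's real definitions. */
typedef struct {
    int      op;     /* -1 = delete, 0 = equal, 1 = insert */
    uint32_t start;  /* byte offset where the span begins */
    uint32_t len;    /* byte length of the span */
    int      next;   /* pool index of the next node, -1 ends the list */
} sketch_node;

typedef struct {
    sketch_node *nodes;  /* one contiguous block holding every node */
    int used;            /* slots handed out so far */
    int cap;             /* slots currently allocated */
} sketch_pool;

/* Hand out the next free slot, doubling the block when it is full.
 * Returns the new node's index, or -1 if the allocation fails. */
static int sketch_pool_alloc(sketch_pool *p)
{
    if (p->used == p->cap) {
        int cap = p->cap ? p->cap * 2 : 16;
        sketch_node *n = realloc(p->nodes, (size_t)cap * sizeof(*n));
        if (!n)
            return -1;
        p->nodes = n;
        p->cap = cap;
    }
    p->nodes[p->used].next = -1;
    return p->used++;
}

/* Append a span after node `tail` (pass -1 to start a new list).
 * Returns the index of the appended node, or -1 on failure. */
static int sketch_append(sketch_pool *p, int tail, int op,
                         uint32_t start, uint32_t len)
{
    int idx = sketch_pool_alloc(p);
    if (idx < 0)
        return -1;
    p->nodes[idx].op = op;
    p->nodes[idx].start = start;
    p->nodes[idx].len = len;
    if (tail >= 0) {
        assert(tail < p->used);
        p->nodes[tail].next = idx;
    }
    return idx;
}

/* Walk the list from `head`, summing the lengths of spans of kind `op`. */
static uint32_t sketch_total_len(const sketch_pool *p, int head, int op)
{
    uint32_t total = 0;
    for (int i = head; i >= 0; i = p->nodes[i].next)
        if (p->nodes[i].op == op)
            total += p->nodes[i].len;
    return total;
}

/* Release the whole list with a single free. */
static void sketch_pool_free(sketch_pool *p)
{
    free(p->nodes);
    p->nodes = NULL;
    p->used = p->cap = 0;
}
```

One reason to link by index rather than by pointer in a layout like this: when realloc moves the block, stored indices stay valid, whereas stored pointers into the old block would dangle.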

Regarding Unicode compatibility, the current version of this code diffs based on byte ranges. None of the upstream diff cleanups are implemented (such as converting the byte-range diffs to line-oriented diffs). I believe that handling Unicode should be done as a post byte-diff cleanup, realigning diff span boundaries to match Unicode character boundaries. That being said, I haven't really looked at that deeply, nor am I much of an expert on Unicode.
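As a rough illustration of what such a realignment step might look like, here is a small sketch that moves a byte offset back to the start of the character it falls inside. It assumes the input is valid UTF-8 (the encoding is an assumption; nothing above pins it down), and the function names are hypothetical, not part of this library. It only finds the boundary; deciding how to move the bytes between neighbouring spans is left out.

```c
#include <assert.h>
#include <stddef.h>

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
static int sketch_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: move `pos` backward to the nearest code point
 * start in buf[0..len). Offsets 0 and len always count as boundaries.
 * Assumes buf holds valid UTF-8. */
static size_t sketch_utf8_align_back(const char *buf, size_t len, size_t pos)
{
    assert(pos <= len);
    while (pos > 0 && pos < len &&
           sketch_is_continuation((unsigned char)buf[pos]))
        pos--;
    return pos;
}
```

A cleanup pass could call something like this on each span boundary produced by the byte diff, so that no span starts or ends in the middle of a multi-byte character.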

By the way, I may have a reason to pick this code back up again and move it forward some more. It certainly helps to know that there is still some interest.

Oct 02 '13 21:10 arrbee

@Varriount I'm still interested in this project too. Maybe there is a chance to move it forward.

Oct 03 '13 02:10 sebfischer83

If it helps, I have a Windows 8, 64-bit dev machine, with both Visual Studio and Mingw64 installed.

I'm currently trying to get Visual Studio to compile the source into a DLL. (Surprisingly, Mingw64's gcc compiled it without a hitch; usually it's the other way around.)

Oct 03 '13 03:10 Varriount

I did a little bit of tweaking to the project organization - hopefully it won't mess you up too much.

Right now, the coding conventions are very similar to those of libgit2 because that's what I spend much of my time working on and it was just easy to stick with those. If you want to help, you may want to read the conventions for that project.

Oct 03 '13 06:10 arrbee