heckel-diff icon indicating copy to clipboard operation
heckel-diff copied to clipboard

moves and replaces?

Open gastonmorixe opened this issue 8 years ago • 16 comments

Hi,

I love this algorithm but I think it is missing moves and replace/update operations.

You can check it here https://github.com/mcudich/HeckelDiff and here https://github.com/tapirdata/wson-diff

Any plan of adding them?

Thank you

gastonmorixe avatar Nov 15 '17 06:11 gastonmorixe

Ah, sorry. I wrote this to compensate for some bad edge cases with the normal levenshtein-type algorithm, so the main idea was just to replace that with something with better performance that wouldn't lock up a browser. (I also wanted to try out, but didn't really get to completing, an idea about addressing the in-between flaw with a suffix trie)

I don't really plan on doing much with it, since there are so many diff packages, but could be convinced -- is there a reason you want to use this one instead of the repos you linked?

myndzi avatar Nov 15 '17 06:11 myndzi

(I do suppose this is a bit of false advertising and, with a name like this, I should probably do something about that)

myndzi avatar Nov 15 '17 06:11 myndzi

hey @myndzi thanks for the fast reply. Well, I couldn't compile wson-diff and the other is in swift. I need it in js for react native. :) Would be awesome to have it.

gastonmorixe avatar Nov 15 '17 07:11 gastonmorixe

do you know any other implementation that has move and replaces in js? maybe other algorithm. Thanks

gastonmorixe avatar Nov 15 '17 07:11 gastonmorixe

Ah, I see! What kind of API would be most useful? I confess to not putting any thought into what the API would look like in such a case, so I'm not sure how I would go about extending the current interface. I linked a bunch of stuff that I found (none of which suited my particular need, but some of which might suit yours) at the bottom of the readme. I'm not sure I could turn this around quickly, but I can certainly look into it this weekend or something.

myndzi avatar Nov 15 '17 07:11 myndzi

(I'm also not quite sure what is meant by replaces... I just scanned the paper to confirm my recollection, but I think this technique only describes inserts, deletions, and moves. What defines a "replace"? Would that just be the combination of deletes and inserts?)

myndzi avatar Nov 15 '17 08:11 myndzi

cool. I love the swift version API

let o = [1, 2, 3, 3, 4]
let n = [2, 3, 1, 3, 4]
let result = diff(o, n)
// [.move(1, 0), .move(2, 1), .move(0, 2)]

let o = [0, 1, 2, 3, 4, 5, 6, 7, 8]
let n = [0, 2, 3, 4, 7, 6, 9, 5, 10]
let result = diff(o, n)
// [.delete(1), .delete(8), .move(7, 4), .insert(6), .move(5, 7), .insert(8)]

gastonmorixe avatar Nov 15 '17 08:11 gastonmorixe

So perhaps the current approach could be extended by adding a type 'move' with the old position (from), the new position (to), and the value? (The current package was designed for a React application also, it was helpful to return exactly the sequence of text data that I would be rendering rather than resolve the diff data against some other data -- it was intended to be rendered directly)

myndzi avatar Nov 15 '17 08:11 myndzi

I've been thinking about this a bit, and the goal has essentially been to take in a list of tokens and output a list of data in the order/sequence you would render them. (Essentially, given 'left' (old text) and 'right' (new text), output an annotated 'right')

Creating a patch (a list of operations to apply to the input) is probably going to be fundamentally incompatible with much of the current API. It's probably fairly simple, however, to mark on the output whether each token was new (hasn't been seen) or moved (has been seen somewhere else), and perhaps have a cleanup pass over this to remove redundant information, such as "moves" that don't affect the outcome. Would it be sufficient for what you need to mark tokens as "moved from" (for deletes) and "moved to" (for inserts)?

Additionally, for update/change, you can consider a sequence of deletes and inserts to be a change; this is fairly easy also to simplify

myndzi avatar Nov 19 '17 22:11 myndzi

I think it will. For example say you have [“a”, “b”] and [“b”, “a”] I just want to know that 0 moved to 1, not that any of those were inserted or deleted, so I can animate appropriately.

Thank you.

On Nov 19, 2017, at 7:43 PM, Kris Reeves [email protected] wrote:

I've been thinking about this a bit, and the goal has essentially been to take in a list of tokens and output a list of changes in the order/sequence you would render them. Creating a patch (a list of operations to apply to the input) is probably going to be fundamentally incompatible with much of the current API. It's probably fairly simple, however, to mark on the output whether each token was new (hasn't been seen) or moved (has been seen somewhere else), and perhaps have a cleanup pass over this to remove redundant information, such as "moves" that don't affect the outcome. Would it be sufficient for what you need to mark tokens as "moved from" (for deletes) and "moved to" (for inserts)?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

gastonmorixe avatar Nov 20 '17 00:11 gastonmorixe

In this case, it might look something like:

[ { type: 'ins', value: 'b', moveFrom: 2 },
  { type: 'same', value: 'a' },
  { type: 'del', value: 'b', moveTo: 1 } ]

I'm not really sure that suits your use-case in this situation :\

myndzi avatar Nov 20 '17 00:11 myndzi

It could if I can map reduce that to the changes types (add, delete, moves, eventually updates too). I am not sure I understand the output ‘same’ there. Why 3 results and not one move?

Sent from my iPhone

On Nov 19, 2017, at 9:48 PM, Kris Reeves [email protected] wrote:

In this case, it might look something like:

[ { type: 'ins', value: 'b', moveFrom: 2}, { type: 'same', value: 'a' }, { type: 'del', value: 'b', moveTo: 1 } ] I'm not really sure that suits your use-case in this situation :\

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

gastonmorixe avatar Nov 20 '17 02:11 gastonmorixe

As I mentioned, the goal with this package (for me) was to input a set of data (left, right ~= before, after) and output something that would directly describe what I wanted to render on screen (some sequence of changes which would essentially describe the output text and how to render/format it). This differs from outputting a patch, or a series of instructions to change left into right. This kind of influenced how I structured the code at a basic level, so implementing something like the Swift output becomes difficult -- that assumes you have retained a reference to the input array of tokens and will be operating on that in conjunction to the data returned by the diff. At some level, that is not too bad here, but it doesn't play nicely with the functionality here that is based around re-chunking the input ("hybrid" diff), nor the function that calculates the differences themselves.

As to 'move' specifically, it's a bit tricky to define that. It might look easy if you just say "why not a move?", but the question gets a bit complicated if you're trying to define the minimal set of changes. If you go from ["a", "b"] to ["b", "a"], what moved? Valid answers are "a moved, and b stayed the same", "b moved, and a stayed the same" or "they both moved". This gets uglier with more complicated diffs.

With the Heckel approach, you might say that "anything which uniquely maps from the left data to the right data can be considered to have moved", but in that case if you have ["a", "b"] -> ["a", "b"], they've both "moved". So then you get to "well, if it didn't change position then it didn't move". What, then, do you do with ["a", "b"] -> ["c", "a", "b"] ? Did "a" and "b" move, or did "c" get inserted and nothing else changed?

Choices I made in this code make this harder, and this should answer your question about what "same" is: the output of this package is "the set of data you might render to describe the differences between the two data sets". In other words, if you concatenate all the "same" and "del" items, you get the left data, and if you concatenate all the "same" and the "ins" items, you get the right data.

Moves get trickier, too, because the index value has multiple meanings: is the index where the token was located in the left data set? Is it the index where it was located in the right data set? Is it the index of where you would splice the token to in order to produce right from left? Each of these is different.

For what you are trying to do, which is apparently "animate the transformation from left to right", I don't think I can easily provide something worthwhile, since the assumptions I've made differ from what you need: what you need is a patch, or a list of instructions that transform left into right; what I've written this package to calculate is not this set of instructions, but rather the result of them. As a consequence, it's easy for me to annotate the data with "what was the location of the things I've calculated as uniquely equivalent", but hard for me to produce a patch without completely changing the purpose/intent of the package.

It is probably not hard to fork this repository and provide a different set of assumptions that serve your own needs, but I'm not certain that it's worthwhile to try and extend it to serve these two different purposes without a bit of an overhaul. I think what you need is basically a different processDiff function, one which does fewer and different things.

At this point in the code, we've got two arrays of tokens (left and right), each of which has references to the other indicating which tokens (or sequences of tokens) are uniquely paired. The output is then processed into some sane form which describes what those changes mean in terms of how the left changed to become the right. If it helps, I could expose this data directly, but you'd likely have more processing to perform than is ideal. I could provide another function which creates a minimal(ish?) patch from here, but I'm not confident it can be made to compose well with the other functionality, or what form it should take.

myndzi avatar Nov 20 '17 04:11 myndzi

Thanks for that detailed and long response. I appreciate it. Honestly I haven’t even read the Heckel paper, I just seen a talk where Instagram said they used it for their new ListView kind library and said it was really efficient O(N).

Because the way I am coding the animation engine for a ListView in React Native, I don’t think I really care a lot about who moved to where. What I really think I need is that the moves are more counted or marked as inserts or deletes. This is because React will just keep track of them by their key/id and my engine will calculate new positions after each data change. Therefore if something move in reality just means that this elements are not going to be unmounted and their position will be updated. I calculate all positions for each data change.

In case of [a,b] -> [c,a,b] I think I want to visually see c being inserted and a, b moving so I am ok with either only a “a” insert or a “a” insert and “a and a + b moves”. Again I don’t really care about moves, I just don’t want them marked neither as inserts or deletes.

So I think maybe what you said earlier, if you add “movedTo” key, I might filter them out from the “inserts and deletes changes list” and that might just work.

Thank you again

Sent from my iPhone

On Nov 20, 2017, at 1:05 AM, Kris Reeves [email protected] wrote:

As I mentioned, the goal with this package (for me) was to input a diff and output something that would directly describe what I wanted to render on screen (some sequence of changes which would essentially describe the output text and how to render/format it). This differs from outputting a patch, or a series of instructions to change left into right. This kind of influenced how I structured the code at a basic level, so implementing something like the Swift output becomes difficult -- that assumes you have retained a reference to the input array of tokens and will be operating on that in conjunction to the data returned by the diff. At some level, that is not too bad here, but it doesn't play nicely with the functionality here that is based around re-chunking the input ("hybrid" diff), nor the function that calculates the differences themselves.

As to 'move' specifically, it's a bit tricky to define that. It might look easy if you just say "why not a move?", but the question gets a bit complicated if you're trying to define the minimal set of changes. If you go from ["a", "b"] to ["b", "a"], what moved? Valid answers are "a moved, and b stayed the same", "b moved, and a stayed the same" or "they both moved". This gets uglier with more complicated diffs.

With the Heckel approach, you might say that "anything which uniquely maps from the left data to the right data can be considered to have moved", but in that case if you have ["a", "b"] -> ["a", "b"], they've both "moved". So then you get to "well, if it didn't change position then it didn't move". What, then, do you do with ["a", "b"] -> ["c", "a", "b"] ? Did "a" and "b" move, or did "c" get inserted and nothing else changed?

Choices I made in this code make this harder, and this should answer your question about what "same" is: the output of this package is "the set of data you might render to describe the differences between the two data sets". In other words, if you concatenate all the "same" and "del" items, you get the left data, and if you concatenate all the "same" and the "ins" items, you get the right data.

Moves get trickier, too, because the index value has multiple meanings: is the index where the token was located in the left data set? Is it the index where it was located in the right data set? Is it the index of where you would splice the token to in order to produce right from left? Each of these is different.

For what you are trying to do, which is apparently "animate the transformation from left to right", I don't think I can easily provide something worthwhile, since the assumptions I've made differ from what you need: what you need is a patch, or a list of instructions that transform left into right; what I've written this package to calculate is not this set of instructions, but rather the result of them. As a consequence, it's easy for me to annotate the data with "what was the location of the things I've calculated as uniquely equivalent", but hard for me to produce a patch without completely changing the purpose/intent of the package.

It is probably not hard to fork this repository and provide a different set of assumptions that serve your own needs, but I'm not certain that it's worthwhile to try and extend it to serve these two different purposes without a bit of an overhaul. I think what you need is basically a different processDiff function, one which does fewer and different things.

At this point in the code, we've got two arrays of tokens (left and right), each of which has references to the other indicating which tokens (or sequences of tokens) are uniquely paired. The output is then processed into some sane form which describes what those changes mean in terms of how the left changed to become the right. If it helps, I could expose this data directly, but you'd likely have more processing to perform than is ideal. I could provide another function which creates a minimal(ish?) patch from here, but I'm not confident it can be made to compose well with the other functionality, or what form it should take.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

gastonmorixe avatar Nov 20 '17 04:11 gastonmorixe

So what you really care about is distinguishing whether a token was present in both the data sets?

myndzi avatar Nov 20 '17 07:11 myndzi

That’s a better way of seeing it. Yeah. And that might be actually way easier to know.

gastonmorixe avatar Nov 20 '17 13:11 gastonmorixe