Match using normalized tokens, then apply the result with the original ones
Current behaviour
The original text is
Draft dated 19 November 2024
The modified text is
Draft dated 19 November 2024
John Doe comments 19 November 2024
The resulting opcodes
[('equal', 0, 4, 0, 4), ('insert', 4, 4, 4, 11), ('equal', 4, 5, 11, 12)]
This translates to the following redline
Draft dated 19 November <ins>2024 </ins>
<ins>John Doe comments 19 November </ins>2024
This is counter-intuitive: the new text was appended at the end, yet the redline shows the insertion in the middle.
Proposed solution
The problem arises because the two strings are tokenized as follows:
['Draft ', 'dated ', '19 ', 'November ', '2024']
['Draft ', 'dated ', '19 ', 'November ', '2024 ', '¶ ', 'John ', 'Doe ', 'comments ', '19 ', 'November ', '2024']
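Because 2024<space> is not equal to 2024, the final token of the original cannot be paired with the first 2024 in the modified text. It is paired with the last 2024 instead, so everything in between is reported as an insertion.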
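The mismatch can be reproduced outside the library. The following is a minimal sketch, assuming the opcodes are produced by Python's `difflib.SequenceMatcher` (the tuple format above suggests this, but it is an assumption here); the token lists are copied from above.

```python
from difflib import SequenceMatcher

# Token lists exactly as shown above (trailing spaces preserved).
original = ['Draft ', 'dated ', '19 ', 'November ', '2024']
modified = ['Draft ', 'dated ', '19 ', 'November ', '2024 ', '¶ ',
            'John ', 'Doe ', 'comments ', '19 ', 'November ', '2024']

# The original's '2024' only equals the LAST token of `modified`,
# because the earlier occurrence carries a trailing space.
matcher = SequenceMatcher(None, original, modified)
opcodes = [tuple(op) for op in matcher.get_opcodes()]
print(opcodes)
```

Under that assumption, the printed opcodes should be the ones listed under "Current behaviour".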
This PR generates the opcodes from normalized tokens (without spaces) and then applies them to the original tokens for all further processing.
The result is a more intuitive redline, although I'm not sure what value the empty <ins></ins> adds here:
Draft dated 19 November 2024<ins></ins>
<ins>John Doe comments 19 November 2024</ins>
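The idea can be sketched as follows. This is an illustration rather than the code in this PR: it assumes `difflib.SequenceMatcher` is the matcher and that normalization is a per-token `strip()`, so the token count and order are unchanged and the opcode indices remain valid for the original lists. `normalized_opcodes` and `render` are hypothetical helper names.

```python
from difflib import SequenceMatcher

def normalized_opcodes(original_tokens, modified_tokens):
    """Hypothetical helper: compute opcodes on whitespace-stripped tokens."""
    a = [t.strip() for t in original_tokens]
    b = [t.strip() for t in modified_tokens]
    return [tuple(op) for op in SequenceMatcher(None, a, b).get_opcodes()]

def render(original_tokens, modified_tokens, opcodes):
    """Hypothetical renderer: apply the opcodes to the ORIGINAL tokens."""
    out = []
    for tag, i1, i2, j1, j2 in opcodes:
        if tag == 'equal':
            # Design choice in this sketch: equal runs are taken from the original side.
            out.append(''.join(original_tokens[i1:i2]))
            continue
        if tag in ('delete', 'replace'):
            out.append('<del>' + ''.join(original_tokens[i1:i2]) + '</del>')
        if tag in ('insert', 'replace'):
            out.append('<ins>' + ''.join(modified_tokens[j1:j2]) + '</ins>')
    return ''.join(out)

original = ['Draft ', 'dated ', '19 ', 'November ', '2024']
modified = ['Draft ', 'dated ', '19 ', 'November ', '2024 ', '¶ ',
            'John ', 'Doe ', 'comments ', '19 ', 'November ', '2024']

opcodes = normalized_opcodes(original, modified)
redline = render(original, modified, opcodes)
print(opcodes)
print(redline)
```

With normalized tokens, the whole original line matches the first five tokens of the modified text, leaving a single insertion at the end. The sketch makes no attempt to handle the ¶ paragraph marker, so its output differs in that detail from the redline above.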
Thanks for this! Let me take some time to read over it (looks good to merge though)
So what do you think? I'm using it with this patch now, and it seems to work.
@houfu good idea, accepted.
Let me check the actions!