redlines Match using normalized tokens, then apply the result with the original ones

Current behaviour

The original text is

Draft dated 19 November 2024

The modified text is

Draft dated 19 November 2024
John Doe comments 19 November 2024

The resulting opcodes are

[('equal', 0, 4, 0, 4), ('insert', 4, 4, 4, 11), ('equal', 4, 5, 11, 12)]

This translates to the following redline

Draft dated 19 November <ins>2024 </ins>
<ins>John Doe comments 19 November </ins>2024

This is quite non-intuitive. The new line was added at the end of the text, yet the redline shows the insertion in the middle, just before the final 2024.

Proposed solution

The problem arises because the tokenizer splits the two strings as follows:

['Draft ', 'dated ', '19 ', 'November ', '2024']
['Draft ', 'dated ', '19 ', 'November ', '2024 ', '¶ ', 'John ', 'Doe ', 'comments ', '19 ', 'November ', '2024']

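Because 2024<space> (with a trailing space) does not match 2024, the matcher pairs the original 2024 with the last token of the modified text instead, and we are left with the strange matching behaviour.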
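For reference, the mismatch can be reproduced outside the library with a minimal sketch. It assumes the opcodes come from Python's standard difflib.SequenceMatcher run directly on the token lists shown above; the variable names are illustrative and not taken from the redlines code.

```python
from difflib import SequenceMatcher

# Token lists exactly as shown above (note the trailing spaces).
source_tokens = ['Draft ', 'dated ', '19 ', 'November ', '2024']
test_tokens = ['Draft ', 'dated ', '19 ', 'November ', '2024 ', '¶ ',
               'John ', 'Doe ', 'comments ', '19 ', 'November ', '2024']

# Match on the raw tokens: '2024' and '2024 ' are different strings,
# so the only '2024' the matcher can pair up is the very last token.
matcher = SequenceMatcher(None, source_tokens, test_tokens)
opcodes = matcher.get_opcodes()

print(opcodes)
# [('equal', 0, 4, 0, 4), ('insert', 4, 4, 4, 11), ('equal', 4, 5, 11, 12)]
```

The last opcode pairs source token 4 with test token 11, which is why the insertion lands before the final 2024.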

This PR ensures we use normalized tokens (without spaces) for opcode generation but then use the original tokens for further processing.
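A rough sketch of the idea, not the actual patch: normalize() and render() are hypothetical helpers written for this illustration, and the matching is assumed to go through difflib.SequenceMatcher. Opcodes are computed on whitespace-stripped copies of the tokens, and the same indices are then used to slice the original token lists.

```python
from difflib import SequenceMatcher

source_tokens = ['Draft ', 'dated ', '19 ', 'November ', '2024']
test_tokens = ['Draft ', 'dated ', '19 ', 'November ', '2024 ', '¶ ',
               'John ', 'Doe ', 'comments ', '19 ', 'November ', '2024']


def normalize(tokens):
    """Hypothetical helper: strip whitespace so '2024 ' compares equal to '2024'."""
    return [token.strip() for token in tokens]


def render(opcodes, source, test):
    """Hypothetical helper: apply opcodes to the ORIGINAL (unstripped) tokens."""
    parts = []
    for tag, i1, i2, j1, j2 in opcodes:
        if tag == 'equal':
            parts.append(''.join(source[i1:i2]))
        if tag in ('delete', 'replace'):
            parts.append('<del>' + ''.join(source[i1:i2]) + '</del>')
        if tag in ('insert', 'replace'):
            parts.append('<ins>' + ''.join(test[j1:j2]) + '</ins>')
    return ''.join(parts)


# Opcodes come from the normalized tokens; the indices remain valid for the
# original lists because normalization never adds or removes tokens.
matcher = SequenceMatcher(None, normalize(source_tokens), normalize(test_tokens))
opcodes = matcher.get_opcodes()
redline = render(opcodes, source_tokens, test_tokens)

print(opcodes)  # [('equal', 0, 5, 0, 5), ('insert', 5, 5, 5, 12)]
print(redline)  # Draft dated 19 November 2024<ins>¶ John Doe comments 19 November 2024</ins>
```

This sketch leaves the paragraph marker ¶ in the output as a literal character; the library handles paragraph markers in its own way, so its real output differs in that detail.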

The result is a more intuitive redline, although I'm not sure about the value of the empty <ins></ins> here:

Draft dated 19 November 2024<ins></ins>
<ins>John Doe comments 19 November 2024</ins>

Mar 10 '25 04:03 rickythefox

Thanks for this! Let me take some time to read over it (looks good to merge though)

Mar 10 '25 12:03 houfu

So what do you think? Using it with this patch now, seems to work.

May 06 '25 08:05 rickythefox

@houfu good idea, accepted.

May 19 '25 06:05 rickythefox

Let me check the actions!

May 19 '25 07:05 houfu