diffr
Not encoding-aware, operates on individual bytes
As far as I can tell, diffr is not aware of encodings in any way and simply operates on, and compares, individual bytes. (At least I haven’t found any way to specify an encoding.) This leads to issues with multi-byte encodings such as UTF-8 (not to mention “wide” encodings like UTF-16 or UTF-32): when, for example, the byte sequences of a changed character share a common prefix, diffr splits the bytes belonging to a single encoded character and inserts different highlighting markup around its constituent bytes.
Here is an example:
1.txt contains the single character θ and 2.txt contains α. Running their diff through diffr produces the following raw output (note the � replacement characters where the bytes of a single character receive different highlighting):

[0m--- 1.txt 2023-08-04 13:01:35.764424632 -0400[0m
[0m+++ 2.txt 2023-08-04 13:01:42.592564739 -0400[0m
[0m@@ -1 +1 @@[0m
[0m[31m-[0m[0m[31m�[0m[0m[1m[38;2;255;255;255m[41m�[0m[0m[31m[0m
[0m[32m+[0m[0m[32m�[0m[0m[1m[38;2;255;255;255m[42m�[0m[0m[32m[0m
or, as displayed by less in my terminal:

[screenshot of the same output rendered by less]
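To make the failure mode concrete: θ (U+03B8) and α (U+03B1) each encode to two bytes in UTF-8 and share the same leading byte, so a byte-level comparison sees a one-byte change inside a two-byte character. The snippet below only illustrates the byte layout and does not use any of diffr’s code.

```rust
fn main() {
    let theta = "θ".as_bytes(); // UTF-8: [CE, B8]
    let alpha = "α".as_bytes(); // UTF-8: [CE, B1]

    println!("θ = {:02X?}", theta);
    println!("α = {:02X?}", alpha);

    // The leading byte is shared; only the continuation byte differs,
    // so a byte-oriented diff puts the highlight boundary mid-character
    // and the inserted escape sequences break the UTF-8 stream.
    assert_eq!(theta[0], alpha[0]);
    assert_ne!(theta[1], alpha[1]);
}
```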
That is correct, and a well-known issue.
Handling bytes is sufficient for my needs, since I don't often use Unicode in the codebases I work with. The rationale behind the original design is that the input can come in different encodings, and there is no way to know which one applies in which case.
An easy way to improve on the current state would be to introduce a mode that assumes the input is UTF-8 and processes it accordingly; I can provide some code pointers, and I don't think that it is a super hard code change, but I am unlikely to make the change myself.
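For illustration, here is a minimal sketch of the kind of helper such a hypothetical UTF-8 mode could rely on (the function name is a placeholder, not part of diffr): snap a proposed highlight offset back to the nearest character boundary so that a span never starts or ends inside an encoded character.

```rust
/// Hypothetical helper for a UTF-8 mode: move a byte offset backwards
/// until it lands on a UTF-8 character boundary.
fn snap_to_char_boundary(s: &str, mut idx: usize) -> usize {
    while idx > 0 && !s.is_char_boundary(idx) {
        idx -= 1;
    }
    idx
}

fn main() {
    let line = "αβγ"; // three 2-byte characters: bytes 0..2, 2..4, 4..6
    assert_eq!(snap_to_char_boundary(line, 3), 2); // offset 3 is mid-β
    assert_eq!(snap_to_char_boundary(line, 4), 4); // already a boundary
}
```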
@mookid I see, that’s fair enough of course! I didn’t see anything in the README or help message to suggest this, though, so in the short term, it might be nice to add a brief note mentioning it?
So far it hasn’t bothered me enough to do much about it either (beyond opening this issue), but maybe it will in the future :grinning:
I can provide some code pointers, and I don't think that it is a super hard code change, but I am unlikely to make the change myself.
@mookid I'd be willing to look into that :innocent:
Hi @bojidar-bg, thanks for volunteering to give the issue a try!
Here is a plan to add support for encoding-aware coloring:
- support only UTF-8 and assume the encoding is always this (more work would be needed to support alternatives, since there is no clear way to guess the encoding of the input)
- most changes should probably happen in the tokenize() function, which turns the byte string into a list of spans to color; it should be changed to avoid splitting tokens in the middle of a grapheme cluster (see the sketch after this list); I don't think that more changes are needed.
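A minimal sketch of that change, assuming the input is treated as UTF-8 and using the unicode-segmentation crate for grapheme clusters (the function name and span representation are placeholders, not diffr’s actual tokenize()):

```rust
// Assumed dependency in Cargo.toml: unicode-segmentation = "1"
use unicode_segmentation::UnicodeSegmentation;

/// Placeholder tokenizer: returns byte ranges of spans to color,
/// never splitting inside a grapheme cluster. A real tokenizer would
/// also group runs of word characters into larger tokens.
fn tokenize_utf8(line: &str) -> Vec<(usize, usize)> {
    line.grapheme_indices(true) // true = extended grapheme clusters
        .map(|(start, g)| (start, start + g.len()))
        .collect()
}

fn main() {
    // θ becomes one 2-byte span instead of two 1-byte ones.
    assert_eq!(tokenize_utf8("θ"), vec![(0, 2)]);
    // A combining accent stays attached to its base character.
    assert_eq!(tokenize_utf8("e\u{0301}x"), vec![(0, 3), (3, 4)]);
}
```

Working on &str rather than &[u8] also means the mode would have to decide what to do with invalid UTF-8, for example falling back to the current byte-level behaviour for that line.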
Please let me know if you need more help.
#100 works fine for my needs (Cyrillic): the broken code points in the output are gone and changed tokens are highlighted correctly.
Thanks @bojidar-bg.