diffr
Not encoding-aware, operates on individual bytes
As far as I can tell, diffr is not aware of encodings in any way and simply operates on, and compares, individual bytes. (At least I haven’t found any way to specify an encoding.) This leads to issues with multi-byte encodings such as UTF-8 (not to mention “wide” encodings like UTF-16 or UTF-32): when, for example, the byte sequences of a changed character share a common prefix, diffr splits the bytes belonging to a single encoded character and inserts different highlighting markup around its constituent bytes.
Here is an example:
1.txt contains the single character θ and 2.txt contains α. Running their diff through diffr produces the following raw output (note the � replacement characters where the bytes of a single character receive different highlighting):

[0m--- 1.txt 2023-08-04 13:01:35.764424632 -0400[0m
[0m+++ 2.txt 2023-08-04 13:01:42.592564739 -0400[0m
[0m@@ -1 +1 @@[0m
[0m[31m-[0m[0m[31m�[0m[0m[1m[38;2;255;255;255m[41m�[0m[0m[31m[0m
[0m[32m+[0m[0m[32m�[0m[0m[1m[38;2;255;255;255m[42m�[0m[0m[32m[0m
or, as displayed by less in my terminal:

[screenshot of the same output rendered by less]
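To make the failure mode concrete: θ (U+03B8) and α (U+03B1) each encode to two bytes in UTF-8 and share the same leading byte, so a byte-level comparison sees a one-byte change inside a two-byte character. The snippet below only illustrates the byte layout and does not use any of diffr’s code.

```rust
fn main() {
    let theta = "θ".as_bytes(); // UTF-8: [CE, B8]
    let alpha = "α".as_bytes(); // UTF-8: [CE, B1]

    println!("θ = {:02X?}", theta);
    println!("α = {:02X?}", alpha);

    // The leading byte is shared; only the continuation byte differs,
    // so a byte-oriented diff puts the highlight boundary mid-character
    // and the inserted escape sequences break the UTF-8 stream.
    assert_eq!(theta[0], alpha[0]);
    assert_ne!(theta[1], alpha[1]);
}
```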
That is correct, and a well-known issue.
Handling bytes is sufficient for my needs, since I don't often use Unicode in the codebases I work with. The rationale behind the original design is that the input can come in different encodings, and there is no way to know which one applies in which case.
An easy way to improve on the current state would be to introduce a mode that assumes the input is UTF-8 and processes it accordingly; I can provide some code pointers, and I don't think that it is a super hard code change, but I am unlikely to make the change myself.
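For illustration, here is a minimal sketch of the kind of helper such a hypothetical UTF-8 mode could rely on (the function name is a placeholder, not part of diffr): snap a proposed highlight offset back to the nearest character boundary so that a span never starts or ends inside an encoded character.

```rust
/// Hypothetical helper for a UTF-8 mode: move a byte offset backwards
/// until it lands on a UTF-8 character boundary.
fn snap_to_char_boundary(s: &str, mut idx: usize) -> usize {
    while idx > 0 && !s.is_char_boundary(idx) {
        idx -= 1;
    }
    idx
}

fn main() {
    let line = "αβγ"; // three 2-byte characters: bytes 0..2, 2..4, 4..6
    assert_eq!(snap_to_char_boundary(line, 3), 2); // offset 3 is mid-β
    assert_eq!(snap_to_char_boundary(line, 4), 4); // already a boundary
}
```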
@mookid I see, that’s fair enough of course! I didn’t see anything in the README or help message to suggest this, though, so in the short term, it might be nice to add a brief note mentioning it?
So far it hasn’t bothered me enough to do much about it either (beyond opening this issue), but maybe it will in the future :grinning:
I can provide some code pointers, and I don't think that it is a super hard code change, but I am unlikely to make the change myself.
@mookid I'd be willing to look into that :innocent:
Hi @bojidar-bg, thanks for volunteering to give the issue a try!
Here is a plan to add support for encoding-aware coloring:
- support only UTF-8 and assume the encoding is always this (more work would be needed to support alternatives, since there is no clear way to guess the encoding of the input)
- most changes should probably happen in the tokenize() function, which turns the byte string into a list of spans to color; it should be changed to avoid splitting tokens in the middle of a grapheme cluster (see the sketch after this list); I don't think that more changes are needed.
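A minimal sketch of that change, assuming the input is treated as UTF-8 and using the unicode-segmentation crate for grapheme clusters (the function name and span representation are placeholders, not diffr’s actual tokenize()):

```rust
// Assumed dependency in Cargo.toml: unicode-segmentation = "1"
use unicode_segmentation::UnicodeSegmentation;

/// Placeholder tokenizer: returns byte ranges of spans to color,
/// never splitting inside a grapheme cluster. A real tokenizer would
/// also group runs of word characters into larger tokens.
fn tokenize_utf8(line: &str) -> Vec<(usize, usize)> {
    line.grapheme_indices(true) // true = extended grapheme clusters
        .map(|(start, g)| (start, start + g.len()))
        .collect()
}

fn main() {
    // θ becomes one 2-byte span instead of two 1-byte ones.
    assert_eq!(tokenize_utf8("θ"), vec![(0, 2)]);
    // A combining accent stays attached to its base character.
    assert_eq!(tokenize_utf8("e\u{0301}x"), vec![(0, 3), (3, 4)]);
}
```

Working on &str rather than &[u8] also means the mode would have to decide what to do with invalid UTF-8, for example falling back to the current byte-level behaviour for that line.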
Please let me know if you need more help.
#100 works fine for my needs (Cyrillic): the broken code points in the output are gone and changed tokens are highlighted correctly.
Thanks @bojidar-bg.