cli icon indicating copy to clipboard operation
cli copied to clipboard

Util::normalizeLineEndings breaks UTF-8

Open worried-networking opened this issue 4 years ago • 6 comments

Hi!

First of all, thank you for all the libraries you've built, they're magnificent! :)

I've encountered an error when working with \SebastianFeldmann\Git\Command\Diff\Compare\FullDiffList: it goes crazy when the diff has binary/UTF-8 data. After some digging I found out the reason: it's all because of \SebastianFeldmann\Cli\Output\Util::normalizeLineEndings. For example, the following text will be processed incorrectly: text: хо — the last two symbols are Cyrillic letters, but here's the result of the method:

php > var_dump(\SebastianFeldmann\Cli\Output\Util::normalizeLineEndings('text: хо'));
string(10) "text: �
о"

It could be enough to replace the regex pattern with ~(BSR_ANYCRLF)*\R~u — at least it fixes my case, but I'm not sure about the possible side effects.

worried-networking avatar Dec 15 '21 16:12 worried-networking

Oh you are absolutely right the u modifier should be there. If I'm not mistaken this should not have any side effects. I'll release a new version in a couple of minutes

sebastianfeldmann avatar Dec 20 '21 13:12 sebastianfeldmann

Awesome, thanks! I think by the side effects I mean how it will modify the behaviour of the code, and whether it should be treated as a breaking change or not.

worried-networking avatar Dec 20 '21 14:12 worried-networking

And as always, it seems simple but it isn't especially with encoding. Of course adding the u breaks some other tests.

I'll investigate and see what I can find ;)

sebastianfeldmann avatar Dec 20 '21 14:12 sebastianfeldmann

Just in case — I had the same problem not only with Cyrillic, but with binary data as well. The first time I noticed the problem was when I run the FullDiffList over a vendored captainhook — the diff went crazy over this file: https://github.com/captainhookphp/captainhook/blob/main/tools/phive. But that time I didn't really paid attention, because, well, it was binary data :)

worried-networking avatar Dec 20 '21 15:12 worried-networking

The unicode/utf8 stuff proves to be a pain in the *** again and again. It seems PHP interprets some cyrillic letters as line breaks if you don't specify the u modifier to tell PHP that it receives an UTF-8 string. But using this modifier all the time breaks ASCII and ISO handling.

I added a dirty hack to detect if I have to use the modifier or not. For now that's fine but I have to investigate that further to come up with a more sustainable solution.

Never the less I released a new version 3.4.1 with the fix. composer update should fix the issue :)

If you need me to compile another PHAR for CaptainHook including the fix just let me know.

sebastianfeldmann avatar Dec 20 '21 15:12 sebastianfeldmann

That's enough for me right now, thanks! I hope you will be able to find a better solution later.

worried-networking avatar Dec 20 '21 15:12 worried-networking