useMavenLogger breaks nōn-ASCII characters (i.e. 99.9̅% of Unicode)

Open mirabilos opened this issue 5 years ago • 1 comments

I saw this in one of my scripts, but I reduced the script to just…

#!/bin/sh
echo mäh
exit 0

… for the reproduction of this.

$ mvn org.codehaus.mojo:exec-maven-plugin:exec@build-depsrcs@build-depsrcs -Dexec.useMavenLogger=false
[INFO] Scanning for projects...
[INFO]
[INFO] --------------------< org.evolvis.tartools:csvfile >--------------------
[INFO] Building org.evolvis.tartools:csvfile 3.0-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
[INFO]
[INFO] --- exec-maven-plugin:3.0.0:exec (build-depsrcs) @ csvfile ---
mäh
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  1.262 s
[INFO] Finished at: 2020-06-21T17:48:46+02:00
[INFO] ------------------------------------------------------------------------

… vs…

$ mvn org.codehaus.mojo:exec-maven-plugin:exec@build-depsrcs@build-depsrcs -Dexec.useMavenLogger=true
[INFO] Scanning for projects...
[INFO] 
[INFO] --------------------< org.evolvis.tartools:csvfile >--------------------
[INFO] Building org.evolvis.tartools:csvfile 3.0-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
[INFO] 
[INFO] --- exec-maven-plugin:3.0.0:exec (build-depsrcs) @ csvfile ---
[INFO] [main] mᅢᄂh
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  1.685 s
[INFO] Finished at: 2020-06-21T17:49:00+02:00
[INFO] ------------------------------------------------------------------------

Watch it completely destroy the umlaut.

IMHO, the conversion between script output and string passed to the logger SHOULD use the locale encoding, and if that is not possible or ASCII, it MUST use UTF-8.
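
A minimal sketch of the charset choice proposed above: prefer the locale encoding, and fall back to UTF-8 when it is unavailable or plain ASCII. The class and method names (`OutputCharsetChoice`, `choose`, `decode`) are hypothetical and are not taken from the plugin's source.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class OutputCharsetChoice {
    // Hypothetical helper: prefer the locale charset, fall back to UTF-8
    // when it is unknown (null) or plain ASCII.
    static Charset choose(Charset localeCharset) {
        if (localeCharset == null || StandardCharsets.US_ASCII.equals(localeCharset)) {
            return StandardCharsets.UTF_8;
        }
        return localeCharset;
    }

    // Decode the raw bytes of one output line before handing it to a logger.
    static String decode(byte[] raw, Charset localeCharset) {
        return new String(raw, choose(localeCharset));
    }

    public static void main(String[] args) {
        byte[] raw = {'m', (byte) 0xC3, (byte) 0xA4, 'h'}; // UTF-8 bytes of "mäh"
        // Assumption: Charset.defaultCharset() stands in for the locale encoding;
        // on JDK 18 and later it defaults to UTF-8 regardless of the locale.
        System.out.println(decode(raw, Charset.defaultCharset()));
    }
}
```

Passing the charset in as a parameter keeps the fallback rule testable without changing the process locale.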

Cc @hankolerd

mirabilos avatar Jun 21 '20 15:06 mirabilos

The wide characters in question are U+FFC3 and U+FFA4, which is what you get when you take the UTF-8 encoding of ä (\xC3\xA4), read it byte by byte, and (wrongly) sign-extend each byte to a Unicode code unit.
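
To illustrate the diagnosis, here is a small sketch of how sign extension turns the two UTF-8 bytes into exactly those code units. It only demonstrates the mechanism; whether the plugin's code looked like this is an assumption, and `SignExtensionDemo` is a made-up name.

```java
import java.nio.charset.StandardCharsets;

final class SignExtensionDemo {
    // Buggy widening: a Java byte is signed, so casting 0xC3 (-61) to char
    // sign-extends it to 0xFFC3 instead of treating it as part of a UTF-8 sequence.
    static String widenSigned(byte[] raw) {
        StringBuilder sb = new StringBuilder();
        for (byte b : raw) {
            sb.append((char) b);
        }
        return sb.toString();
    }

    // Correct handling: decode the whole byte sequence with the right charset.
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = {'m', (byte) 0xC3, (byte) 0xA4, 'h'}; // UTF-8 bytes of "mäh"
        for (char c : widenSigned(raw).toCharArray()) {
            System.out.printf("U+%04X ", (int) c); // code units after sign extension
        }
        System.out.println();
        for (char c : decodeUtf8(raw).toCharArray()) {
            System.out.printf("U+%04X ", (int) c); // code units after proper decoding
        }
        System.out.println();
    }
}
```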

@hankolerd

mirabilos avatar Jun 21 '20 15:06 mirabilos

Which version contains the fix? (From a user’s PoV, it’s better to keep bug reports open until they can actually install a fixed version.)

But thanks for fixing it.

mirabilos avatar Jul 10 '23 23:07 mirabilos

@mirabilos It has been released in 3.1.1

jebeaudet avatar Nov 20 '23 22:11 jebeaudet

OK.

As a test case, I tried writing a latin-1 mäh first, then a UTF-8 mäh.

With LC_ALL=C.UTF-8:

[INFO] m�h
[INFO] mäh

With LC_ALL=C:

[INFO] m?h
[INFO] m??h

So it’s definitely interpreting the bytes into wide characters then converting them back to current-locale multibyte characters. This is precisely the follow-up bug I already warned about… but it’s an improvement from the situation before, at least.
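
The four outputs above are consistent with a decode-then-re-encode round trip through the locale charset. The sketch below reproduces that round trip in isolation, assuming UTF-8 for `LC_ALL=C.UTF-8` and US-ASCII for `LC_ALL=C`; `LocaleRoundTripDemo` is a hypothetical name, not plugin code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class LocaleRoundTripDemo {
    // Decode with the locale charset (bad input becomes U+FFFD), then encode
    // back with the same charset (unmappable characters become '?').
    static byte[] roundTrip(byte[] raw, Charset locale) {
        String decoded = new String(raw, locale);
        return decoded.getBytes(locale);
    }

    // Render the bytes as a terminal using that charset would interpret them.
    static String show(byte[] raw, Charset locale) {
        return new String(roundTrip(raw, locale), locale);
    }

    public static void main(String[] args) {
        byte[] latin1 = {'m', (byte) 0xE4, 'h'};            // latin-1 "mäh"
        byte[] utf8 = {'m', (byte) 0xC3, (byte) 0xA4, 'h'}; // UTF-8 "mäh"
        for (Charset cs : new Charset[] {StandardCharsets.UTF_8, StandardCharsets.US_ASCII}) {
            System.out.println(cs + ": " + show(latin1, cs) + " / " + show(utf8, cs));
        }
    }
}
```

Under these assumptions, each undecodable byte costs one replacement character, which would explain why the two-byte UTF-8 ä shows up as two question marks in the C locale.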

mirabilos avatar Nov 22 '23 17:11 mirabilos

As Slawomir said, comments on commits are easily missed; I never saw that thread.

As for the issue, I'll repeat what he said: you can open a PR with a tentative fix.

jebeaudet avatar Nov 22 '23 17:11 jebeaudet

Hi, this issue is also closed, so if something is still wrong, please:

  • create a new issue with a description and better reproduction steps
  • create a PR with a proposed fix

Comments on closed issues can also be missed.

slawekjaranowski avatar Nov 22 '23 19:11 slawekjaranowski