coreutils
od incorrectly handles non-ASCII chars
Trying to rebuild groff on current Ubuntu devel with coreutils-uutils leads to test errors: https://launchpadlibrarian.net/798977884/buildlog_ubuntu-questing-amd64.groff_1.23.0-9_BUILDING.txt.gz
The problem seems to be due to how "od" handles non-ASCII chars.
Basic example
$ od --version
od (GNU coreutils) 9.5
$ echo '’' | LC_ALL=C od -t c
0000000 342 200 231 \n
0000004
but
$ od --version
od 0.0.30
$ echo '’' | LC_ALL=C od -t c
0000000 ’ ** ** \n
0000004
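To rule out the terminal or shell encoding as a factor, the same three bytes can be fed in as explicit octal escapes. This is a minimal sketch, not part of the original report: `dump_octal` is a hypothetical helper name, and it assumes a POSIX `printf` and an `od` that accepts `-A n` and `-t o1`.

```shell
# Hypothetical helper (not from the report): print stdin as one-byte
# octal values with no offset column.
dump_octal() {
    LC_ALL=C od -A n -t o1
}

# U+2019 (right single quotation mark) is the UTF-8 byte sequence
# 342 200 231 in octal; 012 is the trailing newline.
printf '\342\200\231\n' | dump_octal

# The same input through -t c. Going by the GNU output quoted above, the
# three non-ASCII bytes are expected to show up as octal in the C locale.
printf '\342\200\231\n' | LC_ALL=C od -A n -t c
```

If the first command prints the same bytes under both implementations while the second differs, the input is identical and the difference lies only in how `-t c` renders bytes above 0x7f.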
A longer example from groff, showing the same issue with other characters:
$ echo '!#$%&()*+,./0123456789:;<=>?@ ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_ abcdefghijklmnopqrstuvwxyz{|} neutral double quote: " closing single quote: ’ hyphen: ‐ backslash: \ modifier circumflex: ˆ opening single quote: ‘ modifier tilde: ˜ end output' | LC_ALL=C od -t c
0000000 ! # $ % & ( ) * + , . / 0 1 2 3
0000020 4 5 6 7 8 9 : ; < = > ? @ A B
0000040 C D E F G H I J K L M N O P Q R
0000060 S T U V W X Y Z [ ] _ a b c d
0000100 e f g h i j k l m n o p q r s t
0000120 u v w x y z { | } n e u t r a
0000140 l d o u b l e q u o t e :
0000160 " c l o s i n g s i n g l e
0000200 q u o t e : 342 200 231 h y p h
0000220 e n : 342 200 220 b a c k s l a s
0000240 h : \ m o d i f i e r c i
0000260 r c u m f l e x : 313 206 o p e
0000300 n i n g s i n g l e q u o t
0000320 e : 342 200 230 m o d i f i e r
0000340 t i l d e : 313 234 e n d o u
0000360 t p u t \n
0000365
vs. od 0.0.30:
$ echo '!#$%&()*+,./0123456789:;<=>?@ ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_ abcdefghijklmnopqrstuvwxyz{|} neutral double quote: " closing single quote: ’ hyphen: ‐ backslash: \ modifier circumflex: ˆ opening single quote: ‘ modifier tilde: ˜ end output' | LC_ALL=C od -t c
0000000 ! # $ % & ( ) * + , . / 0 1 2 3
0000020 4 5 6 7 8 9 : ; < = > ? @ A B
0000040 C D E F G H I J K L M N O P Q R
0000060 S T U V W X Y Z [ ] _ a b c d
0000100 e f g h i j k l m n o p q r s t
0000120 u v w x y z { | } n e u t r a
0000140 l d o u b l e q u o t e :
0000160 " c l o s i n g s i n g l e
0000200 q u o t e : ’ ** ** h y p h
0000220 e n : ‐ ** ** b a c k s l a s
0000240 h : \ m o d i f i e r c i
0000260 r c u m f l e x : ˆ ** o p e
0000300 n i n g s i n g l e q u o t
0000320 e : ‘ ** ** m o d i f i e r
0000340 t i l d e : ˜ ** e n d o u
0000360 t p u t \n
0000365
I'd like to look into fixing this. Obviously, the issue is with multibyte UTF-8 characters, but I don't know yet where the bug is located.
sure, please go ahead :)