
od incorrectly handles non-ASCII chars

Open seb128 opened this issue 5 months ago • 2 comments

Trying to rebuild groff on current Ubuntu devel with coreutils-uutils leads to test errors: https://launchpadlibrarian.net/798977884/buildlog_ubuntu-questing-amd64.groff_1.23.0-9_BUILDING.txt.gz

The problem seems to be caused by how "od" handles non-ASCII chars.

Basic example

$ od --version
od (GNU coreutils) 9.5

$ echo '’' | LC_ALL=C od -t c
0000000 342 200 231  \n
0000004

but

$ od --version
od 0.0.30

$ echo '’' | LC_ALL=C od -t c
0000000   ’  **  **  \n
0000004
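
For reference, `’` is U+2019, whose UTF-8 encoding is the three bytes E2 80 99 (octal 342 200 231). That matches what GNU od prints above. A minimal sketch for feeding exactly those bytes without depending on the terminal's encoding; the helper name `quote_bytes` is made up for illustration:

```shell
# Hypothetical helper: emit the UTF-8 bytes of U+2019 (’) plus a newline.
# printf octal escapes avoid any dependence on the shell's or terminal's encoding.
quote_bytes() {
    printf '\342\200\231\n'
}

# Plain octal dump, one byte per column; no character interpretation is involved.
# Prints: 342 200 231 012
quote_bytes | od -An -t o1

# Character dump in the C locale; this is the invocation where the two
# implementations in this report disagree.
quote_bytes | LC_ALL=C od -t c
```

The `-t o1` dump should look the same with either implementation, so it is a handy baseline when comparing the `-t c` output.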

Longer version from groff showing the same issue with other chars. With GNU od:

$ echo '!#$%&()*+,./0123456789:;<=>?@ ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_ abcdefghijklmnopqrstuvwxyz{|} neutral double quote: " closing single quote: ’ hyphen: ‐ backslash: \ modifier circumflex: ˆ opening single quote: ‘ modifier tilde: ˜ end output' | LC_ALL=C od -t c
0000000   !   #   $   %   &   (   )   *   +   ,   .   /   0   1   2   3
0000020   4   5   6   7   8   9   :   ;   <   =   >   ?   @       A   B
0000040   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R
0000060   S   T   U   V   W   X   Y   Z   [   ]   _       a   b   c   d
0000100   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t
0000120   u   v   w   x   y   z   {   |   }       n   e   u   t   r   a
0000140   l       d   o   u   b   l   e       q   u   o   t   e   :    
0000160   "       c   l   o   s   i   n   g       s   i   n   g   l   e
0000200       q   u   o   t   e   :     342 200 231       h   y   p   h
0000220   e   n   :     342 200 220       b   a   c   k   s   l   a   s
0000240   h   :       \       m   o   d   i   f   i   e   r       c   i
0000260   r   c   u   m   f   l   e   x   :     313 206       o   p   e
0000300   n   i   n   g       s   i   n   g   l   e       q   u   o   t
0000320   e   :     342 200 230       m   o   d   i   f   i   e   r    
0000340   t   i   l   d   e   :     313 234       e   n   d       o   u
0000360   t   p   u   t  \n
0000365

vs od 0.0.30:

$ echo '!#$%&()*+,./0123456789:;<=>?@ ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_ abcdefghijklmnopqrstuvwxyz{|} neutral double quote: " closing single quote: ’ hyphen: ‐ backslash: \ modifier circumflex: ˆ opening single quote: ‘ modifier tilde: ˜ end output' | LC_ALL=C od -t c
0000000   !   #   $   %   &   (   )   *   +   ,   .   /   0   1   2   3
0000020   4   5   6   7   8   9   :   ;   <   =   >   ?   @       A   B
0000040   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R
0000060   S   T   U   V   W   X   Y   Z   [   ]   _       a   b   c   d
0000100   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t
0000120   u   v   w   x   y   z   {   |   }       n   e   u   t   r   a
0000140   l       d   o   u   b   l   e       q   u   o   t   e   :    
0000160   "       c   l   o   s   i   n   g       s   i   n   g   l   e
0000200       q   u   o   t   e   :       ’  **  **       h   y   p   h
0000220   e   n   :       ‐  **  **       b   a   c   k   s   l   a   s
0000240   h   :       \       m   o   d   i   f   i   e   r       c   i
0000260   r   c   u   m   f   l   e   x   :       ˆ  **       o   p   e
0000300   n   i   n   g       s   i   n   g   l   e       q   u   o   t
0000320   e   :       ‘  **  **       m   o   d   i   f   i   e   r    
0000340   t   i   l   d   e   :       ˜  **       e   n   d       o   u
0000360   t   p   u   t  \n
0000365

seb128, Jun 17 '25 09:06

I'd like to look into fixing this. Obviously, the issue is with multibyte UTF-8 characters, but I don't know yet where the bug is located.
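
A quick check for whichever `od` is first on `PATH`, as a rough sketch only: it assumes the expected C-locale behaviour is the one GNU od shows above (each byte of the multibyte sequence printed as three-digit octal), and the function name `od_c_mode` is made up for illustration.

```shell
# Hypothetical helper: classify how the od on PATH renders the bytes of U+2019
# under LC_ALL=C with -t c.
od_c_mode() {
    # -An drops the offset column so only the rendered bytes remain.
    if printf '\342\200\231\n' | LC_ALL=C od -An -t c | grep -q '342 *200 *231'; then
        echo bytewise   # each byte shown as octal, as in the GNU output above
    else
        echo other      # anything else, e.g. the "’  **  **" form from od 0.0.30
    fi
}

od_c_mode
```

Something along these lines could serve as a starting point for a regression test once the bug is located.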

tgrez, Jun 17 '25 12:06

sure, please go ahead :)

sylvestre, Jun 17 '25 14:06