
Suggestion: counting chars vs. counting bytes

Open mogando668 opened this issue 3 years ago • 2 comments

awk 'length($1) < 6' table.txt
echo 'αλεπού' | awk '{print length()}'
echo 'αλεπού' | awk -b '{print length()}'
echo 'αλεπού' | LC_ALL=C awk '{print length()}'

One doesn't need to set LC_ALL=C or enable byte mode (-b) just to count the exact number of bytes in the input.

Even in gawk's Unicode mode, use

- length(str)

  to count UTF-8 characters, and

- match(str, /$/) - 1

  to count bytes.

This works because the regex requests a match of the empty string at the end of the input. Since no characters are matched along the way, gawk falls back to reporting RSTART as a byte offset. The minus 1 is essential because RSTART would otherwise point one position past the last byte of the input.
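As a quick cross-check of the match() trick above, the result can be compared against wc -c. This is only a minimal sketch: the function names bytes_via_match and bytes_via_wc are made up for illustration, and what match() reports for non-ASCII input depends on the awk implementation, its version, and the active locale.

```shell
# Hypothetical helper names, not part of the original post.

# Byte count via the match() trick: RSTART of an empty match at the
# end of the record, minus 1.
bytes_via_match() {
    printf '%s\n' "$1" | awk '{ print match($0, /$/) - 1 }'
}

# Reference byte count from wc -c. printf is used without a trailing
# newline so that only the string itself is counted; tr strips the
# padding that some wc implementations add.
bytes_via_wc() {
    printf '%s' "$1" | wc -c | tr -d ' '
}

bytes_via_match 'αλεπού'
bytes_via_wc 'αλεπού'
```

If both numbers agree in a UTF-8 locale, the trick behaves as described on that particular gawk build.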

You can feed binary files such as .MP3, .MP4, .XZ or .PNG directly to gawk in Unicode mode, and it will give you the byte count without any error messages.

That said, only the match(str, /$/) form stays silent when you throw binary data at gawk in Unicode mode. length( ) will definitely complain, and so will match(str, /.$/).

Note the dot . right before $: on valid UTF-8 input this call is equivalent to length( ), but on random bytes it will definitely produce the locale error message. So it can't be used to circumvent length( )'s error message on pure binary input; one needs to code up an alternative approach to count characters there, e.g. via gsub( ).

It took me a while to code it up myself, but now I can get byte mode to count UTF-8 characters, and get Unicode mode to take in binary data directly, with both reporting counts identical to GNU wc.
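The gsub( )-based approach is only mentioned here, not shown. One possible way to count UTF-8 characters while awk works on raw bytes is to subtract the continuation bytes (0x80 to 0xBF) from the byte length. The sketch below is an assumption about what such a counter could look like, not the code referred to above; the function name utf8_chars_in_byte_mode is hypothetical, and it presumes valid UTF-8 input on a single line.

```shell
# Hypothetical helper, not part of the original post.
# Counts UTF-8 characters with awk forced into the C locale, where
# length() and regex ranges operate on single bytes.
utf8_chars_in_byte_mode() {
    printf '%s\n' "$1" | LC_ALL=C awk '{
        bytes = length($0)                  # total bytes in the record
        cont  = gsub(/[\200-\277]/, "&")    # continuation bytes 0x80-0xBF
        print bytes - cont                  # every other byte starts a character
    }'
}

utf8_chars_in_byte_mode 'αλεπού'
```

Replacing each match with "&" leaves the record unchanged, so gsub( ) serves purely as a counter here.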

mogando668 avatar Mar 21 '22 18:03 mogando668

I'm surprised by this:

$ echo 'αλεπού' | awk '{print match($0, /$/)}'
13
$ echo 'αλεπού' | awk '{print match($0, /λ/)}'
2
$ echo 'αλεπού' | awk '{print match($0, /ε/)}'
3

Will have to look it up, and probably add an example/note for the next version of the book. Thanks.

learnbyexample avatar Mar 23 '22 13:03 learnbyexample

You’re most welcome. I only discovered this by accident a year ago; it isn’t mentioned in gawk’s official documentation.

Regards, Jason K


mogando668 avatar Mar 28 '22 02:03 mogando668

Added a note in version 2.0 with a link to this issue.

learnbyexample avatar Aug 22 '23 03:08 learnbyexample