Some PUA characters don't show

Open takumadnl opened this issue 2 years ago • 1 comments

Hi!

Using less, some Private Use Area characters don't show.

OS: macOS 12.5.1 Monterey
less: v590

[Screenshot: ss_echo_unicodes]

Unicode - Private Use Area

Range            Name
E000 - F8FF      Private Use Area
F0000 - FFFFD    Supplementary Private Use Area-A
100000 - 10FFFD  Supplementary Private Use Area-B

It seems that PUA characters are treated as binary. PUA characters should not be treated as binary, because the Unicode specification does not define their purpose. I would expect PUA characters to be displayed as they are.
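For reference, the three ranges in the table above can be checked with a small shell function. This is only a sketch: `is_pua` is a hypothetical helper written for illustration, not part of less, and it assumes its argument is a bare hexadecimal code point such as `E000`.

```bash
#!/bin/bash
# is_pua is a hypothetical helper, not something less provides.
# It succeeds when the hex code point in $1 lies in one of the three
# Private Use ranges listed in the table above.
is_pua() {
  cp=$((0x$1))   # convert the hex string to an integer
  if [ $((cp >= 0xE000 && cp <= 0xF8FF)) -eq 1 ]; then return 0; fi      # Private Use Area
  if [ $((cp >= 0xF0000 && cp <= 0xFFFFD)) -eq 1 ]; then return 0; fi    # Supplementary PUA-A
  [ $((cp >= 0x100000 && cp <= 0x10FFFD)) -eq 1 ]                        # Supplementary PUA-B
}

# Probe code points on both sides of each range boundary.
for c in DFFF E000 F8FF F900 FFFFD FFFFE 10FFFD 10FFFE; do
  if is_pua "$c"; then
    echo "$c: private use"
  else
    echo "$c: not private use"
  fi
done
```

The check compares against the whole range, not just its first and last code point, so every value between the endpoints is classified the same way.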

FWIW

Running the following script suggests there is also a problem with the range definition of binary characters.

#!/bin/bash
#   D800 -   DBFF: High-Surrogate
#   DC00 -   DFFF: Low-Surrogate
#   E000 -   F8FF: Private Use Area
#   F900 -   FAFF: CJK Compatibility Ideographs
#  EFFFE -  EFFFF: noncharacters
#  F0000 -  FFFFD: Supplementary Private Use Area-A
#  FFFFE -  FFFFF: noncharacters
# 100000 - 10FFFD: Supplementary Private Use Area-B
# 10FFFE - 10FFFF: noncharacters
test_chars=(
  "DFFE"
  "DFFF"   "E000"   "E001"   # PUA start boundary
  "F8FE"   "F8FF"   "F900"   # PUA end   boundary
  "EFFFF"  "F0000"  "F0001"  # SPUA-A start boundary
  "FFFFC"  "FFFFD"  "FFFFE"  # SPUA-A end   boundary
  "FFFFF"  "100000" "100001" # SPUA-B start boundary
  "10FFFC" "10FFFD" "10FFFE" # SPUA-B end   boundary
)

function print_unicodes() {
  # Quote the expansions so each code point is passed as a single word.
  for c in "${test_chars[@]}"
  do
    printf "%6s: \\U${c}\\n" "$c"
  done
}

echo "- without less"
print_unicodes

echo
echo "- with less"
print_unicodes | less --quit-if-one-screen

Note: the screenshot was taken using nerd-fonts.

takumadnl avatar Aug 22 '22 11:08 takumadnl

Well, since Unicode does not define the characteristics of PUA characters, it's not possible to determine the printable width of each character. Any PUA character could be a normal one-space printable character, a zero-width combining or control character, a double-width character, or anything else. Treating them as binary seems the safest way to keep the screen display correct. However, I see your point that in most cases the user would want the characters to display directly. Perhaps the LESSCHARDEF syntax could be extended to let the user specify how each PUA character should be treated.

gwsw avatar Aug 22 '22 17:08 gwsw

Commit dc4fa8c8c47dce999b9fdbd841f16b503b7d8632 adds the environment variable LESSUTFCHARDEF, which can be used to set the type of Private Use (or any other) characters. Note that prior to this change there was a bug where only the two characters U+E000 and U+F8FF were treated as control characters, although the intention was for all characters numerically between them to be treated the same way. This has been fixed, so it is now necessary to set LESSUTFCHARDEF if any PUA characters are to be treated as printable.

gwsw avatar Sep 25 '22 03:09 gwsw
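
A minimal sketch of how the new variable might be set, assuming a less build that includes the commit above. It also assumes that LESSUTFCHARDEF takes a comma-separated list of `range:type` entries with hexadecimal ranges, where `p` marks a range as printable, as the less man page describes; check the man page for your version before relying on it. The variable names `def` and `r` are illustrative only.

```bash
#!/bin/bash
# Assumption: LESSUTFCHARDEF is a comma-separated list of HEXRANGE:TYPE
# entries, where type "p" means printable. Verify against "man less".
def=""
for r in E000-F8FF F0000-FFFFD 100000-10FFFD; do
  def="${def:+$def,}$r:p"   # append "RANGE:p", comma-separated
done
export LESSUTFCHARDEF="$def"
echo "LESSUTFCHARDEF=$LESSUTFCHARDEF"

# Try it on U+E000 (UTF-8 bytes EE 80 80, written as octal escapes).
# Older versions of less simply ignore the variable.
if command -v less >/dev/null 2>&1; then
  printf 'E000: \356\200\200\n' | less --quit-if-one-screen
fi
```

Marking a range as `p` tells less to assume each character occupies one screen cell. If the font in use draws some glyphs wider than that, the display may still be misaligned, which is the concern raised in the first reply.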