less
Some PUA characters don't show
Hi!
When using less, some Private Use Area (PUA) characters are not displayed.
OS: macOS 12.5.1 Monterey
less: v590
[screenshot: ss_echo_unicodes]
Unicode - Private Use Area
Range | Name |
---|---|
E000 - F8FF | Private Use Area |
F0000 - FFFFD | Supplementary Private Use Area-A |
100000 - 10FFFD | Supplementary Private Use Area-B |
It seems that PUA characters are treated as binary. PUA characters should not be treated as binary, because the Unicode specification does not define their purpose. I would expect PUA characters to be displayed as they are.
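For reference, the three ranges in the table above can be checked with a small shell helper. This is only a sketch: `is_pua` is a hypothetical name made up for illustration, not part of less or of any standard tool, and it assumes its argument is a bare hex code point such as `E000`.

```bash
#!/bin/bash
# Hypothetical helper (not part of less): succeeds if the hex code point
# passed as $1 lies in one of the three Private Use ranges from the table.
is_pua() {
    pua_cp=$((0x$1))    # parse the argument as a hexadecimal number
    [ "$pua_cp" -ge $((0xE000)) ]   && [ "$pua_cp" -le $((0xF8FF)) ]   && return 0
    [ "$pua_cp" -ge $((0xF0000)) ]  && [ "$pua_cp" -le $((0xFFFFD)) ]  && return 0
    [ "$pua_cp" -ge $((0x100000)) ] && [ "$pua_cp" -le $((0x10FFFD)) ] && return 0
    return 1
}

# Classify a few code points on either side of each range boundary.
for cp in DFFF E000 F8FF F900 EFFFF F0000 FFFFD FFFFE 100000 10FFFD 10FFFE
do
    if is_pua "$cp"; then
        echo "U+$cp: private use"
    else
        echo "U+$cp: not private use"
    fi
done
```

Under that reading of the table, only the code points inside the ranges should be candidates for "display as is".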
FWIW
Running the following script, there also seems to be a problem with the range definition of binary characters.
#!/bin/bash
# D800 - DBFF: High-Surrogate
# DC00 - DFFF: Low-Surrogate
# E000 - F8FF: Private Use Area
# F900 - FAFF: CJK Compatibility Ideographs
# EFFFE - EFFFF: noncharacters
# F0000 - FFFFD: Supplementary Private Use Area-A
# FFFFE - FFFFF: noncharacters
# 100000 - 10FFFD: Supplementary Private Use Area-B
# 10FFFE - 10FFFF: noncharacters
test_chars=(
"DFFE"
"DFFF" "E000" "E001" # PUA start boundary
"F8FE" "F8FF" "F900" # PUA end boundary
"EFFFF" "F0000" "F0001" # SPUA-A start boundary
"FFFFC" "FFFFD" "FFFFE" # SPUA-A end boundary
"FFFFF" "100000" "100001" # SPUA-B start boundary
"10FFFC" "10FFFD" "10FFFE" # SPUA-B end boundary
)
function print_unicodes() {
    for c in "${test_chars[@]}"
    do
        printf "%6s: \\U${c}\\n" "$c"
    done
}
echo "- without less"
print_unicodes
echo
echo "- with less"
print_unicodes | less --quit-if-one-screen
[screenshot: output of the script above]
Note: the screenshot uses nerd-fonts.
Well, since Unicode does not define the characteristics of PUA characters, it's not possible to determine the printable size of each character. Any PUA character could be a normal one-space printable character, or it could be a combining or control character (zero width) or a double-width character, or anything else. Treating them as binary seems the safest as far as maintaining the screen display correctly. However, I see your point that in most cases the user would want the characters to display directly. Perhaps there could be an extension to the LESSCHARDEF syntax that would allow the user to specify how each PUA character should be treated.
Commit dc4fa8c8c47dce999b9fdbd841f16b503b7d8632 adds the environment variable LESSUTFCHARDEF, which can be used to set the type of Private Use (or any) characters. Note that prior to this change there was a bug where only the two characters U+E000 and U+F8FF were treated as control characters, but the intention was for all characters numerically between them to be similarly treated. This has been fixed, so it is now necessary to set LESSUTFCHARDEF if any PUA characters are to be treated as printable.
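As a rough sketch of what that setting might look like: the snippet below marks the three Private Use ranges as printable. The value syntax (comma-separated `range:type` pairs, with `p` meaning a normal printable character) is an assumption taken from the less man page rather than from this thread, so check it against the man page of the version you have installed. The file name in the comment is a placeholder.

```bash
#!/bin/bash
# Assumed syntax: comma-separated "HEXRANGE:TYPE" entries, where the type
# letter "p" means "normal printable character" (per the less man page).
# Verify against your installed version before relying on it.
LESSUTFCHARDEF='E000-F8FF:p,F0000-FFFFD:p,100000-10FFFD:p'
export LESSUTFCHARDEF

# less reads the variable from its environment, so a later invocation in
# this shell would pick it up, for example:
#   less some_file.txt
echo "LESSUTFCHARDEF=$LESSUTFCHARDEF"
```

If some of your PUA glyphs are double-width (as with certain icon fonts), the man page describes other type letters for wide, composing, and binary characters; which one fits depends on the font, which is exactly the ambiguity described earlier in this thread.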