Support UTF-16

Open libove opened this issue 2 months ago • 4 comments

This was in the legacy "open feature requests" file, but I don't find it discussed in any Issue - open or otherwise - here on github, nor do I find any workarounds by searching the Internet in general. I imagine it must have been discussed, and that I'm just being inadequate this morning, so if this is already discussed elsewhere, please slap me appropriately with a pointer to where, thanks ...

I use less (from Cygwin, presently version "less 668 (PCRE regular expressions)") on Windows 11, both at the native 'DOS' command prompt and within a Cygwin bash shell, and I find that many files generated by Windows native programs seem to be UTF-16 encoded, which less doesn't seem to be able to handle, e.g.:

$ file TodoBackupService.exe_tbc.log
TodoBackupService.exe_tbc.log: Unicode text, UTF-16, little-endian text, with CRLF line terminators

$ less -X TodoBackupService.exe_tbc.log
"TodoBackupService.exe_tbc.log" may be a binary file.  See it anyway?
<FF><FE>C^@:^@\^@P^@r^@o^@g^@r^@a^@m^@ ^@F^@i^@l^@e^@s^@ ^@(^@x^@8^@6^@)^@\^@E^@a^@s^@e^@U^@S^@\^@T^@o^@d^@o^@ ^@B^@a^@c^@k^@u^@p^@\^@b^@i^@n^@\^@T^@o^@d^@o^@B^@a^@c^@k^@u^@p^@S^@e^@r^@v^@i^@c^@e^@.^@e^@x^@e^@_^@t^@b^@c^@.^@l^@o^@g^@^M^@

Whereas:

$ more TodoBackupService.exe_tbc.log
C:\Program Files (x86)\EaseUS\Todo Backup\bin\TodoBackupService.exe_tbc.log
***
**
(2025-08-14 15:54:26:982)18284/11972: TBCTrans_GetProcAddress CallS3.dll hMod=700D0000, pFn=700D1B00 lastError=0

In the manual page I find the possibility to instruct less to expect one or another character encoding with the example of $ export LESSCHARSET=utf-8

.. but that still produces:

$ export LESSCHARSET=utf-8
$ less -X TodoBackupService.exe_tbc.log
"TodoBackupService.exe_tbc.log" may be a binary file.  See it anyway?
<FF><FE>C^@:^@\^@P^@r^@o^@g^@r^@a^@m^@ ^@F^@i^@l^@e^@s^@ ^@(^@x^@8^@6^@)^@\^@E^@a^@s^@e^@U^@S^@\^@T^@o^@d^@o^@ ^@B^@a^@c^@k^@u^@p^@\^@b^@i^@n^@\^@T^@o^@d^@o^@B^@a^@c^@k^@u^@p^@S^@e^@r^@v^@i^@c^@e^@.^@e^@x^@e^@_^@t^@b^@c^@.^@l^@o^@g^@^M^@

Whereas attempting: $ export LESSCHARSET=utf-16

.. produces:

$ less -X TodoBackupService.exe_tbc.log
invalid charset name

So, maybe less really still does not support UTF-16? If so, I'd like to re-add the legacy feature request for less to support UTF-16, or, if there is a practical workaround for situations like the one I've described above ("be able to read UTF-16 log and other files generated by many Windows native programs"), a pointer would be appreciated.

libove • Oct 13 '25 06:10

Alas, you have to use iconv -f UTF-16 -t UTF-8 TodoBackupService.exe_tbc.log | less.

polluks • Oct 13 '25 11:10

Correct, UTF-16 is not supported. The man page says that UTF-8 "is the only character set that supports multi-byte characters." You could use a LESSOPEN script to automatically filter the file through iconv when a UTF-16 file is detected.
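
For anyone landing here later, a LESSOPEN filter along those lines might look roughly like the sketch below. It is not the script either commenter used: the function name lessopen_utf16 and the install path in the usage note are made up for illustration, and it assumes a POSIX od plus an iconv that accepts the encoding names UTF-16 and UTF-8. It only recognises files that start with a byte-order mark (the <FF><FE> seen in the output above), so BOM-less UTF-16 is passed through untouched.

```shell
#!/bin/sh
# Illustrative LESSOPEN input-pipe filter; the name lessopen_utf16 is hypothetical.
# Assumes POSIX od and an iconv that understands "UTF-16" and "UTF-8".

lessopen_utf16() {
    # Read the first two bytes as hex, e.g. "fffe" for a little-endian BOM.
    bom=$(od -An -tx1 -N2 "$1" 2>/dev/null | tr -d ' \n')
    case "$bom" in
        fffe|feff)
            # A BOM is present, so iconv can determine the byte order itself.
            iconv -f UTF-16 -t UTF-8 "$1"
            ;;
        *)
            # Write nothing: less then reads the original file as usual.
            return 0
            ;;
    esac
}

# When run as a script by less, the file name arrives as the first argument.
if [ "$#" -gt 0 ]; then
    lessopen_utf16 "$1"
fi
```

Saved as an executable script, it could be hooked up with something like export LESSOPEN='|/path/to/lessopen-utf16.sh %s' (path hypothetical). The leading | tells less to read the script's standard output; per the less manual, when an input pipe writes nothing, less falls back to the original file, which is why the non-UTF-16 branch stays silent. LESSCHARSET=utf-8 (or a UTF-8 locale) is presumably still needed for the converted text to display properly.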

gwsw • Oct 13 '25 16:10

Thanks, okay, next time I run into a UTF-16-encoded file, I'll test the new LESSOPEN stuff I just whipped up. So, in terms of (re-)registering this as a feature request: does this Issue do that, or is there another formal place to do so? (Even acknowledging that it seems unlikely to happen.) Cheers,

libove • Oct 14 '25 16:10

This issue is fine for the request to support UTF-16 internally. I will leave this open until it is implemented or I decide not to do it.

gwsw • Oct 14 '25 17:10