coreutils
Error: Windows stdio in console mode does not support writing non-UTF-8 byte sequences
Issue:
It seems head/tail from coreutils 0.10 don't gracefully handle outputting UTF-16 content to the Windows console. This is somewhat annoying because in PowerShell (versions prior to PowerShell 6.0) the redirection operators (>/>>) write files in UTF-16LE by default.
Error: head.exe: error writing 'standard output': Windows stdio in console mode does not support writing non-UTF-8 byte sequences
PS C:\Users\User> echo "foobar" > test
PS C:\Users\User> cat test
───────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ File: test <UTF-16LE>
───────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ foobar
───────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────
PS C:\Users\User> head test
C:\Users\User/.dotfiles/opt/bin\head.exe: error writing 'standard output': Windows stdio in console mode does not support writing non-UTF-8 byte sequences
PS C:\Users\User> tail test
thread 'main' panicked at src\uu\tail\src\tail.rs:431:33:
called `Result::unwrap()` on an `Err` value: Error { kind: InvalidData, message: "Windows stdio in console mode does not support writing non-UTF-8 byte sequences" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
PS C:\Users\User> echo $Host
Name : ConsoleHost
Version : 5.1.26100.4768
InstanceId : 1c7962ab-05f7-417d-a559-f5dd21650b9a
UI : System.Management.Automation.Internal.Host.InternalHostUserInterface
CurrentCulture : en-US
CurrentUICulture : en-US
PrivateData : Microsoft.PowerShell.ConsoleHost+ConsoleColorProxy
DebuggerEnabled : True
IsRunspacePushed : False
Runspace : System.Management.Automation.Runspaces.LocalRunspace
--==-- Testing with https://github.com/JavaScriptDude/cygtail (Cygwin): its tail appears to output the file gracefully.
PS C:\Users\User\Desktop\cygtail-20220415> echo "foobar" > test
PS C:\Users\User\Desktop\cygtail-20220415> ./tail.exe test
□□foobar
Anticipated Result:
head/tail gracefully process files created by simple output redirection (e.g. echo "foobar" > test). I'm not sure whether that means inserting unknown-character glyphs or similar, or converting on the fly to stay within the presumed UTF-8 limit of the Windows console.
This is a consequence of Rust's standard library needing to convert to UTF-16 for console output: without knowing the encoding, it always assumes UTF-8. The restriction on the encoding could be relaxed (e.g. by being lossy), but that requires changes in the standard library unless uutils wants to work around it by doing console output manually.
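To make the "lossy" option concrete, here is a minimal sketch of what a lossy fallback could look like on the uutils side. The function name `write_lossy` is hypothetical and this is not how the standard library or uutils currently behaves; it only illustrates replacing invalid UTF-8 bytes with U+FFFD before writing, so the write cannot fail on encoding grounds.

```rust
use std::io::{self, Write};

/// Hypothetical helper: write `bytes` to `out`, replacing every invalid
/// UTF-8 sequence with U+FFFD (the replacement character) instead of
/// returning an error.
fn write_lossy<W: Write>(out: &mut W, bytes: &[u8]) -> io::Result<()> {
    // `from_utf8_lossy` borrows the input when it is already valid UTF-8
    // and only allocates when a replacement is needed.
    let text = String::from_utf8_lossy(bytes);
    out.write_all(text.as_bytes())
}
```

For the UTF-16LE file from the report, the two BOM bytes (FF FE) would each turn into a replacement character and the interleaved NUL bytes would pass through unchanged, so the output is readable for ASCII content but not a correct rendering of the text.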
I think lossy console output is fine as long as false positives for console detection are rare OR we can always override what it does.
That said, I’m not sure relying on the "graceful" behaviour of Cygwin is the best way, since the two squares look an awful lot like a byte order mark to me. A BOM is likely undesirable on other systems, but on Windows the BOM is/was king, and we should probably do very lightweight encoding detection with it. I do hope it dies, but PowerShell 2.x as a preinstall isn’t going away.
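A very lightweight BOM check along these lines could be sketched as follows. The names `BomEncoding` and `sniff_bom` are made up for illustration, and the sketch deliberately covers only the UTF-8 and UTF-16 marks; a UTF-32LE BOM (FF FE 00 00) would be reported as UTF-16LE here.

```rust
/// Encodings this sketch can recognise from a leading byte order mark.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BomEncoding {
    Utf8,
    Utf16Le,
    Utf16Be,
}

/// Hypothetical helper: inspect the first bytes of `bytes` and return the
/// encoding implied by the BOM together with the BOM length in bytes,
/// or `None` when no known BOM is present.
fn sniff_bom(bytes: &[u8]) -> Option<(BomEncoding, usize)> {
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some((BomEncoding::Utf8, 3))
    } else if bytes.starts_with(&[0xFF, 0xFE]) {
        // Note: also matches the first two bytes of a UTF-32LE BOM.
        Some((BomEncoding::Utf16Le, 2))
    } else if bytes.starts_with(&[0xFE, 0xFF]) {
        Some((BomEncoding::Utf16Be, 2))
    } else {
        None
    }
}
```

Returning the BOM length lets the caller skip the mark before decoding or printing, and input without a BOM falls through to `None` so it can be treated as it is today.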
~~We can force windows to use UTF-8 stdin/stdout for our application. Here is an example code for that:~~
fn force_utf8_on_windows() {
    use windows_sys::Win32::Foundation::TRUE;
    use windows_sys::Win32::Globalization::{CP_UTF8, IsValidCodePage};
    use windows_sys::Win32::System::Console::{SetConsoleCP, SetConsoleOutputCP};
    // SAFETY: argument passed by value and result is checked
    if unsafe { IsValidCodePage(CP_UTF8) } != TRUE {
        return;
    }
    // set stdin codepage
    // SAFETY: argument passed by value and result is checked
    if unsafe { SetConsoleCP(CP_UTF8) } != TRUE {
        return;
    };
    // set stdout codepage
    // SAFETY: argument passed by value and result is checked
    if unsafe { SetConsoleOutputCP(CP_UTF8) } != TRUE {
        return;
    };
}
~~This way Windows handles the BOM mark and (I assume) encoding by itself:~~
PS C:\__git\coreutils> .\target\debug\coreutils.exe head test
foobar
~~This sets chcp for the whole console host session, so we may need to get previous codepage via GetConsoleCP and set it back on exit.~~
~~Scratch all of that: when running in PowerShell, the code page is cached at session start.
Changing the code page via SetConsole*CP lets the console handle the BOM, but UTF-16 content doesn't work anyway.~~
Here's an example of how to write raw bytes to stdout via the Win32 API for Cygwin-like behaviour (this fails to print non-ASCII characters properly): raw_stdout.txt
TLDR: Conhost expects UTF-8 output, so the only way to correctly output to console (without squares and with proper encoding of non-ascii stuff) is to convert whatever we are given to UTF-8.
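As a rough illustration of that conversion step, the sketch below decodes UTF-16LE bytes into a UTF-8 `String` using only the standard library. The function name `utf16le_to_utf8` is hypothetical, and it assumes the caller already knows the input is UTF-16LE (for example from a BOM); it is not taken from uutils.

```rust
/// Hypothetical helper: decode UTF-16LE bytes (with or without a leading
/// BOM) into a UTF-8 `String`. Invalid input is replaced with U+FFFD
/// rather than reported as an error.
fn utf16le_to_utf8(bytes: &[u8]) -> String {
    // Drop the UTF-16LE byte order mark (FF FE) if present.
    let body = if bytes.starts_with(&[0xFF, 0xFE]) {
        &bytes[2..]
    } else {
        bytes
    };

    // Reassemble little-endian byte pairs into 16-bit code units.
    let pairs = body.chunks_exact(2);
    let has_stray_byte = !pairs.remainder().is_empty();
    let units = pairs.map(|pair| u16::from_le_bytes([pair[0], pair[1]]));

    // Decode surrogate pairs; unpaired surrogates become U+FFFD.
    let mut text: String = char::decode_utf16(units)
        .map(|unit| unit.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect();

    // An odd trailing byte cannot form a code unit; mark it as lost.
    if has_stray_byte {
        text.push(char::REPLACEMENT_CHARACTER);
    }
    text
}
```

The resulting `String` is valid UTF-8 by construction, so it can be handed to the normal stdout path without tripping the console-mode check. A real implementation would more likely convert in a streaming fashion than buffer the whole input.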
User-side workaround for profile.ps1 (forces all PowerShell cmdlets to write in UTF-8):
$PSDefaultParameterValues['*:Encoding'] = 'utf8'
This one might also be needed on older versions (forces PowerShell/stdin/stdout to use UTF-8, which is the default in PowerShell Core 6+):
$OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding
Added: This also affects cat