
Data passed through a pipe will be parsed differently in PowerShell 5.1 vs current release

Open lclutz opened this issue 2 years ago • 22 comments

Steps to reproduce

When a native application outputs UTF-8, PowerShell fails to parse the piped data correctly even when a BOM is present. In contrast, this works in PowerShell 5.1, which is why I'm reporting this as a bug.

The following Python script produces UTF-8-encoded output with a BOM.

#!/usr/bin/env python3
import sys
sys.stdout.buffer.write("äöüß αβγδ\n".encode("utf_8_sig"))

PowerShell 5.1 is able to recognise that the output is UTF-8 encoded when the BOM is present. PowerShell 7.2.4 is not.

Note that both shells were started with the -noprofile option to ensure default configuration.
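As a quick sanity check on the repro (illustrative code, not from the original report): Python's `utf_8_sig` codec really does prepend the three-byte UTF-8 BOM (EF BB BF) to the payload.

```python
# Sanity check: "utf_8_sig" prepends the UTF-8 byte order mark
# EF BB BF; the rest of the bytes are plain UTF-8.
data = "äöüß αβγδ\n".encode("utf_8_sig")

print(data[:3].hex())                    # efbbbf
print(data[3:].decode("utf-8"), end="")  # äöüß αβγδ
```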

Expected behavior

PS> python .\test.py | echo
äöüß αβγδ

Actual behavior

PS> python .\test.py | echo
´╗┐├ñ├Â├╝├ƒ ╬▒╬▓╬│╬┤

Error details

No response

Environment data

PS> $PSVersionTable

Name                           Value
----                           -----
PSVersion                      7.2.4
PSEdition                      Core
GitCommitId                    7.2.4
OS                             Microsoft Windows 10.0.19044
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0

Visuals

No response

lclutz avatar Jun 13 '22 12:06 lclutz

PowerShell relies on System.Diagnostics.Process to parse the output from a pipe, and in the absence of an explicit Standard*Encoding property in the start info it falls back on the global Console.OutputEncoding setting to determine what encoding is used. On Windows the default console encoding is still whatever the OS is configured to use, which is typically code page 437 on English hosts.

Unfortunately your only workaround here is to set [Console]::OutputEncoding = [System.Text.Encoding]::UTF8 and then run your command:

[Console]::OutputEncoding = [System.Text.Encoding]::UTF8
python .\test.py | echo

Even then it's still going to read the BOM as text rather than as a marker, so it's not perfect, but you at least get the proper string back.
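The mojibake in the repro can be reconstructed directly (a sketch, assuming the cp850/ibm850 OEM code page that the reporter's console defaults to): decoding the UTF-8 bytes, BOM included, with that code page yields exactly the garbled string shown above.

```python
# Decode UTF-8 bytes (BOM included) with the OEM code page cp850,
# mimicking the console-encoding mismatch described above.
raw = "äöüß αβγδ".encode("utf_8_sig")
print(raw.decode("cp850"))  # ´╗┐├ñ├Â├╝├ƒ ╬▒╬▓╬│╬┤
```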

Also keep in mind that not all executables use the console output encoding when writing to stdout/stderr. Some ignore the setting entirely and always use UTF-8; others use some other setting somewhere to control this. PowerShell just uses whatever the console encoding is set to (rightly so, IMO). If you want to deal with UTF-8 everywhere, I recommend adding this to your profile:

# UTF8Encoding is used instead of the UTF8 field to set a BOM-less encoding for writing to stdin
[Console]::OutputEncoding = [Console]::OutputEncoding = $OutputEncoding = [System.Text.UTF8Encoding]::new($false)

Python is a clear example of a program that doesn't use the console code page. It actually uses the system locale, which on US English hosts is windows-1252. If your example had written text to sys.stdout instead of using sys.stdout.buffer, getting UTF-8 back would require even more work on the Python side. You can force a specific encoding with the PYTHONIOENCODING env var, or call Python with python.exe -X utf8 to use UTF-8, e.g.

$env:PYTHONIOENCODING = 'utf-8'
[Console]::OutputEncoding = [System.Text.Encoding]::UTF8
python -c "print('äöüß αβγδ')" | echo

$env:PYTHONIOENCODING = ''  # Unsets env var for test
python -X utf8 -c "print('äöüß αβγδ')" | echo

jborean93 avatar Jun 13 '22 19:06 jborean93

Hey Jordan!

Thank you for your explanation of what's going on behind the scenes.

To be clear, I am not really advocating adding a BOM to UTF-8 output. I only did it to signal to PowerShell what encoding I'm using.

I knew about [Console]::OutputEncoding, but to be honest I viewed setting it more as a workaround. I think it is preferable to provide a "marker", for lack of a better term, that helps PowerShell process the data regardless of user settings, rather than telling the end users of my scripts to change their output encoding before running them.

I wanted the code for reproducing the issue to be short, but what I am actually doing is checking whether the output handle is a console window or a pipe. If it is a console window, I set the code page to UTF-8 and output UTF-8 without a BOM to avoid the artifact you mentioned. If it is a pipe, I output UTF-8 with a BOM to allow PowerShell to parse it correctly. As I said, this works in version 5.1 but not in the current release.

If this difference in the behaviour of the pipes between PowerShell 5.1 and the current release is in fact the intended behaviour feel free to close this issue.

lclutz avatar Jun 13 '22 20:06 lclutz

Unfortunately the code that reads stdout from the new process doesn't utilise the BOM at all. This all happens in dotnet itself and isn't actually controlled by PowerShell. It is interesting that it does work in WinPS, which indicates to me that the change in behaviour occurred in dotnet itself rather than in PowerShell. By the time the output gets to PowerShell it's already a string, so the BOM is lost, or at least you've lost enough detail to be able to re-encode it correctly.

If this difference in the behaviour of the pipes between PowerShell 5.1 and the current release is in fact the intended behaviour feel free to close this issue.

I'm not sure if this is intended behaviour or just an unexpected artifact of the dotnet core migration; maybe they decided on purpose to ignore the BOM, or maybe it's an unexpected side effect of some other change there.

jborean93 avatar Jun 13 '22 20:06 jborean93

I think it is preferable to provide as a "marker"

Unfortunately the code that is reading from the stdout on the new process doesn't utilise the BOM at all.

When I saw your discussion, I burst into tears. "No one defines the encoding standard of PowerShell's standard input and standard output" ----- I think this is PowerShell's biggest mistake. A "BOM2", similar to the BOM header, should be defined to standardize this.

Let me share my thoughts:

1. Design a "BOM2" as a marker header. PowerShell would automatically recognise the marker, translate the data into the target encoding, and display it in the current console.

2. The solution must be able to display emoji on Windows without relying on Windows Terminal.

3. Q: Which encoding scheme should be used? A: I recommend UTF-8 only, because UTF-16 can only use 65535 characters, and UTF-32 is effectively unused and should be considered obsolete.

4. Q: Should multiple encodings be supported? A: I recommend: no.

5. Q: If multiple encodings are supported, should Base64 encoding and decoding be the default? A: I suggest: yes.

The ultimate purpose of all this: support emoji, unify encoding, and eliminate garbled characters.


This is the workaround I think works so far: Base64 encode and decode.

#!/usr/bin/env python3
import sys
import base64

aaa = 'write-host "中文"'
bbb = base64.b64encode(aaa.encode("utf_16_le"))
sys.stdout.buffer.write(bbb)
#$base64_cmd = 'dwByAGkAdABlAC0AaABvAHMAdAAgACIALU6HZSIA'
#powershell.exe -EncodedCommand $base64_cmd
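For what it's worth, the Base64 payload in that snippet does round-trip: -EncodedCommand expects Base64 over UTF-16-LE, and decoding the string recovers the original command. A quick check in Python:

```python
import base64

# The payload from the snippet above: Base64 over UTF-16-LE,
# the format PowerShell's -EncodedCommand expects.
b64 = "dwByAGkAdABlAC0AaABvAHMAdAAgACIALU6HZSIA"
command = base64.b64decode(b64).decode("utf_16_le")
print(command)  # write-host "中文"
```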

There is also a temporary workaround: write the data, with a BOM header, to a file, and let PowerShell read the file itself.

kasini3000 avatar Jun 14 '22 03:06 kasini3000

Sorry @kasini3000 but that doesn't make sense to me.

I'm not arguing for or against any BOMs or headers.

There is a difference in behaviour between version 5.1 and the current release. I happen to rely on the 5.1 behaviour to get around some encoding issues, but that's beside the point. All I'm saying is that there is a discrepancy and this might be a bug.

Emojis have nothing to do with it. It is not true that UTF-16 can only represent 65535 characters, and I'm not sure what Base64 encoding is supposed to accomplish.

Yes, you are right: you can always write to a file and then read the file back, circumventing the pipe. But passing data through pipes is what this issue is about.


I think the original title of this issue was a bad choice because it emphasised the failure to parse rather than the difference between versions. I have edited the title hoping to prevent future misunderstandings.

lclutz avatar Jun 14 '22 13:06 lclutz

I think the original title of this issue was a bad choice because it emphasised the failure to parse rather than the difference between versions.

~~The encoding difference is documented in Differences between Windows PowerShell 5.1 and PowerShell 7.x.~~ Unrelated

SeeminglyScience avatar Jun 14 '22 15:06 SeeminglyScience

The encoding difference is documented in Differences between Windows PowerShell 5.1 and PowerShell 7.x.

I disagree. Correct me if I'm wrong but $OutputEncoding is a setting that controls how the output of cmdlets will be encoded.

The relevant encoding setting in this case would be [System.Console]::OutputEncoding which controls which encoding PowerShell will expect from native applications.

That setting did not change its default value as far as I'm aware. On my machine it defaults in both versions to ibm850 (this is region specific iirc, it might be a different value for you but it'll be the same value in either PowerShell version).

lclutz avatar Jun 14 '22 16:06 lclutz

I disagree. Correct me if I'm wrong but $OutputEncoding is a setting that controls how the output of cmdlets will be encoded.

Oops you're right it doesn't apply here. That controls the encoding of objects piped to native commands.

SeeminglyScience avatar Jun 14 '22 16:06 SeeminglyScience

I disagree. Correct me if I'm wrong but $OutputEncoding is a setting that controls how the output of cmdlets will be encoded.

Just to clarify: $OutputEncoding controls the encoding PowerShell uses to encode strings (or, for non-strings, the stringified form of the object) being piped into native applications through stdin. That is how PowerShell will encode "my data" | my.exe. Maybe that's what you meant, but I wasn't fully sure and thought it best to mention.

jborean93 avatar Jun 14 '22 18:06 jborean93

Just to clarify: $OutputEncoding controls the encoding PowerShell uses to encode strings (or, for non-strings, the stringified form of the object) being piped into native applications through stdin. That is how PowerShell will encode "my data" | my.exe. Maybe that's what you meant, but I wasn't fully sure and thought it best to mention.

Thanks for clarifying. I worded it poorly you are completely right.

lclutz avatar Jun 14 '22 19:06 lclutz

Like the BOM: a great definition despite the initial denials, and now the de facto standard.

Come on, hero! We need you!

Come here and define specifications for PowerShell standard input and standard output.

"my string eg : ⭐🌛" | my.exe ---- Emojis have nothing to do with it?

kasini3000 avatar Jun 15 '22 13:06 kasini3000

Like the BOM: a great definition despite the initial denials, and now the de facto standard.

I don't know what you mean by that. It is not at all common for UTF-8-encoded data to include a BOM. Even PowerShell changed its default encoding for the Out-File cmdlet to UTF-8 without BOM in later versions.

If you want to argue that it would be nice to have a marker indicating the encoding of the file that's fine but this issue is not about text encoding in general - that's a much larger topic and the PowerShell issue tracker probably wouldn't be the place I'd raise it.

"my string eg : ⭐🌛" | my.exe ----Emojis have nothing to do with it?

Whether the payload text includes emojis or not doesn't matter for what I'm talking about.

Look, I'm not trying to say your concerns are not valid. I'd recommend you open issues about those concerns in the relevant places so they can be addressed individually. But I would appreciate it if you could keep the discussion here on topic without broadening the scope.

lclutz avatar Jun 15 '22 21:06 lclutz

PowerShell relies on System.Diagnostics.Process to parse the output from a pipe and in the absence of an explicit setting of the Standard*Encoding property in the start info it will rely on the global setting of Console.OutputEncoding to determine what encoding is used.

Yes, it seems it is a regression in .NET. It makes sense to create a simple repro in C# and open a new issue in the .NET Runtime repository. I guess the regression is in the internal AsyncStreamReader class.

iSazonov avatar Jun 16 '22 11:06 iSazonov

#10824 might be related

SeeminglyScience avatar Jun 16 '22 15:06 SeeminglyScience

I investigated this issue some more, trying to understand the discrepancy in behavior between WinPS 5.1 and PS 7.2. It turns out this is because PowerShell 7 changed how it reads a process's standard output:

  • WinPS uses Process.StandardOutput, which is a StreamReader.

  • PS 7 uses the OutputDataReceived event handler.

When using OutputDataReceived, the behavior is the same on both .NET and .NET Framework: the BOM is NOT respected, and Console.OutputEncoding is used to decode the bytes into a string. When using Process.StandardOutput, the behavior is also the same on both .NET and .NET Framework: the BOM is respected by the StreamReader.

Switching to OutputDataReceived was part of a big change back in 2016 to support streaming behavior in a pipeline of native commands. I will let the Engine Group review this issue again.
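The two code paths can be sketched with a Python analogy (not the actual .NET implementation): a BOM-aware text reader consumes the BOM, as StreamReader does, while decoding raw chunks with a fixed encoding leaves the BOM in the text, as on the OutputDataReceived path.

```python
import io

data = "äöüß αβγδ\n".encode("utf_8_sig")

# StreamReader-style: a BOM-aware reader detects and strips the BOM.
bom_aware = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8-sig").read()
print(repr(bom_aware))  # no U+FEFF at the start

# OutputDataReceived-style: chunks are decoded with a fixed encoding;
# even when that encoding is UTF-8, the BOM survives as literal text.
fixed = data.decode("utf-8")
print(repr(fixed))      # starts with '\ufeff'
```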

daxian-dbw avatar Aug 03 '22 22:08 daxian-dbw

Related #1908

iSazonov avatar Aug 04 '22 04:08 iSazonov

I'm not arguing for or against any BOMs or headers.

I would argue strongly against them in UTF-8 files, for the reasons given in this thread:

https://stackoverflow.com/a/2223926

  • Unicode standard recommends against it

  • Illegal in JSON

  • Corrupts detection of shebang (#!) in shell scripts

  • Complicates binary level issues

    • Empty files no longer 0 bytes long
    • Pure ASCII files no longer all < 128 byte value
    • Raw concatenation puts BOM at each concatenation point
  • Conflates with other text formats and hence cannot be used unambiguously for identification

Stack Overflow
What's different between UTF-8 and UTF-8 without a BOM? Which is better?

AE1020 avatar Aug 04 '22 19:08 AE1020

@daxian-dbw Thanks for the technical details. @AE1020 You can hate the BOM, and you can live without it, but you still need a BOM-like mechanism. Do you have such a mechanism now?

Defining the mechanism is easy, but getting the command-line world to obey it is hard. No matter how you define it, many people will oppose it, and many programs will be incompatible.

vi respects the BOM.

Have you ever thought deeply about how to make many command-line programs encoding-compatible? What would you advise?

kasini3000 avatar Aug 05 '22 05:08 kasini3000

Personally I think just having a runspace-specific way to define the encoding used to process native command output will be good enough. [Console]::OutputEncoding works, but it's a process-wide setting, so it can get very messy once you start dealing with multiple runspaces in the same process.

Nothing is going to be foolproof, console encoding is difficult as every native command uses their own rules. Some respect the console codepage, others hardcode to UTF-8, some use UTF-16-LE, and there are also the ones that hardcode to something different altogether. Keeping a sane default (console encoding) but with a safe way to override in the Runspace fits the bill nicely IMO.

jborean93 avatar Aug 05 '22 05:08 jborean93

Have you ever thought deeply about how to make many command-line programs encoding-compatible? What would you advise?

@kasini3000 I would have to know the specifics of particular situations to advise how a given program could accomplish its goal without a BOM.

But as a general rule: whenever a system feels a signal is needed directly inside a file, the basic approach would be to store that information aside in a table or manifest elsewhere, or to pass it on the command line. The program would consult the table of files (or file patterns) for this information.

You certainly would not want a program like git to ask you to go around editing byte patterns into your files, to say whether they should be ignored or not. That's why there is a .gitignore file out-of-band, it is consulted to get the information required.

Programs that want more information about your generalized files (or data streams) than they can guess should not make you edit that data directly. The data was fit for its purpose as is, and in the BOM case it is documented and understood that many things become unfit by virtue of the modification.

AE1020 avatar Aug 05 '22 08:08 AE1020

@jborean93

# UTF8Encoding is used instead of the UTF8 field to set a BOM-less encoding for writing to stdin
[Console]::OutputEncoding = [Console]::OutputEncoding = $OutputEncoding = [System.Text.UTF8Encoding]::new($false)

I found this very helpful code snippet (thank you for that) and just want to ask: is there a typo in it? Because of:

[Console]::OutputEncoding = [Console]::OutputEncoding

Shouldn't it be like this?

[Console]::InputEncoding = [Console]::OutputEncoding = $OutputEncoding = [System.Text.UTF8Encoding]::new($false)

Sorry if I'm wrong. I'm not an expert in PowerShell. ;)

ding-dang-do avatar Oct 31 '22 08:10 ding-dang-do

This issue has not had any activity in 6 months, if this is a bug please try to reproduce on the latest version of PowerShell and reopen a new issue and reference this issue if this is still a blocker for you.

This issue has been marked as "No Activity" as there has been no activity for 6 months. It has been closed for housekeeping purposes.