PSScriptAnalyzer icon indicating copy to clipboard operation
PSScriptAnalyzer copied to clipboard

-Fix adding UTF-8 BOM to files

Open jborean93 opened this issue 4 years ago • 4 comments

Steps to reproduce

Create a file with a fixable violation

Function Test-Function {
    param (
        [String]
        $Password
    )

    $Password
}

Run Invoke-ScriptAnalyzer -Path ./file.ps1 -Fix to fix the violation

Expected behavior

The violation is fixed but the BOM isn't added to the file. It should preserve whatever is there already.

Actual behavior

The violation is fixed but there is a UTF-8 BOM character added to the file

# Before
$ hexdump -C file.ps1

00000000  46 75 6e 63 74 69 6f 6e  20 54 65 73 74 2d 46 75  |Function Test-Fu|
00000010  6e 63 74 69 6f 6e 20 7b  0a 20 20 20 20 70 61 72  |nction {.    par|
00000020  61 6d 20 28 0a 20 20 20  20 20 20 20 20 5b 53 74  |am (.        [St|
00000030  72 69 6e 67 5d 0a 20 20  20 20 20 20 20 20 24 50  |ring].        $P|
00000040  61 73 73 77 6f 72 64 0a  20 20 20 20 29 0a 0a 20  |assword.    ).. |
00000050  20 20 20 24 50 61 73 73  77 6f 72 64 0a 7d 0a     |   $Password.}.|
0000005f

# After running -Fix
# hexdump -C file.ps1

00000000  ef bb bf 46 75 6e 63 74  69 6f 6e 20 54 65 73 74  |...Function Test|
00000010  2d 46 75 6e 63 74 69 6f  6e 20 7b 0a 20 20 20 20  |-Function {.    |
00000020  70 61 72 61 6d 20 28 0a  20 20 20 20 20 20 20 20  |param (.        |
00000030  5b 53 65 63 75 72 65 53  74 72 69 6e 67 5d 0a 20  |[SecureString]. |
00000040  20 20 20 20 20 20 20 24  50 61 73 73 77 6f 72 64  |       $Password|
00000050  0a 20 20 20 20 29 0a 0a  20 20 20 20 24 50 61 73  |.    )..    $Pas|
00000060  73 77 6f 72 64 0a 7d 0a                           |sword.}.|
00000068

Note the ef bb bf (UTF-8 BOM) present. I tried explicitly ignoring the rule UseBOMForUnicodeEncodedFile as well but the BOM is still added.

Environment data

> $PSVersionTable

Name                           Value
----                           -----
PSVersion                      7.2.0
PSEdition                      Core
GitCommitId                    7.2.0
OS                             Linux 5.14.16-201.fc34.x86_64 #1 SMP Wed Nov 3 13:57:29 UTC 2021
Platform                       Unix
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0

> (Get-Module -ListAvailable PSScriptAnalyzer).Version | ForEach-Object { $_.ToString() }

1.20.0

jborean93 avatar Nov 11 '21 01:11 jborean93

This likely won't do anything, but, what if you set the environment's encoding to non-BOM UTF8?

$global:OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = [System.Text.UTF8Encoding]::new()

ninmonkey avatar Nov 13 '21 18:11 ninmonkey

Thanks @jborean93, we were able to successful reproduce this issue and identify the bug. We appreciate you taking the time to reproduce this bug, we are unsure when we will be able to fix this issue.

StevenBucher98 avatar Dec 07 '21 23:12 StevenBucher98

Because -Fix reads the content of the file into a string, modifies it and then writes back to the string, the issue could either be in

  • The .NET APIs used to read/write interpret incorrect encoding, I've tried different variations when I implemented it as I've seen this behaviour and chose the set of API overloads where most files preserved their encoding
  • The formatter engine could be at fault because of the way it modifies the in-memory string

This is a very tricky and time consuming one. Bear in mind that there is no standard for encoding and both .NET and editors apply heuristics to figure out the encoding of a file and therefore can get it wrong. My recommendation is to check the encoding in the diff editor before commiting files after using -Fix as this parameter is usually just used for one off fixes of a whole codebase.

bergmeister avatar Dec 15 '21 21:12 bergmeister

I personally don't think the onus should be on the user to detect the BOM that was added. If the file doesn't have a BOM then the formatter shouldn't be adding one. I'm unsure what the level of support PSSA has for WinPS but considering the default of PowerShell since v6 is to read BOM-less files as UTF-8 then I personally believe PSSA should output as UTF-8 if the input didn't have a BOM.

Putting aside the issues with detecting the encoding, scripts that only contain ASCII characters won't be affected by reading the script as 1 encoding and writing it as UTF-8 as the bytes will be all the same. Having a mismatch set of encoding values only affects non-ASCII characters. Even if WinPS was still a concern I honestly think having a default of UTF-8 as the output when no BOM was detected on the input will help encourage users to have a UTF-8 + BOM file in the first place. Relying on the default locale encoding is very brittle, especially when it comes to sharing scripts as the default could differ across hosts.

jborean93 avatar Dec 15 '21 21:12 jborean93