PSScriptAnalyzer
-Fix adding UTF-8 BOM to files
Steps to reproduce
Create a file with a fixable violation
Function Test-Function {
    param (
        [String]
        $Password
    )

    $Password
}
Run Invoke-ScriptAnalyzer -Path ./file.ps1 -Fix to fix the violation
Expected behavior
The violation is fixed but no BOM is added to the file; whatever encoding is already there should be preserved.
Actual behavior
The violation is fixed, but a UTF-8 BOM is added to the file.
# Before
$ hexdump -C file.ps1
00000000 46 75 6e 63 74 69 6f 6e 20 54 65 73 74 2d 46 75 |Function Test-Fu|
00000010 6e 63 74 69 6f 6e 20 7b 0a 20 20 20 20 70 61 72 |nction {. par|
00000020 61 6d 20 28 0a 20 20 20 20 20 20 20 20 5b 53 74 |am (. [St|
00000030 72 69 6e 67 5d 0a 20 20 20 20 20 20 20 20 24 50 |ring]. $P|
00000040 61 73 73 77 6f 72 64 0a 20 20 20 20 29 0a 0a 20 |assword. ).. |
00000050 20 20 20 24 50 61 73 73 77 6f 72 64 0a 7d 0a | $Password.}.|
0000005f
# After running -Fix
$ hexdump -C file.ps1
00000000 ef bb bf 46 75 6e 63 74 69 6f 6e 20 54 65 73 74 |...Function Test|
00000010 2d 46 75 6e 63 74 69 6f 6e 20 7b 0a 20 20 20 20 |-Function {. |
00000020 70 61 72 61 6d 20 28 0a 20 20 20 20 20 20 20 20 |param (. |
00000030 5b 53 65 63 75 72 65 53 74 72 69 6e 67 5d 0a 20 |[SecureString]. |
00000040 20 20 20 20 20 20 20 24 50 61 73 73 77 6f 72 64 | $Password|
00000050 0a 20 20 20 20 29 0a 0a 20 20 20 20 24 50 61 73 |. ).. $Pas|
00000060 73 77 6f 72 64 0a 7d 0a |sword.}.|
00000068
Note the ef bb bf (UTF-8 BOM) now present. I also tried explicitly excluding the UseBOMForUnicodeEncodedFile rule, but the BOM is still added.
Environment data
> $PSVersionTable
Name                           Value
----                           -----
PSVersion                      7.2.0
PSEdition                      Core
GitCommitId                    7.2.0
OS                             Linux 5.14.16-201.fc34.x86_64 #1 SMP Wed Nov 3 13:57:29 UTC 2021
Platform                       Unix
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0
> (Get-Module -ListAvailable PSScriptAnalyzer).Version | ForEach-Object { $_.ToString() }
1.20.0
This likely won't do anything, but what if you set the environment's encoding to BOM-less UTF-8?
$global:OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
Thanks @jborean93, we were able to successfully reproduce this issue and identify the bug. We appreciate you taking the time to report it; however, we are unsure when we will be able to fix it.
Because -Fix reads the content of the file into a string, modifies it, and then writes it back to the file, the issue could be in either:
- The .NET APIs used to read/write the file, which may detect the encoding incorrectly. When I implemented this I tried different variations, as I had seen this behaviour before, and chose the set of API overloads where most files preserved their encoding.
- The formatter engine, which could be at fault because of the way it modifies the in-memory string.
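If it is the write APIs, one classic .NET pitfall is that the framework exposes two UTF-8 encodings with different BOM behaviour. The sketch below is illustrative only, not PSSA's actual code: the static Encoding.UTF8 instance carries a BOM preamble, while a UTF8Encoding constructed with $false does not.

```powershell
# Illustration only (not PSSA's implementation): two UTF-8 encodings,
# one with a BOM preamble and one without.
$text = 'Function Test-Function {}'

# The static UTF8 property writes an EF BB BF preamble.
[System.IO.File]::WriteAllText("$PWD/with-bom.ps1", $text, [System.Text.Encoding]::UTF8)

# A UTF8Encoding constructed with $false writes no BOM.
[System.IO.File]::WriteAllText("$PWD/no-bom.ps1", $text, [System.Text.UTF8Encoding]::new($false))

# Inspect the first bytes: with-bom.ps1 starts with EF BB BF,
# no-bom.ps1 starts with the text itself.
Format-Hex -Path ./with-bom.ps1
Format-Hex -Path ./no-bom.ps1
```

If the fix path ever constructs its writer from Encoding.UTF8 (or an equivalent BOM-bearing encoding), every fixed file would gain a BOM regardless of its input.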
This is a very tricky and time-consuming one. Bear in mind that there is no standard way to declare a file's encoding; both .NET and editors apply heuristics to figure it out and can therefore get it wrong. My recommendation is to check the encoding in the diff editor before committing files after using -Fix, as this parameter is usually used only for one-off fixes across a whole codebase.
I personally don't think the onus should be on the user to detect the BOM that was added. If the file doesn't have a BOM, the formatter shouldn't be adding one. I'm unsure what level of support PSSA has for Windows PowerShell, but considering that PowerShell since v6 reads BOM-less files as UTF-8, I believe PSSA should output BOM-less UTF-8 if the input didn't have a BOM.
Putting aside the issues with detecting the encoding, scripts that contain only ASCII characters won't be affected by reading the script as one encoding and writing it as UTF-8, since the bytes will all be the same; a mismatched pair of encodings only affects non-ASCII characters. Even if WinPS were still a concern, I honestly think defaulting the output to BOM-less UTF-8 when no BOM was detected on the input would help encourage users who need a BOM to save a UTF-8 + BOM file in the first place. Relying on the default locale encoding is very brittle, especially when sharing scripts, as the default can differ across hosts.
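One possible shape for such a fix, as a hedged sketch rather than PSSA's actual implementation ($path and the "apply fixes" step are placeholders):

```powershell
# Sketch: preserve the input file's BOM state when writing fixed content back.
$path = './file.ps1'

# Check the raw bytes for a UTF-8 BOM before any string conversion.
$bytes = [System.IO.File]::ReadAllBytes($path)
$hasBom = $bytes.Length -ge 3 -and
    $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF

$content = [System.IO.File]::ReadAllText($path)   # ReadAllText strips any BOM
# ... apply fixes to $content here ...

# UTF8Encoding's constructor flag controls whether a BOM is written,
# so the output matches whatever the input had.
$encoding = [System.Text.UTF8Encoding]::new($hasBom)
[System.IO.File]::WriteAllText($path, $content, $encoding)
```

This keeps BOM-less files BOM-less and BOM-ed files BOM-ed, without trying to guess the locale encoding at all.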