ConvertFrom-Html parses special characters as question marks
Hi there. Really appreciate this module using PowerShell Core. Thank you for your work!
Scraping some European websites I came across an issue in regards to special characters, like ü, ä, ö, é, ß, etc. Somehow ConvertFrom-Html cannot handle these characters and parses them as question marks. It seems to be related to the encoding which cannot be specified by any parameter.
Any ideas how to solve this?
Example
Invoke-WebRequest content show the "ü" character correctly
$Result = Invoke-WebRequest -Uri "https://www.compart.com/en/unicode/U+00FC"
$Result.Content -split "<" | Where-Object {$_ -like '*span class="box">*'}
>> span class="box">ü
ConvertFrom-Html parses that into "??"
$Html = ConvertFrom-Html -Content $Result
$Html.SelectNodes('//span[@class="box"]')
>> NodeType Name AttributeCount ChildNodeCount ContentLength InnerText
>> -------- ---- -------------- -------------- ------------- ---------
>> Element span 1 1 2 ??
Return headers show correct content-type utf-8
$Result.Headers
>> Key Value
>> --- -----
>> Server {nginx}
>> Date {Sun, 03 Oct 2021 10:22:56 GMT}
>> Connection {keep-alive}
>> X-Powered-By {Express}
>> Accept-Ranges {bytes}
>> Cache-Control {public, max-age=0}
>> ETag {W/"aabd-17a2d88a25f"}
>> X-Response-Time {0}
>> Vary {Accept-Encoding}
>> Content-Type {text/html; charset=utf-8}
>> Content-Length {43709}
>> Last-Modified {Mon, 21 Jun 2021 07:46:07 GMT}
Version info
$PSVersionTable
>> Name Value
>> ---- -----
>> PSVersion 7.1.3
>> PSEdition Core
>> GitCommitId 7.1.3
>> OS Darwin 20.2.0 Darwin Kernel Version 20.2.0: Wed Dec 2 20:40:21 PST 2020; root:xnu-7195.60.75~1/RELEASE_ARM64_T8101
>> Platform Unix
>> PSCompatibleVersions {1.0, 2.0, 3.0, 4.0…}
>> PSRemotingProtocolVersion 2.3
>> SerializationVersion 1.1.0.1
>> WSManStackVersion 3.0
Get-Module PowerHTML
>> ModuleType Version PreRelease Name ExportedCommands
>> ---------- ------- ---------- ---- ----------------
>> Script 0.1.7 PowerHTML ConvertFrom-Html
This appears to require the encoding to be modified: https://html-agility-pack.net/knowledge-base/14793650/wrong-encoding-with-html-agility-pack
I am not going to have time in the near future to patch this in but I've flagged it as a hacktoberfest issue, maybe someone will be interested :)
I looked into this as a Hacktoberfest opportunity, but unfortunately for me the problem is not in HTMLAgilityPack or PowerHTML.
$Result = Invoke-WebRequest -Uri "https://www.compart.com/en/unicode/U+00FC"
In this example, $Result is an object of type Microsoft.PowerShell.Commands.WebResponseObject:
$Result -is [Microsoft.PowerShell.Commands.WebResponseObject]
and when that's somewhat questionably passed to the -Content parameter it gets cast to a [string[]] which is allowed and works, however for some reason this conversion is what already mangles the encoding. Doing this manually results in the same ?? problem:
[string]$Result
# Same result with ToString, also wrong encoding:
$Result.ToString()
The way to get the proper encoded string from the WebResponseObject is to access the Content property:
$Html = ConvertFrom-Html -Content $Result.Content
$Html.SelectNodes('//span[@class="box"]')
The automatic lossy conversion PowerShell is doing here is unfortunate, but in the end an easy fix for the user.
I could make ConvertFrom-Html accept Microsoft.PowerShell.Commands.WebResponseObject objects directly for input maybe?
Then there would be no lossy casting to string happening.