
ConvertFrom-Html parses special characters as question marks

Open dominikduennebacke opened this issue 4 years ago • 2 comments

Hi there. Really appreciate this module using PowerShell Core. Thank you for your work!

While scraping some European websites I ran into an issue with special characters such as ü, ä, ö, é and ß. ConvertFrom-Html does not handle these characters and parses them as question marks. It seems to be related to the encoding, which cannot be specified through any parameter.

Any ideas how to solve this?

Example

Invoke-WebRequest content shows the "ü" character correctly

$Result = Invoke-WebRequest -Uri "https://www.compart.com/en/unicode/U+00FC"
$Result.Content -split "<" | Where-Object {$_ -like '*span class="box">*'}

>> span class="box">ü

ConvertFrom-Html parses that into "??"

$Html = ConvertFrom-Html -Content $Result
$Html.SelectNodes('//span[@class="box"]')

>> NodeType Name AttributeCount ChildNodeCount ContentLength InnerText
>> -------- ---- -------------- -------------- ------------- ---------
>> Element  span 1              1              2             ??

Response headers show the correct Content-Type with charset utf-8

$Result.Headers

>> Key             Value
>> ---             -----
>> Server          {nginx}
>> Date            {Sun, 03 Oct 2021 10:22:56 GMT}
>> Connection      {keep-alive}
>> X-Powered-By    {Express}
>> Accept-Ranges   {bytes}
>> Cache-Control   {public, max-age=0}
>> ETag            {W/"aabd-17a2d88a25f"}
>> X-Response-Time {0}
>> Vary            {Accept-Encoding}
>> Content-Type    {text/html; charset=utf-8}
>> Content-Length  {43709}
>> Last-Modified   {Mon, 21 Jun 2021 07:46:07 GMT}

Version info

$PSVersionTable

>> Name                           Value
>> ----                           -----
>> PSVersion                      7.1.3
>> PSEdition                      Core
>> GitCommitId                    7.1.3
>> OS                             Darwin 20.2.0 Darwin Kernel Version 20.2.0: Wed Dec  2 20:40:21 PST 2020; root:xnu-7195.60.75~1/RELEASE_ARM64_T8101
>> Platform                       Unix
>> PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
>> PSRemotingProtocolVersion      2.3
>> SerializationVersion           1.1.0.1
>> WSManStackVersion              3.0


Get-Module PowerHTML

>> ModuleType Version    PreRelease Name                                ExportedCommands
>> ---------- -------    ---------- ----                                ----------------
>> Script     0.1.7                 PowerHTML                           ConvertFrom-Html

dominikduennebacke avatar Oct 03 '21 10:10 dominikduennebacke

This appears to require the encoding to be modified: https://html-agility-pack.net/knowledge-base/14793650/wrong-encoding-with-html-agility-pack

I am not going to have time in the near future to patch this in, but I've flagged it as a Hacktoberfest issue, maybe someone will be interested :)

JustinGrote avatar Oct 05 '21 15:10 JustinGrote

I looked into this as a Hacktoberfest opportunity, but unfortunately for me the problem is not in HTMLAgilityPack or PowerHTML.

$Result = Invoke-WebRequest -Uri "https://www.compart.com/en/unicode/U+00FC"

In this example, $Result is an object of type Microsoft.PowerShell.Commands.WebResponseObject:

$Result -is [Microsoft.PowerShell.Commands.WebResponseObject]

When that object is passed, somewhat questionably, to the -Content parameter, it gets cast to a [string[]]. The cast is allowed and works, but for some reason this conversion is what mangles the encoding. Doing the conversion manually results in the same ?? problem:

[string]$Result
# Same result with ToString, also wrong encoding:
$Result.ToString()
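
One plausible explanation, which is an assumption and not something confirmed in this thread, is that the string conversion decodes the raw UTF-8 response bytes as ASCII. "ü" takes two bytes in UTF-8, and the .NET ASCII decoder replaces each byte above 0x7F with "?", which would match both the "??" and the ContentLength of 2 in the output above. A minimal sketch that reproduces the effect without any web request:

```powershell
# "ü" (U+00FC) is built from its code point so the example does not
# depend on how this script file itself is encoded.
$original  = [string][char]0x00FC
$utf8Bytes = [System.Text.Encoding]::UTF8.GetBytes($original)   # 0xC3 0xBC

# Decoding the UTF-8 bytes as ASCII: every byte above 0x7F becomes "?".
$asAscii = [System.Text.Encoding]::ASCII.GetString($utf8Bytes)

# Decoding the same bytes as UTF-8 round-trips the original character.
$asUtf8 = [System.Text.Encoding]::UTF8.GetString($utf8Bytes)

"Bytes: $($utf8Bytes.Count), ASCII: $asAscii, UTF-8: $asUtf8"
```

This only shows that an ASCII decode of UTF-8 bytes yields the same symptom; it does not prove that this is what the cast does internally.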

The way to get the properly encoded string from the WebResponseObject is to access its Content property:

$Html = ConvertFrom-Html -Content $Result.Content
$Html.SelectNodes('//span[@class="box"]')

The automatic lossy conversion PowerShell does here is unfortunate, but in the end it is an easy fix for the user. Maybe I could make ConvertFrom-Html accept Microsoft.PowerShell.Commands.WebResponseObject objects directly as input? Then no lossy cast to string would happen.
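
Until something like that exists in the module, the same idea can be sketched on the caller's side. The helper below is hypothetical (the name Get-HtmlText is made up for this example and is not part of PowerHTML). It unwraps the Content property when it is handed a WebResponseObject, and it assumes UTF-8 if the content turns out to be raw bytes:

```powershell
# Hypothetical helper, not part of PowerHTML: returns HTML text from
# either a web response object, a byte array, or a plain string.
function Get-HtmlText {
    [CmdletBinding()]
    [OutputType([string])]
    param(
        [Parameter(Mandatory)]
        [object]$InputObject
    )

    $value = $InputObject

    # Take the Content property instead of letting PowerShell cast the
    # whole response object to a string, which is the lossy step.
    if ($value -is [Microsoft.PowerShell.Commands.WebResponseObject]) {
        $value = $value.Content
    }

    # Assumption: raw bytes are UTF-8. A fuller version would read the
    # charset from the Content-Type response header instead.
    if ($value -is [byte[]]) {
        return [System.Text.Encoding]::UTF8.GetString($value)
    }

    return [string]$value
}
```

With that in place, the original example would become `ConvertFrom-Html -Content (Get-HtmlText -InputObject $Result)`.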

jantari avatar Oct 12 '21 20:10 jantari