html-agility-pack
html-agility-pack copied to clipboard
HtmlNode.GetEncapsulatedData fails when target property is a nullable value type
1. Description
HtmlNode.GetEncapsulatedData throws an exception when properties of some nullable value type should be populated.
2. Exception
Unhandled exception. System.Exception: Unhandled Exception : Invalid cast from 'System.String' to 'System.Nullable`1[[System.Int32, System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]]'.
at HtmlAgilityPack.HtmlNode.GetEncapsulatedData(Type targetType, HtmlDocument htmlDocument) in C:\Repos\HtmlAgilityPack\HtmlAgilityPack.Shared\HtmlNode.Encapsulator.cs:line 224
at HtmlAgilityPack.HtmlNode.GetEncapsulatedData[T]() in C:\Repos\HtmlAgilityPack\HtmlAgilityPack.Shared\HtmlNode.Encapsulator.cs:line 36
3. Fiddle or Project
https://dotnetfiddle.net/6JUCcn
[HasXPath]
public class Model
{
[XPath("div", "foo")]
public int? NullableInt {get;set;}
}
string html = @"<div foo=""42"">Hello world!</div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var m = doc.DocumentNode.GetEncapsulatedData<Model>();
Console.WriteLine($"m.NullableInt = {m.NullableInt}");
4. Any further technical details
HAP 1.11.50 .NET 7 (as per dotnetfiddle demo linked above)
Even though the stacktrace of the exception doesn't reveal it [1], the exception is caused by a call of Convert.ChangeType
. According to Github code search, only the HtmlNode.Encapsulator.cs source file features Convert.ChangeType calls: https://github.com/search?q=repo%3Azzzprojects%2Fhtml-agility-pack%20Convert.ChangeType&type=code
To properly support nullable value types, calls of Convert.ChangeType could be replaced with a utility function similar to this:
static object ChangeType(string value, Type targetType)
{
var nullableUnderlyingValueType = Nullable.GetUnderlyingType(targetType);
if (nullableUnderlyingValueType == null)
{
return Convert.ChangeType(value, targetType);
}
var valueInstance = Convert.ChangeType(value, nullableUnderlyingValueType);
return Activator.CreateInstance(targetType, valueInstance);
}
(Not sure if the HAP code has otherwise issues with nullable value types, but since Convert.ChangeType famously doesn't support nullable value types, i focused on call sites of Convert.ChangeType.)
[1] Which isn't helped by the fact that the GetEncapsulatedData implementation seems to have a habit of catching exceptions and then replacing that exception with another more generic exception with less information:
https://github.com/zzzprojects/html-agility-pack/blob/2e98e144d8a89c0c0a8a8482fb6c4ee7bfaf0ec8/src/HtmlAgilityPack.Shared/HtmlNode.Encapsulator.cs#L222-L225
https://github.com/zzzprojects/html-agility-pack/blob/2e98e144d8a89c0c0a8a8482fb6c4ee7bfaf0ec8/src/HtmlAgilityPack.Shared/HtmlNode.Encapsulator.cs#L320-L323
https://github.com/zzzprojects/html-agility-pack/blob/2e98e144d8a89c0c0a8a8482fb6c4ee7bfaf0ec8/src/HtmlAgilityPack.Shared/HtmlNode.Encapsulator.cs#L347-L350
Oof... What is the point of doing that at all?
There are also many other catch clauses in the GetEncapsulatedData implementation that catch an exception and throw a new exception with a different exception message. But, except two such catch blocks, the caught original exception is then not passed as InnerException to the newly created exception. Which isn't really helping troubleshootingand maintenance efforts by zzzproject, as users reporting problems typically can only report what they do and the information they can obtain from the exceptions they see, and some such...
Additionally, I'm not sure if this is a major feature. Apart from the three places where I have seen type conversions so far, are there any other areas involved? Will it potentially affect more people since the project has not enabled the Nullable project option?
Also, some of our files are encoded in Windows-1252. Every time I make changes, to avoid disrupting others' work, I save them back as Windows-1252. This seems to be a specific encoding that certain editors use for saving files, right? It that correct to keep Windows-1252 or should we switch to UTF-8 with BOM?
Will it potentially affect more people since the project has not enabled the Nullable project option?
The <nullable>
setting in .csproj project files (and the equivalent #nullable
pragma) is about nullable reference types only. It is not about nullable value types and therefore has no impact on nor is it an indication of whether a project can or does use nullable value types.
Also, some of our files are encoded in Windows-1252. Every time I make changes, to avoid disrupting others' work, I save them back as Windows-1252. This seems to be a specific encoding that certain editors use for saving files, right? It that correct to keep Windows-1252 or should we switch to UTF-8 with BOM?
If this is about contributions / pull requests to projects you are not a core member of, then try to keep the original text encoding of the text/source file you modify. For any other text files that belong to a project or team you are part of, make this a team decision. Generally speaking, prefer UTF-8 (preferably with BOM) over vintage text encodings/code pages, unless there is a very strong reason not to...