html-agility-pack icon indicating copy to clipboard operation
html-agility-pack copied to clipboard

HtmlNode.GetEncapsulatedData fails when target property is a nullable value type

Open elgonzo opened this issue 1 year ago • 3 comments

1. Description

HtmlNode.GetEncapsulatedData throws an exception when properties of some nullable value type should be populated.

2. Exception

Unhandled exception. System.Exception: Unhandled Exception : Invalid cast from 'System.String' to 'System.Nullable`1[[System.Int32, System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]]'.
   at HtmlAgilityPack.HtmlNode.GetEncapsulatedData(Type targetType, HtmlDocument htmlDocument) in C:\Repos\HtmlAgilityPack\HtmlAgilityPack.Shared\HtmlNode.Encapsulator.cs:line 224
   at HtmlAgilityPack.HtmlNode.GetEncapsulatedData[T]() in C:\Repos\HtmlAgilityPack\HtmlAgilityPack.Shared\HtmlNode.Encapsulator.cs:line 36

3. Fiddle or Project

https://dotnetfiddle.net/6JUCcn

[HasXPath]
public class Model
{
    [XPath("div", "foo")]
    public int? NullableInt {get;set;}
}

					
string html = @"<div foo=""42"">Hello world!</div>";
		
var doc = new HtmlDocument();
doc.LoadHtml(html);
var m = doc.DocumentNode.GetEncapsulatedData<Model>();
		
Console.WriteLine($"m.NullableInt = {m.NullableInt}");

4. Any further technical details

HAP 1.11.50 .NET 7 (as per dotnetfiddle demo linked above)

elgonzo avatar Jul 25 '23 09:07 elgonzo

Even though the stacktrace of the exception doesn't reveal it [1], the exception is caused by a call of Convert.ChangeType. According to Github code search, only the HtmlNode.Encapsulator.cs source file features Convert.ChangeType calls: https://github.com/search?q=repo%3Azzzprojects%2Fhtml-agility-pack%20Convert.ChangeType&type=code

To properly support nullable value types, calls of Convert.ChangeType could be replaced with a utility function similar to this:

static object ChangeType(string value, Type targetType)
{
    var nullableUnderlyingValueType = Nullable.GetUnderlyingType(targetType);
    if (nullableUnderlyingValueType == null)
    {
        return Convert.ChangeType(value, targetType);
    }

    var valueInstance = Convert.ChangeType(value, nullableUnderlyingValueType);
    return Activator.CreateInstance(targetType, valueInstance);
}

(Not sure if the HAP code has otherwise issues with nullable value types, but since Convert.ChangeType famously doesn't support nullable value types, i focused on call sites of Convert.ChangeType.)


[1] Which isn't helped by the fact that the GetEncapsulatedData implementation seems to have a habit of catching exceptions and then replacing that exception with another more generic exception with less information:

https://github.com/zzzprojects/html-agility-pack/blob/2e98e144d8a89c0c0a8a8482fb6c4ee7bfaf0ec8/src/HtmlAgilityPack.Shared/HtmlNode.Encapsulator.cs#L222-L225

https://github.com/zzzprojects/html-agility-pack/blob/2e98e144d8a89c0c0a8a8482fb6c4ee7bfaf0ec8/src/HtmlAgilityPack.Shared/HtmlNode.Encapsulator.cs#L320-L323

https://github.com/zzzprojects/html-agility-pack/blob/2e98e144d8a89c0c0a8a8482fb6c4ee7bfaf0ec8/src/HtmlAgilityPack.Shared/HtmlNode.Encapsulator.cs#L347-L350

Oof... What is the point of doing that at all?

There are also many other catch clauses in the GetEncapsulatedData implementation that catch an exception and throw a new exception with a different exception message. But, except two such catch blocks, the caught original exception is then not passed as InnerException to the newly created exception. Which isn't really helping troubleshootingand maintenance efforts by zzzproject, as users reporting problems typically can only report what they do and the information they can obtain from the exceptions they see, and some such...

elgonzo avatar Jul 25 '23 09:07 elgonzo

Additionally, I'm not sure if this is a major feature. Apart from the three places where I have seen type conversions so far, are there any other areas involved? Will it potentially affect more people since the project has not enabled the Nullable project option?

Also, some of our files are encoded in Windows-1252. Every time I make changes, to avoid disrupting others' work, I save them back as Windows-1252. This seems to be a specific encoding that certain editors use for saving files, right? It that correct to keep Windows-1252 or should we switch to UTF-8 with BOM?

rwecho avatar Jul 26 '23 01:07 rwecho

Will it potentially affect more people since the project has not enabled the Nullable project option?

The <nullable> setting in .csproj project files (and the equivalent #nullable pragma) is about nullable reference types only. It is not about nullable value types and therefore has no impact on nor is it an indication of whether a project can or does use nullable value types.

Also, some of our files are encoded in Windows-1252. Every time I make changes, to avoid disrupting others' work, I save them back as Windows-1252. This seems to be a specific encoding that certain editors use for saving files, right? It that correct to keep Windows-1252 or should we switch to UTF-8 with BOM?

If this is about contributions / pull requests to projects you are not a core member of, then try to keep the original text encoding of the text/source file you modify. For any other text files that belong to a project or team you are part of, make this a team decision. Generally speaking, prefer UTF-8 (preferably with BOM) over vintage text encodings/code pages, unless there is a very strong reason not to...

elgonzo avatar Jul 26 '23 12:07 elgonzo