html-agility-pack

&nbsp; is not removed from the InnerText

Open FarshanAhamed opened this issue 2 years ago • 3 comments

1. Description

I'm trying to strip HTML tags and attributes from a text. Most of the tags are removed, but &nbsp; stays in the text.

3. Fiddle or Project

https://dotnetfiddle.net/haBumr

public static string StripHtmlTags(this string input)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(input ?? "");
    return doc.DocumentNode.InnerText;
}

Input text: <p>This is a test string.&nbsp;</p>
Output text: This is a test string.&nbsp;

Is there any way I can get the text as I see in a browser?

  • HAP version: 1.11.42
  • NET version (.net core 2.2, .net core 3.1, etc.)

FarshanAhamed, Apr 21 '22 12:04

See the "Decode and strip HTML" example over here: https://html-agility-pack.net/online-examples

However, contrary to that example code, I would strongly suggest doing the entity decoding after getting the inner text, not before loading the HTML data into HtmlAgilityPack.

elgonzo, Apr 21 '22 14:04
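
The suggestion above could be sketched roughly as follows. This is a minimal sketch, not the maintainers' official answer: it assumes HtmlEntity.DeEntitize from HtmlAgilityPack is an acceptable decoder, and the class name HtmlTextExtensions is made up for illustration. Decoding after InnerText presumably matters because text such as &lt;b&gt; would otherwise turn into a real tag and be stripped by the parser.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical container class; extension methods must live in a static class.
public static class HtmlTextExtensions
{
    // Strips tags first, then decodes entities such as &nbsp; or &amp;.
    public static string StripHtmlTags(this string input)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(input ?? "");
        // InnerText still contains raw entities, so decode them afterwards.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }
}

public static class Program
{
    public static void Main()
    {
        var text = "<p>This is a test string.&nbsp;</p>".StripHtmlTags();
        Console.WriteLine(text);
    }
}
```

Note that a decoded &nbsp; should come back as the non-breaking space character (U+00A0) rather than a regular space, so callers who want ordinary spaces may still need something like text.Replace('\u00A0', ' ').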

Great. I had already figured out the HTML-decode approach earlier, but I thought there might be a way to make InnerText decode the HTML by providing some flag while loading it. Thank you for your help.

FarshanAhamed, Apr 29 '22 13:04

How do I decode when using LoadFromWebAsync?

snowchenlei, Jul 12 '22 01:07
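
This question has no reply in the thread. One possible approach, offered only as a sketch, is to apply the same decode-after-InnerText idea to the document returned by HtmlWeb.LoadFromWebAsync. The helper name ToPlainText and the URL are hypothetical, and HtmlEntity.DeEntitize is again assumed to be a suitable decoder.

```csharp
using System;
using System.Threading.Tasks;
using HtmlAgilityPack;

public static class PlainText
{
    // Hypothetical helper: takes an already loaded document and returns
    // its text with HTML entities decoded.
    public static string ToPlainText(HtmlDocument doc)
    {
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }
}

public static class Program
{
    public static async Task Main()
    {
        var web = new HtmlWeb();
        // Placeholder URL, used only for illustration.
        HtmlDocument doc = await web.LoadFromWebAsync("https://example.com/");
        Console.WriteLine(PlainText.ToPlainText(doc));
    }
}
```

For a full page, InnerText of the document node may also include text from elements such as script or style, so selecting a narrower node before calling the helper may give cleaner output.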