
HTML Parser: Space between entity references is deleted

Open dvdkon opened this issue 4 years ago • 3 comments

When parsing HTML where two entity references are separated by a space, calling InnerText() on the containing element returns the contents without the space.

Example (FSharp.Data from Nuget, version 3.3.3):

open FSharp.Data

[<EntryPoint>]
let main argv =
    let testDoc = HtmlDocument.Parse("<html>&lt; &gt;</html>")
    printfn "%s" ((testDoc.CssSelect("html") |> Seq.head).InnerText())
    0 // an [<EntryPoint>] function must return an int exit code

Prints "<>", should print "< >". Parsing <html>&lt;&#32;&gt;</html> gives the correct result.

dvdkon avatar Oct 26 '20 11:10 dvdkon

I think it's caused by .Trim() on this line: the tokenizer reads ahead more than one char, so content ends not with ';' but with the spaces that follow it. However, I'm not familiar enough with the code to confirm that this really is the case.

dvdkon avatar Oct 26 '20 11:10 dvdkon

Actually this is the line that is eating the space:

https://github.com/fsprojects/FSharp.Data/blob/3084b8660cb3c62db435b165f3f78bace22967e8/src/Html/HtmlParser.fs#L381

InsertionMode is DefaultMode at that point, and it eats whitespace. I'm not sure if it's a bug or desired behavior. Are you sure you shouldn't be using &nbsp; or &#32; to preserve the space here?
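
If the markup can't be changed at the source, one possible workaround is to rewrite the spaces as &#32; before parsing, since the original report says that form gives the correct result. A minimal sketch, assuming only spaces or tabs sitting directly between two entity references need protecting; protectSpacesBetweenEntities is a hypothetical helper, not part of FSharp.Data, and I haven't checked how the parser treats runs of several &#32; in a row:

```fsharp
open System.Text.RegularExpressions

// Hypothetical helper (not part of FSharp.Data).
// Rewrites every space or tab that sits directly between two entity
// references (e.g. "&lt; &gt;") as the numeric reference "&#32;".
let protectSpacesBetweenEntities (html: string) : string =
    Regex.Replace(
        html,
        // Lookbehind: an entity reference ends right before the whitespace.
        // Lookahead: another entity reference starts right after it.
        @"(?<=&#?\w+;)[ \t]+(?=&#?\w+;)",
        MatchEvaluator(fun m -> String.replicate m.Length "&#32;"))

// Turns "<html>&lt; &gt;</html>" into "<html>&lt;&#32;&gt;</html>"
let patched = protectSpacesBetweenEntities "<html>&lt; &gt;</html>"
```

The patched string can then be passed to HtmlDocument.Parse as in the original example. This only sidesteps the problem for entity references; it does nothing for whitespace between elements.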

jimfoye avatar Nov 17 '21 03:11 jimfoye

The same thing happens with whitespace between nested tags:

let text1 =
    HtmlNode.ParseRooted("div", "<div><span>Hello,</span> <span>World</span></div>")
    |> HtmlNode.innerText
// returns "Hello,World"

let text2 =
    HtmlNode.ParseRooted("div", "<div><span>Hello</span>, <span>World</span></div>")
    |> HtmlNode.innerText
// returns "Hello, World"

In either case, a web browser displays Hello, World. I would expect the space to be kept as is.

sonbua avatar Nov 22 '23 21:11 sonbua