FSharp.Data
HTML Parser: Space between entity references is deleted
When parsing HTML where two entity references are separated by a space, calling InnerText()
on the containing element returns the contents without the space.
Example (FSharp.Data from Nuget, version 3.3.3):
open FSharp.Data

[<EntryPoint>]
let main argv =
    // Two entity references separated by a single space.
    let testDoc = HtmlDocument.Parse("<html>&lt; &gt;</html>")
    printfn "%s" ((testDoc.CssSelect("html") |> Seq.head).InnerText())
    0
Prints "<>", should print "< >". Parsing <html>< ></html>
gives the correct result.
I think it's caused by the .Trim() call on this line: the tokenizer reads ahead more than one char, so content doesn't end with ';' but with the spaces that follow it. However, I'm not familiar enough with the code to confirm that this really is the case.
Actually this is the line that is eating the space:
https://github.com/fsprojects/FSharp.Data/blob/3084b8660cb3c62db435b165f3f78bace22967e8/src/Html/HtmlParser.fs#L381
InsertionMode is DefaultMode at that point, and it eats whitespace. I'm not sure if it's a bug or desired behavior. Are you sure you shouldn't be using a non-breaking space entity such as &nbsp; to preserve the space here?
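To illustrate the effect described above, here is a minimal toy model of a mode that discards whitespace-only text between tokens. It is only a sketch: the Token type and the render function are hypothetical names made up for this example and are not taken from HtmlParser.fs, so it shows the suspected mechanism rather than the library's actual code.

```fsharp
// Toy model only: these types and functions are hypothetical and are not
// taken from FSharp.Data's HtmlParser.fs.
type Token =
    | Entity of string   // an already decoded entity reference, e.g. "<"
    | Text of string     // a run of ordinary characters

// Concatenate token contents. When dropWhitespace is true, text runs that
// consist only of whitespace are discarded, as a whitespace-eating mode would.
let render (dropWhitespace: bool) (tokens: Token list) : string =
    tokens
    |> List.choose (fun token ->
        match token with
        | Entity s -> Some s
        | Text s when dropWhitespace && System.String.IsNullOrWhiteSpace s -> None
        | Text s -> Some s)
    |> String.concat ""

// Models the decoded content of "&lt; &gt;".
let tokens = [ Entity "<"; Text " "; Entity ">" ]

printfn "%s" (render true tokens)   // prints "<>"
printfn "%s" (render false tokens)  // prints "< >"
```

If the parser behaves like render true here, the space is lost before InnerText() ever sees it, which would match the output in the report.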
The same happens with nested tags:
let text1 =
    HtmlNode.ParseRooted("div", "<div><span>Hello,</span> <span>World</span></div>")
    |> HtmlNode.innerText
// returns "Hello,World"
let text2 =
    HtmlNode.ParseRooted("div", "<div><span>Hello</span>, <span>World</span></div>")
    |> HtmlNode.innerText
// returns "Hello, World"
In both cases a web browser displays "Hello, World". I would expect the space to be kept as is.