parse_fragment does not parse whitespace in HTML (or XML) text properly
Description
parse_fragment does not handle whitespace in HTML (or XML) text properly: runs of spaces, tabs and newlines are kept as-is when they should be collapsed.
To Reproduce
Steps to reproduce the behavior:
- Using Floki v0.33.1
- Using Elixir v1.13.2
- Using Erlang OTP 24.3.2 [erts-12.3]
- With this code:
Floki.parse_document("<!DOCTYPE html>\n<html>\n\t<head>\n\t\t<title> \tnothing\t\n\t\t\t to\nsee here, working properly \n\n\t\t</title>\n\t</head>\n\t<body>\n\t</body>\n</html>\n")
|> Rustic.Result.map_err(fn reason -> {:invalid_html, reason} end)
|> Rustic.Result.and_then(fn doc ->
  data = doc
  |> Floki.find("head > title")
  |> Enum.take(1)
  |> Floki.text()
  |> Floki.HTMLParser.parse_fragment()
end)
I get the following output:
{:ok, [" \tnothing\t\n\t\t\t to\nsee here, working properly \n\n\t\t"]}
Expected behavior
The following output:
{:ok, [" nothing to see here, working properly "]}
(I think the leading and trailing whitespace should not be trimmed, but, like the other runs, each should be collapsed to a single space; this may need checking against the standards.)
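To make the expected result concrete, here is a minimal sketch of the collapsing I have in mind. It assumes that "whitespace" means the ASCII whitespace characters (space, tab, line feed, form feed, carriage return) and that every run of them becomes a single space without trimming. WhitespaceSketch and collapse/1 are hypothetical names used for illustration only; they are not part of Floki.

```elixir
defmodule WhitespaceSketch do
  # Hypothetical helper, NOT part of Floki's API.
  # Replaces each run of ASCII whitespace with one space.
  # Leading and trailing runs are collapsed, not trimmed.
  def collapse(text) when is_binary(text) do
    String.replace(text, ~r/[ \t\n\f\r]+/, " ")
  end
end

title_text = " \tnothing\t\n\t\t\t to\nsee here, working properly \n\n\t\t"

IO.inspect(WhitespaceSketch.collapse(title_text))
# prints " nothing to see here, working properly "
```

Applied to the title text from the reproduction, this gives the string shown in the expected output above. Whether the parser itself or a later step should do this is a separate question.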
Test file (HTML): floki-test.html.txt