
parse_fragment does not parse whitespace in HTML (or XML) text properly

Open calimeroteknik opened this issue 3 years ago • 0 comments

Description

parse_fragment does not normalize whitespace in HTML (or XML) text: runs of whitespace are kept as-is when they should be collapsed.

To Reproduce

Steps to reproduce the behavior:

  • Using Floki v0.33.1
  • Using Elixir v1.13.2
  • Using Erlang OTP 24.3.2 [erts-12.3]
  • With this code:
      Floki.parse_document("<!DOCTYPE html>\n<html>\n\t<head>\n\t\t<title> \t&#110;&#111;&#116;&#104;&#105;&#110;&#103;\t\n\t\t\t &#116;&#111;\n&#115;&#101;&#101;  &#104;&#101;&#114;&#101;&#44;&#32;&#119;&#111;&#114;&#107;&#105;&#110;&#103;&#32;&#112;&#114;&#111;&#112;&#101;&#114;&#108;&#121; \n\n\t\t</title>\n\t</head>\n\t<body>\n\t</body>\n</html>\n")
      |> Rustic.Result.map_err(fn reason -> {:invalid_html, reason} end)
      |> Rustic.Result.and_then(fn doc ->
        doc
        |> Floki.find("head > title")
        |> Enum.take(1)
        |> Floki.text()
        |> Floki.HTMLParser.parse_fragment()
      end)

    I get the following output:

      {:ok, [" \tnothing\t\n\t\t\t to\nsee  here, working properly \n\n\t\t"]}
    

Expected behavior

The following output:

{:ok, [" nothing to see here, working properly "]}

(I think the leading and trailing whitespace must not be trimmed but, like every other run, collapsed to a single space; this needs checking against the standards.)
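
To make the expected collapsing concrete, here is a minimal sketch in plain Elixir, with no Floki calls. The module name `WhitespaceSketch` and the function `collapse/1` are hypothetical, and the sketch assumes that "whitespace" means the five characters the HTML standard lists as ASCII whitespace (tab, line feed, form feed, carriage return, space). It is not a claim about how Floki works or should be fixed, only a possible post-processing step on the string returned by `Floki.text()`:

```elixir
defmodule WhitespaceSketch do
  # Hypothetical helper, not part of Floki.
  # Replaces every run of HTML ASCII whitespace (tab, LF, FF, CR, space)
  # with a single space. Leading and trailing runs are collapsed, not trimmed.
  def collapse(text) when is_binary(text) do
    String.replace(text, ~r/[\t\n\f\r ]+/, " ")
  end
end

" \tnothing\t\n\t\t\t to\nsee  here, working properly \n\n\t\t"
|> WhitespaceSketch.collapse()
|> IO.inspect()
# prints " nothing to see here, working properly "
```

Applied to the text from the reproduction above, this yields the string shown under "Expected behavior".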

Test file (HTML): floki-test.html.txt

calimeroteknik, Sep 18 '22 02:09