Floki.parse differs when using html5ever

Open andyleclair opened this issue 4 years ago • 4 comments

Description

Floki with the default Mochiweb parser produces different output than Floki with html5ever: with html5ever, the output of Floki.parse is wrapped in <html><head></head><body>...</body></html>

To Reproduce

Steps to reproduce the behavior:

  • Using Floki v0.23.0
  • Using html5ever
  • Using Elixir v1.9.3
  • Using Erlang OTP v21.3.8.9
  • With this code:
defmodule TestCases do
  @test_cases [
    {
      ~s[<a href="javascript:alert('XSS');">Click here</a>],
      ~s[<a href="#">Click here</a>]
    },
    {
      ~s[<a href="whatever" onclick="alert('XSS');">Click here</a>],
      ~s[<a href="whatever">Click here</a>],
    },
    {
      ~s[<body onload="alert('XSS')"><p>Hello</p></body>],
      ~s[<body><p>Hello</p></body>],
    },
    {
      ~s[<img src="javascript:alert('XSS');">],
      ~s[<img src="#"/>],
    },
    {
      ~s[<script>alert('XSS');</script>],
      ~s[],
    },
    {
      ~s[<body background="javascript:alert('XSS');"><p>Hello</p></body>],
      ~s[<body background="#"><p>Hello</p></body>],
    },
    {
      ~s[<style>body { background-image: expression('alert("XSS")'); }</style>],
      ~s[<style>body { background-image: removed_by_strip_js('alert("XSS")'); }</style>],
    },
    {
      ~s[<style>body { background-image: url('javascript:alert("XSS")'); }</style>],
      ~s[<style>body { background-image: url('removed_by_strip_js:alert("XSS")'); }</style>],
    },
    {
      ~s[<style><script>alert('XSS')</script></style>],
      ~s[<style><script>alert('XSS')</script></style>],
    },
    {
      ~s[<style> h1 > a { color: red; } </style>],
      ~s[<style> h1 > a { color: red; } </style>],
    },
    {
      ~s[<],
      ~s[&lt;],
    },
    {
      ~s[>],
      ~s[&gt;],
    },
    {
      ~s[],
      ~s[],
    },
  ]

  def test_cases, do: @test_cases
end

TestCases.test_cases |> Enum.map(fn {ins, _outs} -> Floki.parse(ins) end)

[
  [
    {"html", [],
     [
       {"head", [], []},
       {"body", [],
        [{"a", [{"href", "javascript:alert('XSS');"}], ["Click here"]}]}
     ]}
  ],
  [
    {"html", [],
     [
       {"head", [], []},
       {"body", [],
        [
          {"a", [{"href", "whatever"}, {"onclick", "alert('XSS');"}],
           ["Click here"]}
        ]}
     ]}
  ],
  [
    {"html", [],
     [
       {"head", [], []},
       {"body", [{"onload", "alert('XSS')"}], [{"p", [], ["Hello"]}]}
     ]}
  ],
  [
    {"html", [],
     [
       {"head", [], []},
       {"body", [], [{"img", [{"src", "javascript:alert('XSS');"}], []}]}
     ]}
  ],
  [
    {"html", [],
     [{"head", [], [{"script", [], ["alert('XSS');"]}]}, {"body", [], []}]}
  ],
...
]

Expected behavior

I'd expect the output to match the output of calling this without the html5ever parser, namely, that it'd just be the fragments themselves.

andyleclair avatar Nov 08 '19 22:11 andyleclair

@andyleclair Thank you for opening the issue.

This is a problem that we have because we don't treat parsing fragments as something different, when we should. html5ever parses fragments as full documents because we (Floki) don't make this distinction when calling it.

I'm planning to add a Floki.parse_fragment, distinct from the standard Floki.parse, because the HTML spec treats them as different algorithms; with this we can call the correct functions on html5ever's side.

This should be fixed once I finish the work on the internal parser (#204).

philss avatar Nov 14 '19 21:11 philss

I see that this report got closed. Was there any resolution? We are currently handling the specific case of a fragment wrapped in the default wrapper, but I'd love to tear that code out.

andyleclair avatar Jan 13 '20 15:01 andyleclair
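
A workaround of the kind mentioned in the comment above could look roughly like the following sketch. It is not the commenter's actual code and not part of Floki: the module name `FragmentUnwrap` and the function `unwrap/1` are hypothetical, and the sketch assumes the wrapper always has the attribute-free html/head/body shape shown in the reported output. Note that it cannot tell a wrapped fragment apart from an input that really was a full document with the same shape.

```elixir
# Hypothetical helper, not part of Floki: strips the document wrapper that a
# full-document parse adds around a fragment.
defmodule FragmentUnwrap do
  # Matches only the exact wrapper shape from the report: <html>, <head> and
  # <body> without attributes, in that order.
  def unwrap([{"html", [], [{"head", [], head_nodes}, {"body", [], body_nodes}]}]) do
    # Nodes the parser placed in <head> (for example <script>) come first.
    head_nodes ++ body_nodes
  end

  # Anything else (attributes on <body>, or a tree that is already a
  # fragment) is returned unchanged.
  def unwrap(tree), do: tree
end

wrapped = [
  {"html", [],
   [
     {"head", [], []},
     {"body", [], [{"a", [{"href", "#"}], ["Click here"]}]}
   ]}
]

IO.inspect(FragmentUnwrap.unwrap(wrapped))
# prints [{"a", [{"href", "#"}], ["Click here"]}]
```

Leaving trees with attributes on `<body>` untouched is a deliberate choice in this sketch: in the reported output, such a `<body>` came from the input itself (the `onload` case), so dropping it would lose data.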

@andyleclair it was not fixed. It's a known issue. I kept the issue pinned in the issues list, but I will leave it open too.

philss avatar Jun 07 '20 15:06 philss

Is it really a problem in Floki? After reading the code, I'm starting to think it comes from html5ever_elixir.

Matsa59 avatar Apr 05 '23 16:04 Matsa59