nokogiri Java version of sax parser doesn't care about the order of text nodes

Open andrew-aladev opened this issue 8 years ago • 3 comments

require "nokogiri"

class Doc < Nokogiri::XML::SAX::Document
  def start_element name, attrs = []
    puts "start: \"#{name}\""
  end

  def characters string
    puts "characters: \"#{string}\""
  end

  def end_element name
    puts "end: \"#{name}\""
  end
end

Nokogiri::HTML::SAX::Parser.new(Doc.new).parse "
<html>
  <head>
    <title>title</title>
  </head>
  <body>
    <p>text 1</p> text 2
    <p>text 3</p> text 4
  </body>
</html>
"

MRI Ruby result:

start: "html"
start: "head"
start: "title"
characters: "title"
end: "title"
end: "head"
start: "body"
characters: "
    "
start: "p"
characters: "text 1"
end: "p"
characters: " text 2
    "
start: "p"
characters: "text 3"
end: "p"
characters: " text 4
  "
end: "body"
end: "html"

text 1 -> text 2 -> text 3 -> text 4

JRuby result:

start: "html"
start: "head"
start: "title"
characters: "title"
end: "title"
characters: "
    
  "
end: "head"
start: "body"
start: "p"
characters: "text 1"
end: "p"
start: "p"
characters: "text 3"
end: "p"
characters: "
     text 2
     text 4
  
"
end: "body"
characters: ""
end: "html"

text 1 -> text 3 -> text 2 -> text 4
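The two summary lines above can be reproduced mechanically from the `characters` payloads. The following is only a minimal sketch: `text_order` is a hypothetical helper (not part of Nokogiri), and the payload arrays are transcribed by hand from the outputs above, so their exact whitespace is an approximation.

```ruby
# Hypothetical helper (not part of Nokogiri): reduce the payloads of
# `characters` callbacks to the order of the visible text fragments.
def text_order(chunks)
  chunks
    .flat_map { |chunk| chunk.split("\n") } # one payload may span several lines
    .map(&:strip)                           # drop indentation
    .reject(&:empty?)                       # ignore whitespace-only pieces
    .join(" -> ")
end

# Payloads inside <body>, transcribed from the two outputs above
# (assumption: whitespace is approximate, the "title" payload is left out).
mri_chunks   = ["\n    ", "text 1", " text 2\n    ", "text 3", " text 4\n  "]
jruby_chunks = ["\n    \n  ", "text 1", "text 3", "\n     text 2\n     text 4\n  \n", ""]

puts text_order(mri_chunks)   # text 1 -> text 2 -> text 3 -> text 4
puts text_order(jruby_chunks) # text 1 -> text 3 -> text 2 -> text 4
```

Comparing the two strings makes the reordering visible without reading the full event log.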

Dec 16 '16 18:12 andrew-aladev

I will submit a pull request next weekend.

Dec 27 '16 10:12 andrew-aladev

@andrew-aladev before you put together a PR (which I see you've already done), I'd like to understand the nature of the problem. Presuming you've diagnosed the issue, can you help me understand what it is?

Dec 27 '16 20:12 flavorjones

@flavorjones, yes, sure. My English is not very good, sorry for the inconvenience.

As far as I know, Nokogiri is designed to behave the same way on MRI Ruby and JRuby. I've found a case where the Java version of the SAX HTML parser behaves differently from the libxml2 SAX HTML parser.

Please take a look at the test_order test in test/html/sax/test_parser_text.rb, which I've added in this PR.

<p>
  text 1
  <span>text 2</span>
  text 3
</p>

The libxml2 SAX HTML parser produces something like:

start element p.
characters text 1.
start element span.
characters text 2.
end element span.
characters text 3.
end element p.

The Nokogiri Java SAX HTML parser produces:

start element p.
start element span.
characters text 2.
end element span.
characters text 1 text 3.
end element p.
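One way to pin this difference down in a test is to record the callbacks in arrival order and compare the whole sequence. The sketch below is an assumption-laden illustration, not the code from the PR: `EventRecorder` and `FakeHandler` are hypothetical names, and the callbacks are replayed by hand so that it runs without Nokogiri installed.

```ruby
# Hypothetical mixin (not from the PR): records SAX callbacks in arrival order.
module EventRecorder
  def events
    @events ||= []
  end

  def start_element(name, attrs = [])
    events << [:start, name]
  end

  def characters(string)
    text = string.strip
    events << [:characters, text] unless text.empty? # skip whitespace-only text
  end

  def end_element(name)
    events << [:end, name]
  end
end

# Stand-in handler so the sketch runs on its own. With Nokogiri one would
# presumably include the mixin in a Nokogiri::XML::SAX::Document subclass.
class FakeHandler
  include EventRecorder
end

handler = FakeHandler.new
# Replay the callback order reported for libxml2 above.
handler.start_element("p")
handler.characters("\n  text 1\n  ")
handler.start_element("span")
handler.characters("text 2")
handler.end_element("span")
handler.characters("\n  text 3\n")
handler.end_element("p")

expected = [
  [:start, "p"], [:characters, "text 1"],
  [:start, "span"], [:characters, "text 2"], [:end, "span"],
  [:characters, "text 3"], [:end, "p"]
]
puts handler.events == expected
```

Comparing the full event list, rather than only the concatenated text, is what catches the case where "text 1" and "text 3" are merged and delivered after the span.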

Dec 27 '16 20:12 andrew-aladev

This was fixed in v1.8.2 by @andrew-aladev's contribution at https://github.com/sparklemotion/nokogiri/pull/1676

Jul 02 '24 20:07 flavorjones
