nokogiri
Java version of sax parser doesn't care about the order of text nodes
require "nokogiri"

class Doc < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [])
    puts "start: \"#{name}\""
  end

  def characters(string)
    puts "characters: \"#{string}\""
  end

  def end_element(name)
    puts "end: \"#{name}\""
  end
end

Nokogiri::HTML::SAX::Parser.new(Doc.new).parse "
<html>
<head>
<title>title</title>
</head>
<body>
<p>text 1</p> text 2
<p>text 3</p> text 4
</body>
</html>
"
MRI Ruby result:
start: "html"
start: "head"
start: "title"
characters: "title"
end: "title"
end: "head"
start: "body"
characters: "
"
start: "p"
characters: "text 1"
end: "p"
characters: " text 2
"
start: "p"
characters: "text 3"
end: "p"
characters: " text 4
"
end: "body"
end: "html"
text 1 -> text 2 -> text 3 -> text 4
JRuby result:
start: "html"
start: "head"
start: "title"
characters: "title"
end: "title"
characters: "
"
end: "head"
start: "body"
start: "p"
characters: "text 1"
end: "p"
start: "p"
characters: "text 3"
end: "p"
characters: "
text 2
text 4
"
end: "body"
characters: ""
end: "html"
text 1 -> text 3 -> text 2 -> text 4
I am going to provide a pull request next weekend.
@andrew-aladev before you put together a PR (which I've seen you've already done), I'd like to understand the nature of the problem. Presuming you've diagnosed the issue, can you help me understand here what that issue is?
@flavorjones, yes, sure. My English is not very good, sorry for the inconvenience.
As far as I know, nokogiri is designed to behave the same way on MRI Ruby and JRuby. I've found a place where the Java version of the SAX HTML parser behaves differently from the libxml2 SAX HTML parser.
Please take a look at def test_order in test/html/sax/test_parser_text.rb, which I've added in this PR.
<p>
text 1
<span>text 2</span>
text 3
</p>
libxml2 SAX HTML parser will provide something like:
- start element p.
- characters text 1.
- start element span.
- characters text 2.
- end element span.
- characters text 3.
- end element p.
Nokogiri java SAX HTML parser will provide:
- start element p.
- start element span.
- characters text 2.
- end element span.
- characters text 1 text 3.
- end element p.
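To make the difference easier to see, the two event sequences above can be written down as data and reduced to the order in which text is reported. This is only a minimal sketch: the events are transcribed by hand from the lists above rather than captured from a parser, and the names `LIBXML2_EVENTS`, `JAVA_EVENTS` and `text_order` are hypothetical, not part of nokogiri.

```ruby
# Events transcribed by hand from the two lists above,
# as [callback, value] pairs (not captured from a real parser run).
LIBXML2_EVENTS = [
  [:start_element, "p"],
  [:characters, "text 1"],
  [:start_element, "span"],
  [:characters, "text 2"],
  [:end_element, "span"],
  [:characters, "text 3"],
  [:end_element, "p"],
].freeze

JAVA_EVENTS = [
  [:start_element, "p"],
  [:start_element, "span"],
  [:characters, "text 2"],
  [:end_element, "span"],
  [:characters, "text 1 text 3"],
  [:end_element, "p"],
].freeze

# Hypothetical helper: keep only the non-blank text chunks,
# in the order the parser reported them.
def text_order(events)
  events
    .select { |type, _| type == :characters }
    .map { |_, value| value.strip }
    .reject(&:empty?)
end

puts text_order(LIBXML2_EVENTS).inspect  # => ["text 1", "text 2", "text 3"]
puts text_order(JAVA_EVENTS).inspect     # => ["text 2", "text 1 text 3"]
```

With libxml2 the text arrives in document order. With the Java parser, "text 1" is held back and merged with "text 3" after the span has already been closed.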
This was fixed in v1.8.2 by @andrew-aladev's contribution at https://github.com/sparklemotion/nokogiri/pull/1676