anti-xml
Stack Overflow when parsing HTML
I'm grabbing the XHTML 1.0 from this page as one big string blob:
http://en.wikipedia.org/wiki/Adams_State_College
This is on Scala 2.9.1, Windows 7, JRE 1.6. When I try to run:
var xml = XML.fromString(body)
It throws a StackOverflowError. A short segment of the stack trace looks like:
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
The rest of the stack trace looks about the same. This is a pretty typical, if slightly large, XHTML page (~450 KB), and it should not cause the XML parser to fail.
I think this was fixed in #67. Can you try loading your files using the version on master?
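If switching to master is not possible right away, a common general-purpose workaround for a stack overflow caused by deeply recursive java.util.regex matching is to run the parse on a thread with a larger stack. The following is only a minimal sketch and assumes nothing about anti-xml internals: `LargeStack` and `withLargeStack` are hypothetical helper names, not part of the library, and the stack size you pass is a hint that the JVM may round or ignore.

```scala
object LargeStack {
  /** Runs `body` on a fresh thread created with the requested stack size
    * and returns its result, rethrowing anything `body` threw.
    * Hypothetical helper for illustration; not part of anti-xml.
    */
  def withLargeStack[A](stackBytes: Long)(body: => A): A = {
    // Left holds a failure (including StackOverflowError), Right holds the result.
    var outcome: Either[Throwable, A] =
      Left(new IllegalStateException("worker did not run"))

    val worker = new Thread(null, new Runnable {
      def run(): Unit = {
        outcome = try Right(body) catch { case t: Throwable => Left(t) }
      }
    }, "large-stack-worker", stackBytes)

    worker.start()
    worker.join() // after join() returns, the write to `outcome` is visible here

    outcome match {
      case Right(value) => value
      case Left(error)  => throw error
    }
  }
}
```

With the helper in scope, the failing call could be wrapped as `LargeStack.withLargeStack(256L * 1024 * 1024) { XML.fromString(body) }`, where 256 MB is an arbitrary figure chosen for illustration. This only raises the ceiling on recursion depth and does not remove the recursion, so a large enough input can still overflow.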