
Stack Overflow when parsing HTML

Open: lihaoyi opened this issue 12 years ago • 1 comment

I'm grabbing the XHTML 1.0 from this page:

http://en.wikipedia.org/wiki/Adams_State_College

as one big string. This is on Scala 2.9.1, Windows 7, JRE 1.6. When I try to run:

var xml = XML.fromString(body)

it throws a StackOverflowError. A short segment of the stack trace looks like:

at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)

The rest of the stack trace looks about the same. This is a fairly typical, if slightly large, XHTML page (~450 KB), and it should not cause the XML parser to fail.
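
For anyone reading along, the repeating Loop / GroupHead / Branch / GroupTail frames in the trace are what java.util.regex tends to produce when a repeated group containing an alternation is matched recursively, adding frames for each repetition. The sketch below is only an illustration under assumptions: the pattern `(?:a|b)*` and the names `RegexOverflowSketch`, `matchesSafely`, and `withStackSize` are hypothetical and are not taken from anti-xml. Whether a given input actually overflows depends on the JVM version and the thread's stack size.

```scala
import java.util.regex.Pattern

object RegexOverflowSketch {
  // Hypothetical pattern (not anti-xml's): an alternation inside a repeated group.
  val pattern: Pattern = Pattern.compile("(?:a|b)*")

  // Attempts a full match and reports a stack overflow instead of letting it propagate.
  def matchesSafely(input: String): Either[String, Boolean] =
    try Right(pattern.matcher(input).matches())
    catch { case _: StackOverflowError => Left("stack overflow while matching") }

  // Runs `body` on a new thread with the requested stack size in bytes.
  // The JVM treats the size as a hint, so this is a possible workaround, not a guarantee.
  def withStackSize[A](bytes: Long)(body: => A): A = {
    var result: Option[A] = None
    var failure: Option[Throwable] = None
    val worker = new Thread(null, new Runnable {
      def run(): Unit =
        try result = Some(body)
        catch { case e: Throwable => failure = Some(e) }
    }, "big-stack", bytes)
    worker.start()
    worker.join()
    failure.foreach(e => throw e)
    result.get
  }
}
```

Something like `RegexOverflowSketch.withStackSize(256L * 1024 * 1024) { XML.fromString(body) }` might get past the failure on an affected version, but it only hides the deep recursion rather than removing it.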

lihaoyi, Mar 15 '12 21:03

I think this was fixed in #67. Can you try loading your files using the version on master?

ncreep, Mar 17 '12 18:03