enlive

StackOverflowError when parsing certain html

Open: prismofeverything opened this issue 9 years ago • 4 comments

Using enlive when reading certain urls gives me a StackOverflowError, with these parts of the stacktrace repeated over and over:

                           clojure.core/mapcat         core.clj: 2660
                             clojure.core/apply         core.clj:  630
                               clojure.core/seq         core.clj:  137
                                            ...                       
                            clojure.core/map/fn         core.clj: 2622
net.cgrand.enlive-html/zip-select-nodes*/select1/fn  enlive_html.clj:  512
   net.cgrand.enlive-html/zip-select-nodes*/select1  enlive_html.clj:  512
                                            ...                       
                            clojure.core/mapcat         core.clj: 2660
                             clojure.core/apply         core.clj:  630
                               clojure.core/seq         core.clj:  137
                                            ...                       
                            clojure.core/map/fn         core.clj: 2622
 net.cgrand.enlive-html/zip-select-nodes*/select1/fn  enlive_html.clj:  512
    net.cgrand.enlive-html/zip-select-nodes*/select1  enlive_html.clj:  512

Any way to avoid this? Are we just naively recurring somewhere? Can this be turned into a loop/recur?

Thank you!
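
On the loop/recur question, here is a minimal sketch of the general pattern. It is not enlive's code: `deep-tree`, `leaves-recursive` and `leaves-iterative` are hypothetical names, and the node shape (`:tag`, `:content`) is only assumed to resemble enlive's. The recursive version goes through `mapcat`, so every nesting level adds the `mapcat` / `apply` / `seq` / `map` frames seen in the trace; the iterative version keeps its pending nodes in an explicit stack on the heap.

```clojure
;; Hypothetical illustration only; this is NOT enlive's zip-select-nodes*.

(defn deep-tree
  "Builds a chain of n nested :div nodes around a single :span leaf."
  [n]
  (reduce (fn [child _] {:tag :div :content [child]})
          {:tag :span :content []}
          (range n)))

(defn leaves-recursive
  "Collects leaf nodes by recursing through mapcat.
   Each nesting level consumes additional JVM stack frames."
  [node]
  (if (empty? (:content node))
    [node]
    (mapcat leaves-recursive (:content node))))

(defn leaves-iterative
  "Collects the same leaf nodes, in the same order, using loop/recur
   and an explicit stack of pending nodes instead of the call stack."
  [root]
  (loop [stack (list root)
         acc   []]
    (if-let [node (peek stack)]
      (let [children (:content node)]
        (if (empty? children)
          ;; Leaf: record it and continue with the rest of the stack.
          (recur (pop stack) (conj acc node))
          ;; Branch: push children so that the first child ends up on top.
          (recur (into (pop stack) (reverse children)) acc)))
      acc)))
```

With this sketch, `(count (leaves-iterative (deep-tree 100000)))` returns 1, whereas `leaves-recursive` on an input that deep would be expected to exhaust a default-sized JVM stack. Whether enlive's selector code can be restructured this way is a separate question that this sketch does not answer.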

prismofeverything, Nov 04 '15 21:11

I'm getting this as well. Digging through logs now to find some example data...

retnuh, Feb 02 '16 13:02

Can you provide a failing gist please?

fdserr, Feb 03 '16 08:02

https://gist.github.com/retnuh/9747891f2d1fb74e787b

I've stripped the Clojure down to more or less bare bones, but haven't had time to dig through the HTML file. At first I thought it might be the STYLE tag outside the HTML tag, but a stripped-down version (i.e. most of the body removed) works okay.

bad2.html also triggers StackOverflowError, and it happens much more quickly.

retnuh, Feb 03 '16 11:02

Thanks Hunter.

> but a stripped down version (i.e. most of the body removed) works okay.

The snippet seems all right, but the HTML file is too large for us to investigate. It would help a lot if you could track down exactly where it blows up. Alternatively, try using JSoup as the parser, as it is more robust than TagSoup.
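
For reference, a minimal sketch of switching parsers, assuming an enlive version that ships the `net.cgrand.jsoup` namespace and accepts an options map as the second argument to `html-resource` (check this against the version you have installed). The namespace `example.scrape` and the function `fetch-nodes` are hypothetical names, and the file path is a placeholder.

```clojure
(ns example.scrape
  (:require [net.cgrand.enlive-html :as html]
            [net.cgrand.jsoup :as jsoup]
            [clojure.java.io :as io]))

(defn fetch-nodes
  "Parses the given file with the JSoup-backed parser instead of the
   default TagSoup one and returns enlive nodes.
   Assumption: html-resource accepts {:parser ...} as its second argument."
  [path]
  (html/html-resource (io/file path) {:parser jsoup/parser}))

;; Usage, with a placeholder path and selector:
;; (html/select (fetch-nodes "bad2.html") [:title])
```

Changing the parser only changes how the node tree is built. If the overflow comes from selecting over a very deeply nested tree, it may persist whichever parser is used.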

fdserr, Feb 03 '16 16:02