HandsomeSoup
Content missing for some websites.
The following URL doesn't seem to load any actual content (just metadata). Some pages on that site seem fine. Any idea what's up?
λ let url = "http://www.goodreads.com/book/show/301538.The_Darkness_That_Comes_Before?from_search=true"
λ runX $ fromUrl url
[NTree (XTag "/" [NTree (XAttr "http-Content-Length") [NTree (XText "386810") []]
,NTree (XAttr "http-Transfer-Encoding") [NTree (XText "chunked") []]
,NTree (XAttr "http-Set-Cookie") [NTree (XText "_session_id2=82884d397b7fcd985680433233ba3154; path=/; expires=Fri, 22-Aug-2014 04:20:14 GMT; HttpOnly") []]
,NTree (XAttr "http-X-Runtime") [NTree (XText "1.612029") []]
,NTree (XAttr "http-Cache-Control") [NTree (XText "max-age=0, private, must-revalidate") []]
,NTree (XAttr "http-ETag") [NTree (XText "\"d5ff33fa33ea6cd6c3f85076da8e4132\"") []]
,NTree (XAttr "http-X-UA-Compatible") [NTree (XText "IE=Edge,chrome=1") []]
,NTree (XAttr "http-X-Request-Id") [NTree (XText "0VR3CZ02NQRRFSJK9KT3") []]
,NTree (XAttr "http-Vary") [NTree (XText "User-Agent,Accept-Encoding") []]
,NTree (XAttr "http-Status") [NTree (XText "200 OK") []]
,NTree (XAttr "http-Content-Type") [NTree (XText "text/html; charset=utf-8") []]
,NTree (XAttr "transfer-Encoding") [NTree (XText "UTF-8") []]
,NTree (XAttr "transfer-MimeType") [NTree (XText "text/html") []]
,NTree (XAttr "http-Server") [NTree (XText "Server") []]
,NTree (XAttr "http-Date") [NTree (XText "Thu, 21 Aug 2014 22:20:14 GMT") []]
,NTree (XAttr "transfer-Version") [NTree (XText "HTTP/1.1") []]
,NTree (XAttr "transfer-Message") [NTree (XText "OK") []]
,NTree (XAttr "transfer-Status") [NTree (XText "200") []]
,NTree (XAttr "transfer-URI") [NTree (XText "http://www.goodreads.com/book/show/301538.The_Darkness_That_Comes_Before?from_search=true") []]
,NTree (XAttr "source") [NTree (XText "http://www.goodreads.com/book/show/301538.The_Darkness_That_Comes_Before?from_search=true") []]]) []]
That's weird. Those are all the HTTP headers, and the Content-Length is 386810, which means the whole page is being sent. I'm not sure where the body of the response is going.
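One guess: HXT's native HTTP backend might be tripping over the chunked (and, given that `Vary: User-Agent,Accept-Encoding` header, possibly compressed) body, leaving only the headers in the tree. A sketch of fetching through the curl backend instead, assuming the `hxt-curl` package is installed (`withCurl` comes from `Text.XML.HXT.Curl`):

```haskell
import Text.XML.HXT.Core
import Text.XML.HXT.Curl (withCurl)   -- from the hxt-curl package (assumes it is installed)
import Text.HandsomeSoup (css)

main :: IO ()
main = do
  let url = "http://www.goodreads.com/book/show/301538.The_Darkness_That_Comes_Before?from_search=true"
  -- readDocument with withCurl fetches via libcurl instead of HXT's native HTTP code,
  -- which sidesteps the native backend if it is the one dropping the body
  titles <- runX $ readDocument [withParseHTML yes, withWarnings no, withCurl []] url
                   >>> css "title" /> getText
  mapM_ putStrLn titles
```

If this prints the page title, the selector pipeline is fine and the problem is confined to how the native backend reads the response.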
I tried a workaround like this, but it looks like there's something wrong with parsing that document:
λ import Network.HTTP
λ import Text.XML.HXT.Core
λ import Text.HandsomeSoup
λ html <- simpleHTTP (getRequest url) >>= getResponseBody
-- html looks correct
λ runX $ parseHtml html >>> css "span"
[]
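To narrow down whether the empty result comes from `parseHtml` or from the fetched string itself, it may help to run the same selector over a tiny inline document. This is just a debugging sketch, not from the original report; if it matches, the parser and `css` are fine and the suspect is the downloaded bytes (e.g. a compressed body arriving as a `String`):

```haskell
import Text.XML.HXT.Core
import Text.HandsomeSoup

main :: IO ()
main = do
  -- same parseHtml >>> css pipeline as above, but over a known-good document
  texts <- runX $ parseHtml "<html><body><span>one</span><span>two</span></body></html>"
                  >>> css "span" /> getText
  print texts  -- expect ["one","two"]
```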