HandsomeSoup

Content missing for some websites.

Open bobjflong opened this issue 9 years ago • 2 comments

Fetching the following URL with fromUrl returns only metadata (the HTTP headers, as attributes on the root node) and none of the actual page content. Some other pages on that site seem fine. Any idea what's up?

λ let url = "http://www.goodreads.com/book/show/301538.The_Darkness_That_Comes_Before?from_search=true"

λ runX $ fromUrl url
[NTree (XTag "/" [NTree (XAttr "http-Content-Length") [NTree (XText "386810") []],NTree (XAttr "http-Transfer-Encoding") [NTree (XText "chunked") []],NTree (XAttr "http-Set-Cookie") [NTree (XText "_session_id2=82884d397b7fcd985680433233ba3154; path=/; expires=Fri, 22-Aug-2014 04:20:14 GMT; HttpOnly") []],NTree (XAttr "http-X-Runtime") [NTree (XText "1.612029") []],NTree (XAttr "http-Cache-Control") [NTree (XText "max-age=0, private, must-revalidate") []],NTree (XAttr "http-ETag") [NTree (XText "\"d5ff33fa33ea6cd6c3f85076da8e4132\"") []],NTree (XAttr "http-X-UA-Compatible") [NTree (XText "IE=Edge,chrome=1") []],NTree (XAttr "http-X-Request-Id") [NTree (XText "0VR3CZ02NQRRFSJK9KT3") []],NTree (XAttr "http-Vary") [NTree (XText "User-Agent,Accept-Encoding") []],NTree (XAttr "http-Status") [NTree (XText "200 OK") []],NTree (XAttr "http-Content-Type") [NTree (XText "text/html; charset=utf-8") []],NTree (XAttr "transfer-Encoding") [NTree (XText "UTF-8") []],NTree (XAttr "transfer-MimeType") [NTree (XText "text/html") []],NTree (XAttr "http-Server") [NTree (XText "Server") []],NTree (XAttr "http-Date") [NTree (XText "Thu, 21 Aug 2014 22:20:14 GMT") []],NTree (XAttr "transfer-Version") [NTree (XText "HTTP/1.1") []],NTree (XAttr "transfer-Message") [NTree (XText "OK") []],NTree (XAttr "transfer-Status") [NTree (XText "200") []],NTree (XAttr "transfer-URI") [NTree (XText "http://www.goodreads.com/book/show/301538.The_Darkness_That_Comes_Before?from_search=true") []],NTree (XAttr "source") [NTree (XText "http://www.goodreads.com/book/show/301538.The_Darkness_That_Comes_Before?from_search=true") []]]) []]

bobjflong, Aug 21 '14 22:08

That's weird. Those are all the HTTP headers, and Content-Length is 386810, which means the whole page is being sent. Not sure where the body of the response is going.

egonSchiele, Aug 22 '14 06:08

I tried a workaround like this, fetching the body myself and parsing it separately, but it looks like there's something up with parsing that document:

λ import Network.HTTP

λ html <- simpleHTTP (getRequest url) >>= getResponseBody
-- html looks correct

λ runX $ parseHtml html >>> css "span"
[]

bobjflong, Aug 22 '14 08:08
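
One way to narrow down the parsing problem described above is to parse the same string with HXT directly, leaving parser warnings switched on, and list the element names that end up in the tree. The following is only a sketch: elementNames is a hypothetical helper, not part of HandsomeSoup, and it assumes that parseHtml wraps HXT's readString with warnings disabled, which would explain why a failed parse shows up as an empty result rather than an error.

```haskell
import Text.XML.HXT.Core

-- Hypothetical diagnostic helper (not part of HandsomeSoup).
-- Parses an HTML string with HXT's built-in HTML parser, keeps parser
-- warnings enabled, and returns the name of every element below the root.
elementNames :: String -> IO [String]
elementNames html =
  runX $
    readString [withParseHTML yes, withWarnings yes] html
      >>> getChildren     -- step below the synthetic "/" root node
      >>> multi isElem    -- visit every element, including nested ones
      >>> getName
```

In GHCi, elementNames html on the string fetched with simpleHTTP should show whether any elements survive parsing at all. If the list is empty, or stops well before the page body, the problem is likely in the HTML parser rather than in the css selector.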