
sess.visit() sometimes hangs

Open · pommygranite opened this issue on Apr 04 '12 · 10 comments

I have a program that cycles through a list of several thousand URLs on different domains, calling sess.visit() for each one without creating a new session object. Usually, after visiting several hundred of these URLs, a visit() call does not return. Waiting several hours has no effect; the operation has hung inside visit(). When the process is interrupted, it displays this trace:

File "/home/user1/projects/MyBot/MyScraper.py", line 50, in Scrape sess.visit(site_url) File "/usr/local/lib/python2.7/dist-packages/dryscrape/session.py", line 35, in visit return self.driver.visit(self.complete_url(url)) File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 211, in visit self.conn.issue_command("Visit", url) File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 429, in issue_command return self._read_response() File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 433, in _read_response result = self._readline() File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 467, in _readline c = self._sock.recv(1)

If the URL that caused the hang is then visited on its own, visit() returns successfully, so the problem does not seem to be related to the URL being visited, but rather to some internal WebKit state. (The trace shows the client blocked in self._sock.recv(1), waiting for a reply from the webkit_server process.) The number of iterations before hanging seems random: sometimes it occurs after fewer than 100 visits, sometimes after several hundred.

Here's a script that visits the same site 1000 times and will probably demonstrate the problem at some point:

```python
from dryscrape import Session
from dryscrape.driver.webkit import Driver
from webkit_server import InvalidResponseError

link = 'http://insert-some-site-here-that-doesnt-mind-being-hammered.com'
sess = Session(driver=Driver())
sess.set_error_tolerant(True)
for i in range(1, 1000):
    try:
        sess.visit(link)
        sess.wait()
        print 'Success iteration', i
    except InvalidResponseError as e:
        print 'InvalidResponseError:', e
```

pommygranite commented on Apr 04 '12

I can reproduce this. I think it has to do with AJAX calls not finishing. The same problem occurs when visiting Gmail with the original webkit-server, for example. I have to do more investigation to find the root cause and maybe work around it.

niklasb commented on Apr 27 '12

This should be fixed in version 0.9.

niklasb commented on May 10 '14

Hi niklasb, sorry to comment on a closed issue, but this still seems to be happening for me in 0.9.1. I don't have any error/log messages to show yet; it just seems to be stuck on visit(). I can't interrupt the process, since it's part of my init script: I start a server, the init script runs a Python script that uses dryscrape to scrape some data, and then the server shuts down. I'll update when I have more info.

Wysie commented on Nov 21 '14

@Wysie Does this happen on any website or just a particular one? If the latter, can you give an example URL? If the former, that's weird.

niklasb commented on Nov 21 '14

@niklasb It seems to happen on a particular site, but it's inconsistent: sometimes it works, sometimes it doesn't. (The page gets new data at a particular time; I wake my server up, loop every 2 minutes, and scrape once the new data is in.) From my logging, the issue seems to be with at_xpath rather than visit. I'll let you know when I have more info. Thanks for your reply.

Wysie commented on Nov 24 '14

Can you give an example of a site where this happens sometimes?


niklasb commented on Nov 24 '14

@niklasb @Wysie this exact thing happens to me at http://stats.nba.com/league/team/#!/advanced/?DateTo=11%2F3%2F2014

I can usually get about 10 successful scrapes as I work my way through all the dates, and then it starts getting hung up. I've been trying to do a sess.reset() followed by a sleep before looping back and trying again (roughly the pattern sketched below), but it doesn't seem to help...

sjsnider commented on Mar 10 '15
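For illustration, here is a minimal sketch of the reset-and-retry pattern described above. It is a reconstruction under assumptions, not the commenter's actual code: scrape_date() and the dates list are hypothetical stand-ins.

```python
import time
import dryscrape

def scrape_date(sess, date):
    # Hypothetical per-date step standing in for the real scraping code.
    sess.visit('http://stats.nba.com/league/team/#!/advanced/?DateTo=' + date)
    sess.wait()

dates = ['11%2F1%2F2014', '11%2F2%2F2014', '11%2F3%2F2014']  # example values
sess = dryscrape.Session()
for date in dates:
    try:
        scrape_date(sess, date)
    except Exception:
        sess.reset()    # drop cookies and other state held by webkit_server
        time.sleep(10)  # back off before retrying the same date once
        scrape_date(sess, date)
```

Note that try/except only catches errors that are actually raised; it cannot interrupt a visit() that never returns, which may be why the reset did not help with the hang.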

I've hit the same problem. Maybe a wait_timeout parameter could be added, like Ghost.py's session.wait_timeout = 20: if the maximum wait_timeout is exceeded, an error is thrown (a rough client-side version of this idea is sketched below).

kaiwang0112006 commented on Sep 14 '15
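Until something like that exists in dryscrape itself, a client-side timeout can be approximated with POSIX signals. This is a minimal sketch, assuming CPython on a Unix-like system and calls from the main thread; visit_with_timeout and VisitTimeout are invented names here, not dryscrape API:

```python
import signal
import dryscrape

class VisitTimeout(Exception):
    pass

def _alarm_handler(signum, frame):
    raise VisitTimeout('visit() did not return in time')

def visit_with_timeout(sess, url, seconds=20):
    # SIGALRM interrupts the blocking socket read inside webkit_server,
    # so a hung visit() surfaces as an exception instead of blocking forever.
    old_handler = signal.signal(signal.SIGALRM, _alarm_handler)
    signal.alarm(seconds)
    try:
        sess.visit(url)
    finally:
        signal.alarm(0)                             # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)  # restore previous handler
```

After a VisitTimeout the connection to webkit_server may be in an undefined state, so it is probably safer to recreate the session than to keep using it, much like the workaround in the next comment.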

Hello, I solved this issue by adding return None at the end of the function that uses the session, and by defining the session variable inside that same function:

```python
import dryscrape
import time

def dostuff():
    session = dryscrape.Session()  # create a fresh session on every call
    session.visit('url')
    response = session.body()
    print(response)
    return None

dostuff()
time.sleep(120)
dostuff()  # new data will be printed
```

return None (the same as a bare return) just exits the function cleanly, and because the session object is created inside the function, it is discarded after each call, so every call starts with a fresh session. Tell me if it helped.

YasserAntonio commented on Aug 03 '16

When I visit a website over HTTPS, this exception is raised:

```
webkit_server.InvalidResponseError: {"class":"InvalidResponseError","message":"Unable to load URL: https://www.instagram.com/jessiej/ because of error loading https://www.instagram.com/jessiej/: Unknown error"}
```

for-nia commented on Mar 09 '17