Cancel does not trigger shutdown handlers (on Windows)
When using the cancel REST API method, the crawler process is terminated without calling the registered shutdown handler (spider_closed), at least on Windows. This is my code:
    import os

    from scrapy import signals

    # `engine` is a SQLAlchemy engine configured elsewhere in the project.


    class SpiderCtlExtension(object):

        @classmethod
        def from_crawler(cls, crawler):
            ext = SpiderCtlExtension()
            ext.project_name = crawler.settings.get('BOT_NAME')
            crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
            crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
            return ext

        def spider_opened(self, spider):
            # Mark the Scrapyd job as running in the control table.
            sql = """UPDATE ctl_crawler
                     SET status = 'RUNNING'
                     WHERE jobid = '{}' """.format(os.getenv("SCRAPY_JOB"))
            engine.execute(sql)

        def spider_closed(self, spider, reason):
            # Record the close reason (e.g. FINISHED, SHUTDOWN) on exit.
            print("CLOSE SPIDER")
            sql = """UPDATE ctl_crawler
                     SET status = '{}'
                     WHERE jobid = '{}' """.format(reason.upper(), os.getenv("SCRAPY_JOB"))
            engine.execute(sql)
The spider_opened method gets called, and spider_closed gets called when the crawl actually finishes. On a cancel, however, spider_closed is never called.
Another symptom is that the spider's log ends abruptly, without a log entry for the closing event. After going through the sources, I suspect the culprit is actually the way Twisted handles signals on Windows:
http://twistedmatrix.com/trac/browser/tags/releases/twisted-8.0.0/twisted/internet/_dumbwin32proc.py#L245
    def signalProcess(self, signalID):
        if self.closed:
            raise error.ProcessExitedAlready()
        if signalID in ("INT", "TERM", "KILL"):
            win32process.TerminateProcess(self.hProcess, 1)
In other words, on Windows any of INT, TERM, or KILL results in an immediate TerminateProcess, so no Python-level cleanup ever runs. If I understand correctly what is happening, you have the following setup:
- Twisted container (created in scrapyd/launcher.py)
  - Crawler Process (created by scrapy/crawler.py)
    - Twisted container
      - Crawler
The issue here is that the outer Twisted container exits immediately, as is also indirectly noted here: https://github.com/scrapy/scrapy/issues/1001#issuecomment-68720943
To fix this, it is necessary to somehow trigger a graceful shutdown of the Crawler Process without terminating the outer container right away.
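One possible direction, sketched below, is Windows-specific: if the crawler process is launched in its own process group, the parent can deliver CTRL_BREAK_EVENT instead of hard-terminating it, and the child can translate that into a graceful stop. This is only an illustration under those assumptions, not Scrapyd's actual launcher code; the spider name is a placeholder.

    import os
    import signal
    import subprocess
    import sys

    # Launch the crawler in its own process group so a console control
    # event can be targeted at it alone (Windows-only creation flag).
    proc = subprocess.Popen(
        [sys.executable, "-m", "scrapy", "crawl", "myspider"],
        creationflags=subprocess.CREATE_NEW_PROCESS_GROUP,
    )

    # To cancel: CTRL_BREAK_EVENT is delivered to the child as SIGBREAK,
    # which a handler can catch, unlike TerminateProcess().
    os.kill(proc.pid, signal.CTRL_BREAK_EVENT)

    # A hard kill, by contrast, is equivalent to TerminateProcess and
    # gives the child no chance to run spider_closed:
    # proc.kill()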
I agree that graceful-shutdown support would be very useful, especially for those relying on the spider_closed signal for cleanup and summary/stats logging.
Scrapyd appears to accept a signal for cancellation.
And here's a peek at Twisted's signalProcess() handling.
I tried throwing "-d signal=INT" at cancel.json but that didn't produce the desired results.
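For reference, a cancel request with an explicit signal looks roughly like this (the project and job values are placeholders):

    curl http://localhost:6800/cancel.json -d project=myproject -d job=<jobid> -d signal=INT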
Does cancel.json do a graceful shutdown on Linux?
Hi, is there any solution for this issue?
Hmm, not sure if Windows uses a different signal, like SIGBREAK?
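If SIGBREAK is indeed the catchable signal on Windows, a handler in the spider process could translate it into a graceful reactor stop. A minimal sketch, assuming the crawl runs under a Twisted reactor and a CTRL_BREAK_EVENT actually reaches the process:

    import signal
    import sys

    from twisted.internet import reactor

    def on_break(signum, frame):
        # Request reactor shutdown from the signal handler; Scrapy hooks
        # reactor shutdown to stop its crawlers, so spider_closed can fire.
        reactor.callFromThread(reactor.stop)

    if sys.platform == "win32":
        # SIGBREAK exists only on Windows; it is raised when the process
        # receives a CTRL_BREAK_EVENT console control event.
        signal.signal(signal.SIGBREAK, on_break)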