
Cancel does not trigger shutdown handlers (on windows)

Open • kutschkem opened this issue on Mar 03 '15 • 3 comments

When using the cancel REST API method, the crawler process is terminated without calling the registered shutdown handler (spider_closed), at least on Windows. This is my code:

    import os

    from scrapy import signals

    # NOTE: `engine` is assumed to be a SQLAlchemy engine defined elsewhere in the project.


    class SpiderCtlExtension(object):

        @classmethod
        def from_crawler(cls, crawler):
            ext = SpiderCtlExtension()

            ext.project_name = crawler.settings.get('BOT_NAME')
            crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
            crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)

            return ext

        def spider_opened(self, spider):
            # Mark the job as running when the spider starts.
            sql = """UPDATE ctl_crawler
                     SET status = 'RUNNING'
                     WHERE jobid = '{}'""".format(os.getenv("SCRAPY_JOB"))
            engine.execute(sql)

        def spider_closed(self, spider, reason):
            # Record the close reason when the spider stops.
            print("CLOSE SPIDER")
            sql = """UPDATE ctl_crawler
                     SET status = '{}'
                     WHERE jobid = '{}'""".format(reason.upper(), os.getenv("SCRAPY_JOB"))
            engine.execute(sql)
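
For context, an extension like this is enabled through the EXTENSIONS setting; the dotted path below is just an assumed example of where the class might live:

    # settings.py -- the module path is a placeholder for wherever SpiderCtlExtension is defined
    EXTENSIONS = {
        "myproject.extensions.SpiderCtlExtension": 500,
    }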

The spider_opened method gets called, and the spider_closed method gets called when the crawl finishes normally. On a cancel, however, spider_closed is never called.

Another symptom is that the spider's log ends abruptly, without a log entry for the closing event. After going through the sources, I suspect the culprit is the way Twisted handles signals on Windows: signalProcess() maps INT, TERM, and KILL all to TerminateProcess, which kills the child immediately without giving Python a chance to run its handlers:

http://twistedmatrix.com/trac/browser/tags/releases/twisted-8.0.0/twisted/internet/_dumbwin32proc.py#L245

    def signalProcess(self, signalID):
        if self.closed:
            raise error.ProcessExitedAlready()
        if signalID in ("INT", "TERM", "KILL"):
            win32process.TerminateProcess(self.hProcess, 1)

If I understand correctly what is happening, the setup is as follows:

  • Twisted container (created in scrapyd/launcher.py)
    • Crawler Process (created by scrapy/crawler.py)
      • Twisted container
        • Crawler

The issue is that the outer Twisted container exits immediately, as is also indirectly noted here: https://github.com/scrapy/scrapy/issues/1001#issuecomment-68720943

To fix this, it is necessary to somehow trigger a graceful shutdown of the Crawler Process, without terminating the outer container right away.
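
One possible direction, sketched here purely as an illustration (this is not how Scrapyd/Twisted currently spawn the crawl process): if the child were started in its own process group on Windows, the launcher could deliver CTRL_BREAK_EVENT instead of calling TerminateProcess, which should give Scrapy's shutdown handlers a chance to run. The spider name and timeout below are placeholders:

    # Sketch only: assumes Windows and a child started with CREATE_NEW_PROCESS_GROUP.
    import os
    import signal
    import subprocess
    import sys

    # Launch the crawl in its own process group so it can receive CTRL_BREAK events.
    proc = subprocess.Popen(
        [sys.executable, "-m", "scrapy", "crawl", "myspider"],  # spider name is a placeholder
        creationflags=subprocess.CREATE_NEW_PROCESS_GROUP,
    )

    # To "cancel": deliver CTRL_BREAK instead of hard-terminating the process,
    # so spider_closed and other shutdown handlers can still fire.
    os.kill(proc.pid, signal.CTRL_BREAK_EVENT)
    proc.wait(timeout=60)  # give the spider time to flush its state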

kutschkem • Mar 03 '15 11:03

I agree that graceful shutdown support would be very useful, especially for those relying on the spider_closed signal for cleanup and summary/stats logging.

Scrapyd appears to accept a signal for cancellation.

And here's a peek at Twisted's signalProcess() handling.

I tried throwing "-d signal=INT" at cancel.json but that didn't produce the desired results.
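
For reference, the request looked roughly like this (project name and job ID are placeholders):

    curl http://localhost:6800/cancel.json -d project=myproject -d job=JOBID -d signal=INT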

Does cancel.json do a graceful shutdown on Linux?

pwinzer • Sep 23 '19 17:09

Hi, is there any solution to this issue yet?

Dashu-Xu • May 23 '20 04:05

Hmm, not sure if Windows uses a different signal, like SIGBREAK?
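
If so, the crawl process would only shut down cleanly if something handles that signal; a minimal sketch of what a SIGBREAK handler looks like on Windows (the handler body is a placeholder):

    import signal
    import sys

    def _on_break(signum, frame):
        # Placeholder: ask the crawler engine to stop gracefully instead of exiting hard.
        sys.exit(0)

    # SIGBREAK only exists on Windows (CTRL_BREAK_EVENT is delivered as SIGBREAK).
    if hasattr(signal, "SIGBREAK"):
        signal.signal(signal.SIGBREAK, _on_break)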

jpmckinney • Sep 24 '21 00:09