scrapy-playwright
scrapy-playwright copied to clipboard
Does not shutdown cleanly on SIGINT (cmd+c)
It seems that when using scrapy-playwright Scrapy will not shut down cleanly on SIGINT (cmd+c
), and you have to force a shutdown with a second cmd+c
. If you use the telnet client to run engine.stop()
it does seem to be shutting down cleanly. This is needed in order to save the current state for resuming, an unclean shutdown does not save the current state and cannot be resumed. It stops scraping but continues to log its stats every minute.
On some investigation Playwright has a launch arg handle_sigint
which I believe indicates to forward a SIGINT to the browser process. I thought it may have been this forwarding of the SIGINT that was causing the hang in the handling of shutdown. When I set this to false
(it defaults to true
) with PLAYWRIGHT_LAUNCH_OPTIONS
it still doesn't shutdown cleanly, however all chromium processes are stoped but Scrapy does not shut down and there are no errors, even during a forced shutdown. It just continues to report its stats every minute showing 0 more pages scraped.
I'm happy to continue investigating how to fix this but would appreciate any pointers in the right direction as to where the issue may be arising.
Edit:
Just want to add I am confident that what is happening is that the SIGINT is still passed on to the browser. With engine.stop()
via telnet I see a very graceful shut down (up to a couple of minutes long) with pages currently being processed in my parse
call-backs all finishing. SIGINT/cmd+c just immediately kills the browser processes.
So to answer my own question, I have solved it.
This is a problem upstream with Playwright, currently there isn't a way to prevent a sigint being passed down to playwright. handle_sigint=False
stops playwright from detecting the sigint and doing some of its own handling, that probably wants turning of when used with Scrapy, I have set it false.
The other thing I have done is monkey patch playwright to stop passing down the sigint:
# Monkeypatch Playwright
from playwright._impl._transport import PipeTransport, _get_stderr_fileno
def new_pipe_transport_connect_preexec(): # Don't forward signals.
os.setpgrp()
async def new_pipe_transport_connect(self):
self._stopped_future: asyncio.Future = asyncio.Future()
# Hide the command-line window on Windows when using Pythonw.exe
creationflags = 0
if sys.platform == "win32" and sys.stdout is None:
creationflags = subprocess.CREATE_NO_WINDOW
try:
# For pyinstaller
env = os.environ.copy()
if getattr(sys, "frozen", False):
env.setdefault("PLAYWRIGHT_BROWSERS_PATH", "0")
self._proc = await asyncio.create_subprocess_exec(
str(self._driver_executable),
"run-driver",
stdin=asyncio.subprocess.PIPE,
stdout=asyncio.subprocess.PIPE,
stderr=_get_stderr_fileno(),
limit=32768,
creationflags=creationflags,
env=env,
preexec_fn=new_pipe_transport_connect_preexec, # My new line!!
)
except Exception as exc:
self.on_error_future.set_exception(exc)
raise exc
self._output = self._proc.stdin
PipeTransport.connect = new_pipe_transport_connect
This is inspired by https://stackoverflow.com/a/5446982/18244376 and only worked on Posix.
Going to file a bug with Playwright asking for an api to enable this.
Thank you for the research and for opening the upstream feature request, let's wait and see what they suggest.