scrapy-playwright issues

Cannot download binary file (PDF) with Chromium headless=new mode

13

When I try to download a PDF file with Chromium, response.body is the viewer plugin HTML, not the file's bytes. There's already a related fix...

tommylge

upstream issue
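
A minimal sketch of how a spider callback could detect the symptom described above, assuming the only check needed is whether the body starts with the standard PDF header `%PDF-`. The helper name `looks_like_pdf` is hypothetical and not part of scrapy-playwright; it only distinguishes real PDF bytes from the HTML of a viewer page and does not fix the download itself.

```python
PDF_MAGIC = b"%PDF-"  # every PDF file begins with this header


def looks_like_pdf(body: bytes) -> bool:
    """Return True if the body starts with the PDF header.

    Hypothetical helper: leading whitespace is tolerated, and only the
    first kilobyte is inspected so large bodies are not copied.
    """
    return body[:1024].lstrip().startswith(PDF_MAGIC)


# Example: a real PDF header versus an HTML viewer page.
pdf_body = b"%PDF-1.7\n1 0 obj\n"
html_body = b"<!DOCTYPE html><html><body><embed type='application/pdf'></body></html>"
results = (looks_like_pdf(pdf_body), looks_like_pdf(html_body))
```

In a callback, the same check could be applied to `response.body` before the file is written to disk.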

Guideline on how to use scrapy-playwright based on a real corporate use case

3

Hey folks. At our company, BuscoJobs, we use scrapy-playwright in 48 spiders. We have created a guideline (in Spanish) to help users get started with this...

fioreagui

documentation

ERROR: Task was destroyed but it is pending!

10

I sometimes get this error when I use scrapy-playwright:

```
2023-03-31 09:33:35 [asyncio] ERROR: Task was destroyed but it is pending!
source_traceback: Object created at (most recent call last):
  File...
```

ma-pony

upstream issue

How to stop closing browser

1

I start a Chrome browser with CDP port 40000 and set PLAYWRIGHT_CDP_URL = "http://localhost:40000" in my settings file, but every time Scrapy starts working, it creates a new browser, and...

hackerpayne

deprioritized
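
For context, the setup this report describes could look roughly like the settings sketch below. It assumes Chrome was started separately with `--remote-debugging-port=40000`. The handler and reactor values are the ones quoted elsewhere in this list, the CDP URL is the one from the report, and `PLAYWRIGHT_BROWSER_TYPE` is added here as an assumption. The sketch only shows the configuration; it does not address the behavior being reported.

```python
# settings.py (illustrative sketch, not the reporter's actual file)

# Route HTTP(S) downloads through scrapy-playwright.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright requires the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Assumption: connecting over CDP targets a Chromium-based browser.
PLAYWRIGHT_BROWSER_TYPE = "chromium"

# Connect to the already-running browser instead of launching one.
PLAYWRIGHT_CDP_URL = "http://localhost:40000"
```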

images don't appear to get read from the persistent context properly / cached

1

I'm having trouble getting Scrapy + Playwright to respect caches when crawling, when using a persistent context. I've tried to get it down to a minimal example, which you can...

pjlsergeant

upstream issue

Page hangs on function instead of redirecting

I am attempting an SSO login to a website (I have access to it) via scrapy-playwright, and find that my Playwright script hangs when I use `wait_for_function`, and this recursively produces...

lime-n

needs more info

How to recreate a context (browser) if it was closed remotely

Greetings! I am using scrapy-playwright along with a Selenium Grid browser cluster. If the spider's crawling process is delayed, the cluster can forcibly close the session and...

Schtil

Unhandled browser crash event

4

When Chrome is killed or crashes, the context keeps trying to create new pages and throws an exception:

```log
2023-01-31 19:29:51 [scrapy.core.scraper] ERROR: Error downloading
Traceback (most recent call last):
  File "/home/test/source/test/venv/lib/python3.10/site-packages/twisted/internet/defer.py",...
```

NiuBlibing

scrapy-playwright cannot start working if the reactor is already installed

11

```
Python 3.9.13
Daphne 4.0.0
Django 4.1.2
Channels 4.0.0
Scrapy 2.7.0
scrapy-playwright 0.0.22
```

My settings:

```python
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

My...

alosultan

needs more info
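
A small diagnostic sketch for the situation in this report, assuming the host process (for example an ASGI server) may have installed a Twisted reactor before Scrapy starts. Twisted records the installed reactor in `sys.modules["twisted.internet.reactor"]`, so the check below works without importing Twisted, which would itself install a default reactor. The function names are hypothetical and not part of Scrapy or scrapy-playwright.

```python
import sys
from typing import Optional

ASYNCIO_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"


def installed_reactor_path() -> Optional[str]:
    """Return the import path of the installed reactor class, or None.

    Reads sys.modules directly so that calling this function never
    installs a reactor as a side effect.
    """
    reactor = sys.modules.get("twisted.internet.reactor")
    if reactor is None:
        return None
    cls = type(reactor)
    return f"{cls.__module__}.{cls.__qualname__}"


def reactor_is_compatible() -> bool:
    """True if no reactor is installed yet, or the asyncio one is."""
    path = installed_reactor_path()
    return path is None or path == ASYNCIO_REACTOR
```

Calling `installed_reactor_path()` just before the crawler is started shows which reactor, if any, the surrounding application has already installed.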

awswaf challenge http status 202

6

When AWS WAF challenges the browser, it returns the page with HTTP status 202 and replaces the page content with JavaScript. The page then initiates the corresponding request. If it...

icaca

needs more info
