scrapy-playwright
Proxy removes cookies
Without a proxy, the cookie is applied correctly. But when I use a proxy (Bright Data), the cookie is not applied. Did I miss anything?
```python
import scrapy


class ScrapyTest(scrapy.Spider):
    name = 'scrapy test'

    def start_requests(self):
        cookies = {
            'cookieconsent_dismissed': 'yes'
        }
        url = 'https://example.com'
        yield scrapy.Request(url, cookies=cookies, meta={"playwright": True}, callback=self.parse)
```
`settings.py`:

```python
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "proxy": {
        'server': 'http://zproxy.lum-superproxy.io:22225',
        'username': 'lum-customer-user',
        'password': 'password',
    }
}
COOKIES_ENABLED = True
```
Thanks for the report. I can reproduce this with the following spider and a mitmproxy instance running locally:
```python
from scrapy import Request, Spider


class PlaywrightSpiderWithProxy(Spider):
    name = "proxy-spider"
    custom_settings = {
        "LOG_LEVEL": "INFO",
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            # "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "proxy": {
                # on a separate terminal:
                # ./mitmproxy --proxyauth "user:pass"
                "server": "http://127.0.0.1:8080",
                "username": "user",
                "password": "pass",
            },
        },
    }

    def start_requests(self):
        yield Request(
            url="http://httpbin.org/headers",
            meta={"playwright": True},
            cookies={"foo": "bar"},
        )

    def parse(self, response):
        print(response.request.headers["Cookie"])
        print(response.text)
```
The cookie is present in the request headers; however, no "Cookie" header was received by the server:
```
b'foo=bar'
<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en",
    "Cache-Control": "no-cache",
    "Content-Length": "0",
    "Host": "httpbin.org",
    "Pragma": "no-cache",
    "Proxy-Connection": "keep-alive",
    "User-Agent": "Scrapy/2.6.0 (+https://scrapy.org)",
    "X-Amzn-Trace-Id": "Root=1-62f39a25-2d7f27dc07329815654c8f19"
  }
}
</pre></body></html>
```
Without configuring the proxy:
```
b'foo=bar'
<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en",
    "Cache-Control": "no-cache",
    "Content-Length": "0",
    "Cookie": "foo=bar",
    "Host": "httpbin.org",
    "Pragma": "no-cache",
    "User-Agent": "Scrapy/2.6.0 (+https://scrapy.org)",
    "X-Amzn-Trace-Id": "Root=1-62f39baa-6929799a29e23a171baf57f3"
  }
}
</pre></body></html>
```
The interesting thing is that at this point, `overrides["headers"]["cookie"]` is `foo=bar` in both cases. Also, printing the request headers in the callback shows the expected value, as in the output I posted above.
I'm going to need to do some further investigation to determine whether there's anything else the handler is doing that might cause this, or whether this is perhaps an upstream issue.
This seems to be an upstream issue; I just opened https://github.com/microsoft/playwright/issues/16439 asking about it. There might be a way to work around this by setting cookies in the context before sending the request. However, I'm not sure, because cookies are set for the whole context and apply to multiple requests, which is not necessarily what we want here (clearing and repopulating the context cookies after each request seems like overkill). Let's wait and see what the Playwright team says.
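For reference, a minimal sketch of that context-level workaround, assuming the page object is requested with `playwright_include_page`. The helper name, cookie values and URL below are made up for illustration; it converts a Scrapy-style `{name: value}` cookies mapping into the list-of-dicts shape that Playwright's `BrowserContext.add_cookies` expects.

```python
# Hypothetical helper: build Playwright cookie dicts from a Scrapy-style
# {name: value} mapping. The URL tells Playwright which site the cookies
# belong to (alternatively, "domain" and "path" can be given instead).
def scrapy_cookies_to_playwright(cookies: dict, url: str) -> list:
    return [
        {"name": name, "value": value, "url": url}
        for name, value in cookies.items()
    ]


# Usage sketch, in a callback for a request sent with
# meta={"playwright": True, "playwright_include_page": True}:
#
# async def parse(self, response):
#     page = response.meta["playwright_page"]
#     await page.context.add_cookies(
#         scrapy_cookies_to_playwright({"foo": "bar"}, "http://httpbin.org")
#     )
```

Note that cookies added this way apply to every subsequent request made in that context, which is exactly the caveat mentioned above.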
@elacuesta I'm experiencing the same issue. What's the proper way to set cookies in the whole context?
To set cookies for a whole context at the Playwright level, I'd say there are at least three ways:

1. requesting to receive the page object in a callback with `playwright_include_page`, accessing the context and using `BrowserContext.add_cookies` on it
2. specifying `storage_state` in the `PLAYWRIGHT_CONTEXTS` setting
3. specifying `storage_state` in the `playwright_context_kwargs` request meta key

Examples for 2 & 3 can be found in the contexts.py file within the examples directory. There's also an example of accessing the context in a callback for (1) in these lines.
To be clear, I don't know whether these methods avoid the cookies being skipped when using proxies; please report back your findings if you can.
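As an illustration of option 2, an inline `storage_state` in the `PLAYWRIGHT_CONTEXTS` setting might look roughly like this. The context name, cookie name/value and domain are placeholders; the dict follows the shape Playwright accepts for `storage_state` (see the contexts.py example mentioned above for the full picture).

```python
# settings.py (sketch): a context named "default" pre-populated with one
# cookie via storage_state. Cookie name, value and domain are illustrative.
PLAYWRIGHT_CONTEXTS = {
    "default": {
        "storage_state": {
            "cookies": [
                {
                    "name": "foo",
                    "value": "bar",
                    "domain": "httpbin.org",
                    "path": "/",
                },
            ],
        },
    },
}
```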