
Proxy removes cookies

Open • jayavinothmoorthy opened this issue 3 years ago • 5 comments

Without a proxy, the cookie is applied correctly, but when I use a proxy (brightdata) the cookie is not applied. Did I miss anything?

import scrapy


class ScrapyTest(scrapy.Spider):
    name = 'scrapy test'

    def start_requests(self):

        cookies = {
            'cookieconsent_dismissed': 'yes'
        }

        url = 'https://example.com'

        yield scrapy.Request(url, cookies=cookies, meta={"playwright": True}, callback=self.parse)

    def parse(self, response):
        self.logger.info(response.text)

settings.py

PLAYWRIGHT_LAUNCH_OPTIONS = {
    "proxy": {
        'server': 'http://zproxy.lum-superproxy.io:22225',
        'username': 'lum-customer-user',
        'password': 'password'
    }
}

COOKIES_ENABLED = True

jayavinothmoorthy avatar Aug 09 '22 19:08 jayavinothmoorthy

Thanks for the report. I can reproduce this with the following spider and a mitmproxy instance running locally:

from scrapy import Request, Spider

class PlaywrightSpiderWithProxy(Spider):
    name = "proxy-spider"
    custom_settings = {
        "LOG_LEVEL": "INFO",
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            # "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "proxy": {
                # on a separate terminal:
                # ./mitmproxy --proxyauth "user:pass"
                "server": "http://127.0.0.1:8080",
                "username": "user",
                "password": "pass",
            },
        },
    }

    def start_requests(self):
        yield Request(
            url="http://httpbin.org/headers",
            meta={"playwright": True},
            cookies={"foo": "bar"},
        )

    def parse(self, response):
        print(response.request.headers["Cookie"])
        print(response.text)

The cookie is in the request headers; however, no "Cookie" header was received by the server:

b'foo=bar'
<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 
    "Accept-Encoding": "gzip, deflate", 
    "Accept-Language": "en", 
    "Cache-Control": "no-cache", 
    "Content-Length": "0", 
    "Host": "httpbin.org", 
    "Pragma": "no-cache", 
    "Proxy-Connection": "keep-alive", 
    "User-Agent": "Scrapy/2.6.0 (+https://scrapy.org)", 
    "X-Amzn-Trace-Id": "Root=1-62f39a25-2d7f27dc07329815654c8f19"
  }
}
</pre></body></html>

Without configuring the proxy:

b'foo=bar'
<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 
    "Accept-Encoding": "gzip, deflate", 
    "Accept-Language": "en", 
    "Cache-Control": "no-cache", 
    "Content-Length": "0", 
    "Cookie": "foo=bar", 
    "Host": "httpbin.org", 
    "Pragma": "no-cache", 
    "User-Agent": "Scrapy/2.6.0 (+https://scrapy.org)", 
    "X-Amzn-Trace-Id": "Root=1-62f39baa-6929799a29e23a171baf57f3"
  }
}
</pre></body></html>

The interesting thing is that at this point in the handler, overrides["headers"]["cookie"] is foo=bar in both cases. Printing the request headers in the callback also shows the expected value, as in the output I posted above. I'll need to investigate further to determine whether there's anything else the handler is doing that might cause this, or whether this is an upstream issue.

elacuesta avatar Aug 10 '22 11:08 elacuesta

This seems to be an upstream issue; I just opened https://github.com/microsoft/playwright/issues/16439 asking about it. There might be a way to work around this by setting cookies in the context before sending the request. However, I'm not sure, because cookies are set for the whole context and apply to multiple requests, which is not necessarily what we want here (clearing and repopulating the context cookies after each request seems like overkill). Let's wait and see what the Playwright team says.
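
For illustration, a rough sketch of what that workaround would look like at the plain Playwright level, reusing the local mitmproxy credentials from the reproduction above (the cookie name/value are placeholders, and I haven't verified that this actually keeps the cookie when going through the proxy):

import asyncio

from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(
            proxy={
                "server": "http://127.0.0.1:8080",
                "username": "user",
                "password": "pass",
            },
        )
        context = await browser.new_context()
        # Set the cookie on the context before navigating, instead of
        # relying on a per-request "Cookie" header
        await context.add_cookies(
            [{"name": "foo", "value": "bar", "url": "http://httpbin.org"}]
        )
        page = await context.new_page()
        await page.goto("http://httpbin.org/headers")
        print(await page.content())
        await browser.close()


asyncio.run(main())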

elacuesta avatar Aug 11 '22 03:08 elacuesta

@elacuesta I'm experiencing the same issue. What's the proper way to set cookies for the whole context?

blacksteel1288 avatar Feb 11 '24 21:02 blacksteel1288

To set cookies for a whole context at the Playwright level I'd say there are at least 3 ways:

  1. requesting the page object in a callback with playwright_include_page, accessing its context and calling BrowserContext.add_cookies on it
  2. specifying storage_state in the PLAYWRIGHT_CONTEXTS setting
  3. specifying storage_state in the playwright_context_kwargs request meta key

Examples for 2 & 3 can be found in the contexts.py file within the examples directory. There's also an example of accessing the context in a callback for (1) in these lines.
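
A minimal sketch of (1), assuming an async callback and the usual scrapy-playwright handler/reactor settings shown earlier in this thread (URL and cookie values are placeholders):

import scrapy


class ContextCookiesSpider(scrapy.Spider):
    name = "context_cookies"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.org",
            meta={"playwright": True, "playwright_include_page": True},
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        # add_cookies applies to the whole context, so later requests
        # going through the same context will also carry this cookie
        await page.context.add_cookies(
            [{"name": "foo", "value": "bar", "url": "https://example.org"}]
        )
        await page.close()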

To be clear, I don't know if these methods avoid the cookie being dropped when using proxies; please report back your findings if you can.
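
For completeness, a rough sketch of (2) and (3) as well. The cookie entries below follow Playwright's storage_state format (domain/path rather than a URL); the names, values and context names are placeholders, and again I haven't confirmed this sidesteps the proxy issue:

# (2) settings.py: storage_state for a named context
PLAYWRIGHT_CONTEXTS = {
    "default": {
        "storage_state": {
            "cookies": [
                {
                    "name": "foo",
                    "value": "bar",
                    "domain": "example.org",
                    "path": "/",
                    "expires": -1,
                    "httpOnly": False,
                    "secure": False,
                    "sameSite": "Lax",
                },
            ],
        },
    },
}

# (3) per request, inside start_requests; the kwargs are used when the
# context named in playwright_context does not exist yet
yield scrapy.Request(
    "https://example.org",
    meta={
        "playwright": True,
        "playwright_context": "another_context",
        "playwright_context_kwargs": {
            "storage_state": {
                "cookies": [
                    {
                        "name": "foo",
                        "value": "bar",
                        "domain": "example.org",
                        "path": "/",
                        "expires": -1,
                        "httpOnly": False,
                        "secure": False,
                        "sameSite": "Lax",
                    },
                ],
            },
        },
    },
)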

elacuesta avatar Feb 13 '24 18:02 elacuesta