
Cannot download binary file (PDF) with Chromium headless=new mode

Open tommylge opened this issue 1 year ago • 13 comments

I am facing an issue when using Chromium to download a PDF file: response.body contains the viewer plugin's HTML, not the PDF bytes.

There's already a related fix here: https://github.com/scrapy-plugins/scrapy-playwright/commit/0140b90381a0da92194661a0d13b7436661e80a0

It worked for a month, but not anymore; I'm still getting the issue :/

My code hasn't changed since your fix that worked.

The related issue: https://github.com/scrapy-plugins/scrapy-playwright/issues/184

tommylge avatar Nov 15 '23 09:11 tommylge

Please provide a minimal, reproducible example.

elacuesta avatar Nov 15 '23 12:11 elacuesta

import scrapy

class AwesomeSpider(scrapy.Spider):
    name = "test_dl"
    handle_httpstatus_list = [403]

    def start_requests(self):
        # GET request
        yield scrapy.Request(
            "https://defret.in/assets/certificates/attestation_secnumacademie.pdf",
            meta={
                "playwright": True,
                "playwright_page_goto_kwargs": {
                    "wait_until": "networkidle",
                },
            },
            callback=self.parse,
        )

    async def parse(self, response):
        print(response.body)

output:

<!DOCTYPE html><html><head></head><body style="height: 100%; width: 100%; overflow: hidden; margin:0px; background-color: rgb(38, 38, 38);"><embed name="4C80DFDA2738145655DE7937BDA51A0F" style="position:absolute; left: 0; top: 0;" width="100%" height="100%" src="about:blank" type="application/pdf" internalid="4C80DFDA2738145655DE7937BDA51A0F"></body></html>

instead of bytes

@elacuesta here is the minimal, reproducible example.

tommylge avatar Nov 15 '23 16:11 tommylge

Sorry, I cannot reproduce with scrapy-playwright 0.0.33 (3122f9cc8a32694fc2e7cbedc8511ca12e65d6a0).

import scrapy

class AwesomeSpider(scrapy.Spider):
    name = "test_dl"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        # "PLAYWRIGHT_BROWSER_TYPE": "firefox",  # same result with chromium and firefox
    }

    def start_requests(self):
        yield scrapy.Request(
            "https://defret.in/assets/certificates/attestation_secnumacademie.pdf",
            meta={
                "playwright": True,
                "playwright_page_goto_kwargs": {
                    "wait_until": "networkidle",
                },
            },
        )

    async def parse(self, response):
        print("Response body size:", len(response.body))
        print("First bytes:")
        print(response.body[:200])
output:

2023-11-16 13:46:09 [scrapy.core.engine] INFO: Spider opened
2023-11-16 13:46:09 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-11-16 13:46:09 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-11-16 13:46:09 [scrapy-playwright] INFO: Starting download handler
2023-11-16 13:46:14 [scrapy-playwright] INFO: Launching browser chromium
2023-11-16 13:46:14 [scrapy-playwright] INFO: Browser chromium launched
2023-11-16 13:46:14 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=False)
2023-11-16 13:46:14 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2023-11-16 13:46:14 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://defret.in/assets/certificates/attestation_secnumacademie.pdf> (resource type: document)
2023-11-16 13:46:14 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://defret.in/assets/certificates/attestation_secnumacademie.pdf>
2023-11-16 13:46:15 [scrapy-playwright] WARNING: Navigating to <GET https://defret.in/assets/certificates/attestation_secnumacademie.pdf> returned None, the response will have empty headers and status 200
2023-11-16 13:46:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://defret.in/assets/certificates/attestation_secnumacademie.pdf> (referer: None) ['playwright']
Response body size: 1868169
First bytes:
b"%PDF-1.3\n%\xe2\xe3\xcf\xd3\n9 0 obj\n<< /Type /Page /Parent 1 0 R /LastModified (D:20200619180943+02'00') /Resources 2 0 R /MediaBox [0.000000 0.000000 841.890000 595.276000] /CropBox [0.000000 0.000000 841.890000 "
2023-11-16 13:46:15 [scrapy.core.engine] INFO: Closing spider (finished)

$ scrapy version -v
Scrapy       : 2.11.0
lxml         : 4.9.3.0
libxml2      : 2.10.3
cssselect    : 1.2.0
parsel       : 1.8.1
w3lib        : 2.1.2
Twisted      : 22.10.0
Python       : 3.10.0 (default, Oct  8 2021, 09:55:22) [GCC 7.5.0]
pyOpenSSL    : 23.2.0 (OpenSSL 3.1.2 1 Aug 2023)
cryptography : 41.0.3
Platform     : Linux-5.15.0-79-generic-x86_64-with-glibc2.35

elacuesta avatar Nov 16 '23 16:11 elacuesta

Okay, thanks for your fast answer. Pretty strange though: I tried with many different versions and always get the issue. I guess I haven't debugged enough yet, so it seems like the problem doesn't come from scrapy-playwright.

Could you tell us your Playwright version, please? I'll keep you up to date.

tommylge avatar Nov 16 '23 17:11 tommylge

Could you tell us your playwright version please?

$ playwright --version               
Version 1.39.0

elacuesta avatar Nov 16 '23 19:11 elacuesta

@elacuesta We were able to narrow down the problem to two settings. First, using the new headless mode of Chrome, like this:

PLAYWRIGHT_LAUNCH_OPTIONS = {
      'args': [
          '--headless=new',
      ],
      'ignore_default_args': [
          '--headless',
      ],
}

Removing this alone doesn't fix the problem. We also had to roll back the Scrapy setting REQUEST_FINGERPRINTER_IMPLEMENTATION to its default value, 2.6: https://docs.scrapy.org/en/latest/topics/request-response.html#request-fingerprinter-implementation

Setting it to 2.7, which is recommended for new projects, makes the problem appear again, whether the new headless Chrome mode is enabled or not.
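For reference, a minimal settings.py sketch of the combination described above (the setting names are the real Scrapy/scrapy-playwright ones; the values are just the ones from our tests, not a recommendation):

```python
# settings.py (sketch): the combination that reproduces the issue for us
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"  # rolling back to "2.6" avoided it

PLAYWRIGHT_LAUNCH_OPTIONS = {
    "args": ["--headless=new"],  # opt in to Chromium's new headless mode
    "ignore_default_args": ["--headless"],  # drop the old flag Playwright adds by default
}
```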

kinoute avatar Nov 18 '23 17:11 kinoute

The REQUEST_FINGERPRINTER_IMPLEMENTATION setting is not relevant here, I tried several settings combinations and it did not change the results. The relevant part is the new Chromium headless mode, enabled as you mentioned:

PLAYWRIGHT_LAUNCH_OPTIONS = {
    'args': ['--headless=new'],
    'ignore_default_args': ['--headless'],
}

This looks like an upstream bug, the download event is not being fired with the new headless mode. I've opened an upstream Playwright issue (https://github.com/microsoft/playwright-python/issues/2169), although I suspect this is actually a Chromium issue.

elacuesta avatar Nov 18 '23 18:11 elacuesta

I just saw the update on your Playwright issue: do you think there's a chance you could integrate one of the posted workarounds into your plugin to handle this? There are also other workarounds in the linked issues.

kinoute avatar Nov 21 '23 12:11 kinoute

I will have to take a look to see whether the workaround applies in this case, as it was suggested well before the introduction of the new Chromium headless mode.

elacuesta avatar Nov 21 '23 13:11 elacuesta

Thanks for your help. For now, we detect the PDF viewer HTML when using Chromium and redirect the download to a non-Playwright spider.

We basically compare the content type declared in the response headers with the actual content type inferred from response.body. If the headers say application/pdf but the body looks like text/html, we redirect.
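A minimal sketch of that check (the helper name is ours, not the actual code; it assumes a real PDF body always starts with the `%PDF-` magic bytes, while the viewer page is HTML):

```python
def is_pdf_viewer_html(content_type: str, body: bytes) -> bool:
    """True if the headers claim a PDF but the body is Chromium's viewer HTML."""
    claims_pdf = "application/pdf" in content_type.lower()
    # Real PDFs start with the "%PDF-" magic bytes; the viewer page does not.
    return claims_pdf and not body.lstrip().startswith(b"%PDF-")
```

In the spider callback, when the check fires we re-yield the same URL with `"playwright": False` (and `dont_filter=True`, since the URL was already seen) so it goes through Scrapy's plain download handler instead.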

kinoute avatar Nov 23 '23 16:11 kinoute

I'm a bit hesitant to include the mentioned workaround in the main package for now, but I realized it's possible to implement it with the existing API through the playwright_page_init_callback meta key. Hope that helps.

import re
import scrapy


async def init_page(page, request):
    async def handle_pdf(route):
        response = await page.context.request.get(route.request)
        await route.fulfill(
            response=response,
            headers={**response.headers, "Content-Disposition": "attachment"},
        )

    await page.route(re.compile(r".*\.pdf"), handle_pdf)


class PdfSpider(scrapy.Spider):
    name = "pdf"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "args": ["--headless=new"],
            "ignore_default_args": ["--headless"],
        },
    }

    def start_requests(self):
        yield scrapy.Request(
            "https://defret.in/assets/certificates/attestation_secnumacademie.pdf",
            meta={
                "playwright": True,
                "playwright_page_init_callback": init_page,
            },
        )

    async def parse(self, response):
        print("Response body size:", len(response.body))
        print("First bytes:")
        print(response.body[:200])

elacuesta avatar Nov 28 '23 23:11 elacuesta

Thanks for the code snippet! Unfortunately, it will not work for URLs that don't end with ".pdf", such as those served with "?download=true" etc. We will try to figure something out and keep you updated.
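One possible variant of the snippet above (a sketch only, not verified against the new headless mode; the helper and handler names are hypothetical): route every request and decide from the fetched response headers instead of the URL suffix, at the cost of re-fetching each routed request through the API request context.

```python
def should_force_attachment(headers: dict) -> bool:
    """Decide from the fetched response headers whether this is really a PDF."""
    return "application/pdf" in headers.get("content-type", "").lower()


async def init_page(page, request):
    async def route_handler(route):
        # Fetch the resource ourselves so the real headers are visible,
        # regardless of what the URL looks like.
        response = await page.context.request.get(route.request)
        if should_force_attachment(response.headers):
            # Content-Disposition: attachment bypasses Chromium's PDF viewer.
            await route.fulfill(
                response=response,
                headers={**response.headers, "Content-Disposition": "attachment"},
            )
        else:
            await route.fulfill(response=response)

    # Match every request, not only URLs ending in ".pdf".
    await page.route("**/*", route_handler)
```

The trade-off is that every document request now goes through `page.context.request.get`, which duplicates network traffic for non-PDF pages.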

kinoute avatar Nov 29 '23 06:11 kinoute

Yes, that's exactly why I don't want to add the workaround to the main package :pensive:

elacuesta avatar Nov 29 '23 14:11 elacuesta