[BUG] Download PDF throws exception on some URLs

Open malicialab opened this issue 3 years ago • 0 comments

Context:

Playwright Version: 1.25.2
Operating System: Linux Ubuntu
Python version: 3.8.10
Browser: Firefox, Chromium
Extra:

Code Snippet

#!/usr/bin/env python3

import asyncio
from playwright.async_api import async_playwright

tracing_enabled = True
tracing_filepath = "trace.zip"

async def handle_download(download):
    print("Found download for %s" % download.url)
    download_filepath = await download.path()
    print("Downloaded %s from %s" % (download_filepath, download.url))
    return

async def main():
    url1 = "https://www.mandiant.com/sites/default/files/2021-09/mandiant-apt1-report.pdf"
    url2="https://057info.hr/doc/o_kolacicima.pdf"
    url = url1
    async with async_playwright() as p:
        browser = await p.firefox.launch(headless=True)
        context = await browser.new_context(accept_downloads=True)
        if tracing_enabled:
            await context.tracing.start(screenshots=True,
                                        snapshots=True,
                                        sources=True)

        page = await context.new_page()
        page.on('download', handle_download)

        print("Visiting %s" % url)
        try:
            response = await page.goto(url, timeout=0)
        except Exception as e:
            print("Got exception %s" % e)

        await page.close()
        if tracing_enabled:
            await context.tracing.stop(path = tracing_filepath)
        await context.close()
        await browser.close()

if __name__ == '__main__':
    asyncio.run(main())

Describe the bug

When running the above code with Firefox, url2 downloads correctly, but url1 throws the following exception:

Visiting https://www.mandiant.com/sites/default/files/2021-09/mandiant-apt1-report.pdf
Found download for https://www.mandiant.com/sites/default/files/2021-09/mandiant-apt1-report.pdf
Exception in callback AsyncIOEventEmitter._emit_run.<locals>._callback(<Task finishe...ot NoneType')>) at /home/USER/python-virtual-environments/async/lib/python3.8/site-packages/pyee/_asyncio.py:55
handle: <Handle AsyncIOEventEmitter._emit_run.<locals>._callback(<Task finishe...ot NoneType')>) at /home/USER/python-virtual-environments/async/lib/python3.8/site-packages/pyee/_asyncio.py:55>
Traceback (most recent call last):
  File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
  File "/home/user/python-virtual-environments/async/lib/python3.8/site-packages/pyee/_asyncio.py", line 62, in _callback
    self.emit('error', exc)
  File "/home/user/python-virtual-environments/async/lib/python3.8/site-packages/pyee/_base.py", line 116, in emit
    self._emit_handle_potential_error(event, args[0] if args else None)
  File "/home/USER/python-virtual-environments/async/lib/python3.8/site-packages/pyee/_base.py", line 86, in _emit_handle_potential_error
    raise error
  File "./test.py", line 11, in handle_download
    download_filepath = await download.path()
  File "/home/USER/python-virtual-environments/async/lib/python3.8/site-packages/playwright/async_api/_generated.py", line 5640, in path
    return mapping.from_maybe_impl(await self._impl_obj.path())
  File "/home/USER/python-virtual-environments/async/lib/python3.8/site-packages/playwright/_impl/_download.py", line 58, in path
    return await self._artifact.path_after_finished()
  File "/home/USER/python-virtual-environments/async/lib/python3.8/site-packages/playwright/_impl/_artifact.py", line 36, in path_after_finished
    return pathlib.Path(await self._channel.send("pathAfterFinished"))
  File "/usr/lib/python3.8/pathlib.py", line 1042, in __new__
    self = cls._from_parts(args, init=False)
  File "/usr/lib/python3.8/pathlib.py", line 683, in _from_parts
    drv, root, parts = self._parse_args(args)
  File "/usr/lib/python3.8/pathlib.py", line 667, in _parse_args
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType

The trace seems to indicate that download.path() times out for url1, which may be why the smaller PDF in url2 works. However, I do not know how to handle such timeouts (I am already passing a timeout of zero to goto).
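
One unconfirmed possibility is that the script closes the page while handle_download is still awaiting download.path(), because nothing in main() waits for the handler to finish. The sketch below is a hypothetical helper, not part of Playwright: the DownloadTracker class, its method names, and the result tuple layout are all assumptions made for illustration. It relies only on download.url, download.failure() and download.path() from the Playwright API, and lets the caller wait for all pending handlers, with an explicit timeout, before closing anything.

```python
import asyncio


class DownloadTracker:
    """Hypothetical helper: remembers download handlers so they can be awaited."""

    def __init__(self):
        self.tasks = []     # one asyncio task per download event
        self.results = []   # tuples of (url, path_or_None, error_or_None)

    def on_download(self, download):
        # Synchronous listener: schedule the async work and keep the task.
        self.tasks.append(asyncio.ensure_future(self._handle(download)))

    async def _handle(self, download):
        # failure() waits for the download to end; None means it succeeded.
        error = await download.failure()
        if error is not None:
            self.results.append((download.url, None, error))
            return
        path = await download.path()
        self.results.append((download.url, path, None))

    async def wait_all(self, timeout=None):
        # Wait for every scheduled handler; raises asyncio.TimeoutError
        # if they are not all done within `timeout` seconds.
        if self.tasks:
            await asyncio.wait_for(asyncio.gather(*self.tasks), timeout)
        return self.results
```

In the script above this would be wired up as page.on('download', tracker.on_download), followed by await tracker.wait_all(timeout=120) before page.close(). The 120 seconds is an arbitrary value. Whether this removes the exception in the reported setup has not been verified.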

The report is for Firefox, but Chromium raises a similar exception. (Chromium also throws an additional exception in goto, but that seems to be expected Chromium behavior according to https://github.com/microsoft/playwright-java/issues/863, and the download still starts if that first exception is caught.) WebKit throws a different exception (Frame load interrupted) for both URLs, and the download event is not fired.

To give a little bit of context: in my scenario I am given URLs that may point to either an HTML page or a PDF, and I need to download both. I cannot use 'async with page.expect_download()' because I do not know in advance whether a given URL points directly to a PDF file.
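
A minimal sketch of one way to cover both cases without expect_download(), assuming the download event has already fired by the time goto returns or raises (this is not guaranteed by the report). The function name save_page_or_download, the out_dir parameter, and the page.html fallback filename are hypothetical. The page argument is expected to behave like a Playwright Page; the sketch uses page.on, page.goto, page.content, download.suggested_filename and download.save_as.

```python
from pathlib import Path


async def save_page_or_download(page, url, out_dir):
    """Save `url` as a downloaded file if it triggers a download, else as HTML."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    downloads = []
    # Plain synchronous listener: only records the download objects.
    page.on("download", downloads.append)

    try:
        await page.goto(url, timeout=0)
    except Exception:
        # Navigating straight to a file may raise; that is only treated
        # as an error here if no download was started.
        if not downloads:
            raise

    saved = []
    for download in downloads:
        target = out_dir / download.suggested_filename
        # save_as waits for the download to finish before copying the file.
        await download.save_as(target)
        saved.append(target)

    if not saved:
        # No download was triggered, so treat the URL as a regular page.
        target = out_dir / "page.html"
        target.write_text(await page.content(), encoding="utf-8")
        saved.append(target)
    return saved
```

Because the files are copied with save_as before the function returns, the caller can close the page and context afterwards without losing them.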

Thanks for your time. Let me know if you need further info.

Sep 20 '22 08:09 malicialab