[BUG] Download PDF throws exception on some URLs
Context:
- Playwright Version: 1.25.2
- Operating System: Linux Ubuntu
- Python version: 3.8.10
- Browser: Firefox, Chromium
- Extra:
Code Snippet
#!/usr/bin/env python3
import asyncio
from playwright.async_api import async_playwright

tracing_enabled = True
tracing_filepath = "trace.zip"


async def handle_download(download):
    print("Found download for %s" % download.url)
    download_filepath = await download.path()
    print("Downloaded %s from %s" % (download_filepath, download.url))
    return


async def main():
    url1 = "https://www.mandiant.com/sites/default/files/2021-09/mandiant-apt1-report.pdf"
    url2 = "https://057info.hr/doc/o_kolacicima.pdf"
    url = url1
    async with async_playwright() as p:
        browser = await p.firefox.launch(headless=True)
        context = await browser.new_context(accept_downloads=True)
        if tracing_enabled:
            await context.tracing.start(screenshots=True,
                                        snapshots=True,
                                        sources=True)
        page = await context.new_page()
        page.on('download', handle_download)
        print("Visiting %s" % url)
        try:
            response = await page.goto(url, timeout=0)
        except Exception as e:
            print("Got exception %s" % e)
        await page.close()
        if tracing_enabled:
            await context.tracing.stop(path=tracing_filepath)
        await context.close()
        await browser.close()


if __name__ == '__main__':
    asyncio.run(main())
Describe the bug
When running the above code with Firefox, url2 downloads correctly, but url1 throws the following exception:
Visiting https://www.mandiant.com/sites/default/files/2021-09/mandiant-apt1-report.pdf
Found download for https://www.mandiant.com/sites/default/files/2021-09/mandiant-apt1-report.pdf
Exception in callback AsyncIOEventEmitter._emit_run.
The trace seems to indicate that download.path() times out for url1, which may be why the smaller PDF in url2 works. However, I do not know how to handle that timeout, since I am already passing a timeout of zero to goto.
The report is for Firefox, but Chromium fails with a similar exception. Chromium also throws an additional exception in goto, but that seems to be expected Chromium behavior according to https://github.com/microsoft/playwright-java/issues/863, and the download still starts if that first exception is caught. WebKit throws a different exception (Frame load interrupted) for both URLs, and the download event is not fired.
To give a little context: in my scenario I am given URLs that may point to either an HTML page or a PDF, and I need to download both. I cannot use 'async with page.expect_download()' since the URL may point directly to a PDF file.
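One possible explanation, not confirmed anywhere in this report, is that main() calls page.close() while handle_download is still awaiting download.path(), so a larger file would be interrupted while a small one happens to finish in time. The following is a minimal sketch of a workaround under that assumption: it keeps a reference to each download handler and awaits them before tearing the page down. DownloadWaiter and fetch are hypothetical names introduced here for illustration, and the sketch also assumes the download event has already fired by the time goto returns or raises, which the output above suggests but does not prove.

```python
import asyncio


class DownloadWaiter:
    """Tracks download handlers so they can be awaited before the page is closed.

    Hypothetical helper for illustration; it is not part of Playwright.
    """

    def __init__(self):
        self._tasks = []
        self.saved = []   # (url, path) pairs for downloads that completed
        self.failed = []  # (url, error message) pairs for downloads that raised

    def on_download(self, download):
        # Synchronous listener: schedule the async work and keep a reference to it.
        self._tasks.append(asyncio.create_task(self._save(download)))

    async def _save(self, download):
        try:
            # download.path() waits for the download to finish before returning.
            path = await download.path()
            self.saved.append((download.url, path))
        except Exception as exc:
            self.failed.append((download.url, str(exc)))

    async def wait(self):
        # Await every handler that has been scheduled so far.
        if self._tasks:
            await asyncio.gather(*self._tasks)


async def fetch(url):
    # Imported lazily so DownloadWaiter can be exercised without Playwright installed.
    from playwright.async_api import async_playwright

    waiter = DownloadWaiter()
    async with async_playwright() as p:
        browser = await p.firefox.launch(headless=True)
        context = await browser.new_context(accept_downloads=True)
        page = await context.new_page()
        page.on("download", waiter.on_download)
        try:
            await page.goto(url, timeout=0)
        except Exception as exc:
            # Navigating straight to a download may raise; the event can still fire.
            print("Got exception %s" % exc)
        # Assumption: letting pending downloads finish here avoids the failure.
        await waiter.wait()
        await page.close()
        await context.close()
        await browser.close()
    return waiter
```

Calling asyncio.run(fetch(url1)) and then inspecting the returned object's saved and failed lists would show whether waiting before page.close() changes the outcome for the larger PDF. For an HTML URL no download event fires, so wait() returns immediately and the page can be processed as usual.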
Thanks for your time. Let me know if you need any further information.