
[Bug]: Issue with `screenshot=True` — Capturing Screenshot Twice and Increasing Image Size

Open sufianuddin opened this issue 10 months ago • 5 comments

crawl4ai version

0.4.248

Expected Behavior

Ideally, I would expect the function to capture the screenshot only once, with no duplicated content. The image file should also stay as small as possible at an acceptable quality; if there is a way to tune the image size or resolution, that would be ideal.
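
For the size part alone (not the duplication), one workaround I can sketch is re-encoding the decoded PNG as JPEG with Pillow. The helper name and quality value below are illustrative assumptions, not part of the crawl4ai API:

import base64
from io import BytesIO
from PIL import Image

def save_compressed_screenshot(b64_png: str, out_path: str, quality: int = 80) -> None:
    # Decode the base64 screenshot, flatten to RGB, and re-encode as JPEG,
    # which is usually far smaller than a stitched full-page PNG.
    img = Image.open(BytesIO(base64.b64decode(b64_png))).convert("RGB")
    img.save(out_path, format="JPEG", quality=quality, optimize=True)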

Current Behavior

I'm facing an issue with the screenshot=True option. It always captures the screenshot content twice, and the image size grows dramatically (around 60-70 MB in my case).

Image link for reference: Image Link

Has anyone else experienced this or know how to fix it?

Is this reproducible?

Yes

Inputs Causing the Bug


Steps to Reproduce


Code snippets

from crawl4ai import AsyncWebCrawler, CacheMode, BrowserConfig, CrawlerRunConfig, \
    CrawlerMonitor, DisplayMode, RateLimiter
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
import asyncio
import base64

browser_conf = BrowserConfig(
    browser_type="chromium",
    headless=False,
    viewport_width=1820,
)

crawler_conf = CrawlerRunConfig(
    screenshot=True,
    pdf=True,
    scan_full_page=True,
    stream=True,  # Enable streaming mode
    screenshot_wait_for=2,
    wait_for_images=True,
    ignore_body_visibility=False,
    magic=True,
    word_count_threshold=200,
    markdown_generator=DefaultMarkdownGenerator(),
    cache_mode=CacheMode.DISABLED,  # Disable caching so every run fetches fresh content
    verbose=True
)

dispatcher_conf = MemoryAdaptiveDispatcher(
    memory_threshold_percent=90.0,  # Pause if memory exceeds this
    check_interval=1.0,             # How often to check memory
    max_session_permit=5,          # Maximum concurrent tasks
    rate_limiter=RateLimiter(       # Optional rate limiting
        base_delay=(1.0, 2.0),
        max_delay=30.0,
        max_retries=2
    ),
    monitor=CrawlerMonitor(         # Optional monitoring
        max_visible_rows=15,
        display_mode=DisplayMode.DETAILED
    )
)


async def extract_data():
    print(f"\n--- Extracting Data ---")

    urls = ['https://www.emircom.com/about/']
    async with AsyncWebCrawler(config=browser_conf) as crawler:
        # Use async sleep instead of blocking time.sleep()
        await asyncio.sleep(5)
        # Process results as they complete
        async for result in await crawler.arun_many(urls=urls, config=crawler_conf, dispatcher=dispatcher_conf):
            if result.success:
                # Process each result immediately
                print(result.url, "crawled OK!")
                filename = result.url.split('/')[-2]
                print("Filename: ", filename)

                if result.html:
                    # Raw strings avoid invalid "\C"-style escape sequences in Windows paths
                    with open(rf"D:\Crawl4AI\emircom\data\html\{filename}.html", "w", encoding='utf-8') as ht:
                        ht.write(str(result.html))
                        print(f"HTML saved as {filename}.html")

                if result.markdown_v2:
                    with open(rf"D:\Crawl4AI\emircom\data\markdown\{filename}.md", "w", encoding='utf-8') as md:
                        md.write(str(result.markdown_v2.raw_markdown))
                        print(f"Markdown saved as {filename}.md")

                if result.extracted_content:
                    with open(rf"D:\Crawl4AI\emircom\data\json\{filename}.json", "w", encoding='utf-8') as ec:
                        ec.write(str(result.extracted_content))
                        print(f"Extracted content saved as {filename}.json")

                if result.screenshot:
                    # result.screenshot is base64-encoded, so decode before writing bytes
                    with open(rf"D:\Crawl4AI\emircom\data\screenshot\{filename}.png", "wb") as f:
                        f.write(base64.b64decode(result.screenshot))
                        print(f"Screenshot saved as {filename}.png")

                if result.pdf:
                    with open(rf"D:\Crawl4AI\emircom\data\pdf\{filename}.pdf", "wb") as f:
                        f.write(result.pdf)
                        print(f"PDF saved as {filename}.pdf")

            else:
                print("Failed:", result.url, "-", result.error_message)


async def main():
    await extract_data()

# Execute main function
if __name__ == "__main__":
    asyncio.run(main())

OS

Windows

Python version

3.10.10

Browser

Chrome

Browser version

132.0.6834.160

Error logs & Screenshots (if applicable)

No response

sufianuddin avatar Feb 04 '25 14:02 sufianuddin

@sufianuddin We spoke on discord regarding this. I'll check this out today.

aravindkarnam avatar Feb 05 '25 07:02 aravindkarnam

I think the for loop inside AsyncPlaywrightCrawlerStrategy.take_screenshot_scroller needs to look like this:

            for i in range(num_segments):
                y_offset = i * viewport_height
                if i == num_segments - 1:
                    # On the final segment, shrink the viewport to the leftover
                    # height so we don't re-capture rows from the previous segment
                    last_part_height = page_height % viewport_height
                    if last_part_height == 0:
                        break
                    await page.set_viewport_size({"width": page_width, "height": last_part_height})
                await page.evaluate(f"window.scrollTo(0, {y_offset})")
                await asyncio.sleep(0.01)  # wait for render
                seg_shot = await page.screenshot(full_page=False)
                img = Image.open(BytesIO(seg_shot)).convert("RGB")
                segments.append(img)
            # Restore the original viewport after the loop
            await page.set_viewport_size({"width": page_width, "height": viewport_height})

The reason: if page_height = n * viewport_height + residual_height, the viewport has to be shrunk to residual_height for the final segment. Otherwise window.scrollTo(0, y_offset) gets clamped by the browser (y_offset + viewport_height would exceed page_height), so the last screenshot re-captures rows that already appear in the previous segment, duplicating content in the stitched image.
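
To make the arithmetic concrete, here is a toy sketch with made-up numbers (a 2,500 px page and a 1,080 px viewport are illustrative assumptions, not values from the library):

# Illustrative numbers only, not taken from crawl4ai.
page_height, viewport_height = 2500, 1080

num_segments = -(-page_height // viewport_height)  # ceiling division -> 3 segments
residual = page_height % viewport_height           # 2500 - 2 * 1080 = 340 px

for i in range(num_segments):
    y_offset = i * viewport_height
    # The last segment has only `residual` pixels left; scrolling a full-height
    # viewport to offset 2160 would be clamped to 1420 (page_height - viewport_height)
    # and re-capture rows already present in the previous segment.
    height = residual if (i == num_segments - 1 and residual) else viewport_height
    print(f"segment {i}: scroll to y={y_offset}, capture {height}px")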

EducationalPython avatar Feb 06 '25 11:02 EducationalPython

I can confirm this bug, and the proposed patch worked for me. I was also using streaming mode, crawling Angular websites.

Tauvic avatar Feb 09 '25 09:02 Tauvic

I also had this issue and can confirm @EducationalPython 's fix worked for me. Thank you.

griffin-ezbot avatar Mar 04 '25 02:03 griffin-ezbot

@griffin-ezbot Thanks for confirming. I'll try to apply a patch for this in the upcoming alpha release.

aravindkarnam avatar Mar 04 '25 07:03 aravindkarnam

I’ve made the change, and it’ll be included in the upcoming alpha release after v0.5.

cc @aravindkarnam

ntohidi avatar Apr 17 '25 10:04 ntohidi