[Bug]: Issue with `screenshot=True` — Capturing Screenshot Twice and Increasing Image Size
crawl4ai version
0.4.248
Expected Behavior
I would expect the screenshot to be captured only once, without any duplicated content. The image file size should also stay as small as possible while keeping an acceptable level of quality; if there is a way to tune the image size or resolution, that would be ideal.
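For reference, one way to keep the saved file small (separate from the duplication bug itself): since result.screenshot comes back as a base64-encoded PNG, it can be re-encoded as a JPEG before writing to disk. The sketch below is only an illustration using Pillow; the function name and quality value are arbitrary examples, not part of crawl4ai.

import base64
from io import BytesIO

from PIL import Image

def save_screenshot_as_jpeg(b64_png: str, out_path: str, quality: int = 70) -> None:
    # Decode the base64 screenshot, drop the alpha channel, and write a smaller JPEG.
    img = Image.open(BytesIO(base64.b64decode(b64_png))).convert("RGB")
    img.save(out_path, format="JPEG", quality=quality, optimize=True)

# e.g. save_screenshot_as_jpeg(result.screenshot, "about.jpg")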
Current Behavior
I’m facing an issue with screenshot=True: the page content is captured twice in the resulting screenshot, and the image file size grows significantly (around 60-70 MB in my case).
Image link for reference: Image Link
Has anyone else experienced this or know how to fix it?
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
Code snippets
from crawl4ai import AsyncWebCrawler, CacheMode, BrowserConfig, CrawlerRunConfig, \
    CrawlerMonitor, DisplayMode, RateLimiter
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
import asyncio
import base64

browser_conf = BrowserConfig(
    browser_type="chromium",
    headless=False,
    viewport_width=1820,
)

crawler_conf = CrawlerRunConfig(
    screenshot=True,
    pdf=True,
    scan_full_page=True,
    stream=True,  # Enable streaming mode
    screenshot_wait_for=2,
    wait_for_images=True,
    ignore_body_visibility=False,
    magic=True,
    word_count_threshold=200,
    markdown_generator=DefaultMarkdownGenerator(),
    cache_mode=CacheMode.DISABLED,  # Bypass the cache for every request
    verbose=True
)

dispatcher_conf = MemoryAdaptiveDispatcher(
    memory_threshold_percent=90.0,  # Pause if memory exceeds this
    check_interval=1.0,             # How often to check memory
    max_session_permit=5,           # Maximum concurrent tasks
    rate_limiter=RateLimiter(       # Optional rate limiting
        base_delay=(1.0, 2.0),
        max_delay=30.0,
        max_retries=2
    ),
    monitor=CrawlerMonitor(         # Optional monitoring
        max_visible_rows=15,
        display_mode=DisplayMode.DETAILED
    )
)

async def extract_data():
    print("\n--- Extracting Data ---")
    urls = ['https://www.emircom.com/about/']
    async with AsyncWebCrawler(config=browser_conf) as crawler:
        # Use async sleep instead of blocking time.sleep()
        await asyncio.sleep(5)
        # Process results as they complete
        async for result in await crawler.arun_many(urls=urls, config=crawler_conf, dispatcher=dispatcher_conf):
            if result.success:
                # Process each result immediately
                print(result.url, "crawled OK!")
                filename = result.url.split('/')[-2]
                print("Filename: ", filename)
                if result.html:
                    with open(rf"D:\Crawl4AI\emircom\data\html\{filename}.html", "w", encoding='utf-8') as ht:
                        ht.write(str(result.html))
                    print(f"HTML saved as {filename}.html")
                if result.markdown_v2:
                    with open(rf"D:\Crawl4AI\emircom\data\markdown\{filename}.md", "w", encoding='utf-8') as md:
                        md.write(str(result.markdown_v2.raw_markdown))
                    print(f"Markdown saved as {filename}.md")
                if result.extracted_content:
                    with open(rf"D:\Crawl4AI\emircom\data\json\{filename}.json", "w", encoding='utf-8') as ec:
                        ec.write(str(result.extracted_content))
                    print(f"Extracted Content saved as {filename}.json")
                if result.screenshot:
                    with open(rf"D:\Crawl4AI\emircom\data\screenshot\{filename}.png", "wb") as f:
                        f.write(base64.b64decode(result.screenshot))
                    print(f"Screenshot saved as {filename}.png")
                if result.pdf:
                    with open(rf"D:\Crawl4AI\emircom\data\pdf\{filename}.pdf", "wb") as f:
                        f.write(result.pdf)
                    print(f"PDF saved as {filename}.pdf")
            else:
                print("Failed:", result.url, "-", result.error_message)

async def main():
    await extract_data()

# Execute main function
if __name__ == "__main__":
    asyncio.run(main())
OS
Windows
Python version
3.10.10
Browser
Chrome
Browser version
132.0.6834.160
Error logs & Screenshots (if applicable)
No response
@sufianuddin We spoke on Discord regarding this. I'll check this out today.
I think the for loop inside AsyncPlaywrightCrawlerStrategy.take_screenshot_scroller needs to look like this:
for i in range(num_segments):
    y_offset = i * viewport_height
    # On the final segment, shrink the viewport to the residual height so the
    # bottom of the page is not captured a second time.
    if i == num_segments - 1:
        last_part_height = page_height % viewport_height
        if last_part_height == 0:
            break
        await page.set_viewport_size({"width": page_width, "height": last_part_height})
    await page.evaluate(f"window.scrollTo(0, {y_offset})")
    await asyncio.sleep(0.01)  # wait for render
    seg_shot = await page.screenshot(full_page=False)
    img = Image.open(BytesIO(seg_shot)).convert("RGB")
    segments.append(img)
# Restore the original viewport once all segments are captured.
await page.set_viewport_size({"width": page_width, "height": viewport_height})
The reason: if page_height = n * viewport_height + residual_height, the viewport height must be reduced to residual_height for the final segment. Otherwise the last await page.evaluate(f"window.scrollTo(0, {y_offset})") cannot actually reach that offset (y_offset + viewport_height exceeds page_height), so the browser clamps the scroll and the overlapping content is captured a second time in the stitched screenshot.
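For illustration, a minimal sketch of that arithmetic with made-up numbers (it assumes num_segments is the ceiling of page_height / viewport_height; the values are not taken from the report):

page_height = 2500       # example total scrollable height of the page
viewport_height = 1080   # example viewport height

num_segments = -(-page_height // viewport_height)   # ceiling division -> 3
last_part_height = page_height % viewport_height    # residual height  -> 340

# Without the patch, the final segment is still captured at the full
# viewport_height (1080 px) even though only 340 px of new content remain,
# so roughly 740 px of the previous segment appear twice in the stitched image.
print(num_segments, last_part_height)   # 3 340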
I can confirm this bug, and the proposed patch worked for me. I also used streaming and have crawled Angular websites.
I also had this issue and can confirm @EducationalPython 's fix worked for me. Thank you.
@griffin-ezbot Thanks for confirming. I'll try to apply a patch for this in the upcoming alpha release.
I’ve made the change, and it’ll be included in the upcoming alpha release after v0.5.
cc @aravindkarnam