
[Bug]: Virtual scroll not capturing scrolled data

Open · Olliejp opened this issue 3 months ago · 1 comment

crawl4ai version

0.7

Expected Behavior

When using headless=False, I can see the browser correctly scrolling on LinkedIn and new update/post data being rendered. I expect this data to be appended to my results.html.

Current Behavior

I only get back a total of 10 posts, regardless of how far I scroll. The browser renders many more, but they are not reflected in my results.html.

Is this reproducible?

Yes

Inputs Causing the Bug

url: https://www.linkedin.com/company/hitachienergy/
container_selector="html, body",
scroll_count=50,
scroll_by=500,
wait_after_scroll=1

Steps to Reproduce

1. This code runs in a Jupyter notebook
2. Use the following code to reproduce the behaviour

Code snippets

import random
from crawl4ai import AsyncWebCrawler, CacheMode, VirtualScrollConfig
from crawl4ai.async_configs import CrawlerRunConfig, BrowserConfig
import asyncio

USER_AGENTS = [
    # Chrome on Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",

    # Firefox on Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:134.0) Gecko/20100101 Firefox/134.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0",

    # Edge on Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36 Edg/131.0.2903.86",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.2420.81",

    # Opera on Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36 OPR/116.0.0.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 OPR/109.0.0.0",

    # Chrome on macOS
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_7_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",

    # Firefox on macOS
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14.7; rv:134.0) Gecko/20100101 Firefox/134.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14.4; rv:124.0) Gecko/20100101 Firefox/124.0",

    # Safari on macOS
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_7_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4.1 Safari/605.1.15",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15",

    # Opera on macOS
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_7_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36 OPR/116.0.0.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 OPR/109.0.0.0",

    # Chrome on Linux
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",

    # Firefox on Linux
    "Mozilla/5.0 (X11; Linux x86_64; rv:134.0) Gecko/20100101 Firefox/134.0",
    "Mozilla/5.0 (X11; Linux i686; rv:124.0) Gecko/20100101 Firefox/124.0",
]

async def scrape_linkedin_page(url: str) -> str:

    user_agent = random.choice(USER_AGENTS)

    headers = {
        "accept-language": random.choice(["en-US,en-GB,en;q=0.9", "en-US,en;q=0.8", "en-GB,en;q=0.7"]),
        "accept-encoding": random.choice(["gzip, deflate, br, zstd", "gzip, deflate, zstd", "deflate, br"]),
        "referer": random.choice(["https://www.google.com/", "https://www.bing.com/"]),
        "cache-control": "no-cache",
        "connection": "keep-alive",
        "sec-ch-ua-arch": '"arm"',
        "sec-ch-ua-bitness": '"64"',
        "sec-ch-ua-mobile": "?0",
        "sec-fetch-dest": "document",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "cross-site",
        "sec-fetch-user": "?1",
        "user-agent": user_agent,
    }

    virtual_config = VirtualScrollConfig(
        container_selector="html, body",
        scroll_count=50,
        scroll_by=500,
        wait_after_scroll=1
    )

    browser_config = BrowserConfig(headless=False, verbose=False, headers=headers)
    run_config = CrawlerRunConfig(
        delay_before_return_html=0.2,
        # Note: excluded_tags expects HTML tag names (e.g. 'nav', 'footer');
        # strings like 'Cookie Policy' will not match any element.
        excluded_tags=['Cookie Policy', 'Privacy Policy'],
        page_timeout=12000,
        virtual_scroll_config=virtual_config,
        magic=True,
        remove_overlay_elements=True,
        cache_mode=CacheMode.DISABLED,
        verbose=False,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        results = await crawler.arun(url, config=run_config)

        return results.html

# Top-level await works here because this runs in a Jupyter notebook
x = await scrape_linkedin_page("https://www.linkedin.com/company/hitachienergy/")
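As a hedged side note (not part of the original report): besides VirtualScrollConfig, crawl4ai also documents a scan_full_page option on CrawlerRunConfig that scrolls the whole page and captures content rendered along the way, which may be worth comparing against the virtual-scroll path. The sketch below assumes the scan_full_page and scroll_delay parameter names from the crawl4ai 0.7 docs and is untested against LinkedIn, which may additionally require authentication before posts render:

```python
# Sketch of a full-page-scroll alternative to VirtualScrollConfig.
# Assumes crawl4ai 0.7's scan_full_page / scroll_delay options; selectors,
# timings, and LinkedIn behaviour are unverified assumptions.
import asyncio

def make_run_config():
    # Import deferred so the sketch can be inspected without crawl4ai installed.
    from crawl4ai import CacheMode
    from crawl4ai.async_configs import CrawlerRunConfig
    return CrawlerRunConfig(
        scan_full_page=True,   # scroll to the bottom of the page step by step
        scroll_delay=0.5,      # seconds to wait between scroll steps
        remove_overlay_elements=True,
        cache_mode=CacheMode.DISABLED,
    )

async def scrape_full_page(url: str) -> str:
    # Import deferred for the same reason as above.
    from crawl4ai import AsyncWebCrawler
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url, config=make_run_config())
        return result.html
```

If the full-page scroll captures all posts while VirtualScrollConfig does not, that would help narrow the bug to the virtual-scroll merge logic.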

OS

macOS

Python version

3.13

Browser

Chrome

Browser version

No response

Error logs & Screenshots (if applicable)

No response

Olliejp avatar Sep 24 '25 16:09 Olliejp

Hello @Olliejp, as I understand it, you want to get all the post contents for this company. I advise using JsonCssExtractionStrategy for a structured, reliable approach; it returns JSON output.

You can read more about it here.
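For illustration, a minimal sketch of what that could look like. The CSS selectors below are placeholders I have not verified against LinkedIn's actual markup, and extraction still depends on the posts being present in the rendered HTML:

```python
# Hypothetical JsonCssExtractionStrategy schema -- selectors are illustrative
# placeholders, not verified against LinkedIn's DOM.
POSTS_SCHEMA = {
    "name": "Company posts",
    "baseSelector": "div.feed-shared-update-v2",  # placeholder post container
    "fields": [
        {"name": "text", "selector": ".update-components-text", "type": "text"},
        {"name": "posted", "selector": "time", "type": "text"},
    ],
}

def build_strategy():
    # Import deferred so the schema itself can be inspected without crawl4ai.
    from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
    return JsonCssExtractionStrategy(POSTS_SCHEMA)

# Usage sketch: pass the strategy via CrawlerRunConfig(extraction_strategy=...)
# and parse the JSON string from result.extracted_content, e.g.
#   posts = json.loads(result.extracted_content)
```

One matched dict per baseSelector hit is returned, so once the scrolling issue is resolved, each captured post becomes one JSON object.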

Ahmed-Tawfik94 avatar Nov 05 '25 06:11 Ahmed-Tawfik94