crawl4ai
[Bug]: Virtual scroll not capturing scrolled data
crawl4ai version
0.7
Expected Behavior
With headless=False, I can see the browser scrolling correctly on LinkedIn, and new update/post data is being rendered. I expect this data to be appended to my results.html.
Current Behavior
Only 10 posts are ever returned, regardless of how far the page scrolls. The browser renders many more, but they are not reflected in my results.html.
Is this reproducible?
Yes
Inputs Causing the Bug
url: https://www.linkedin.com/company/hitachienergy/
container_selector="html, body",
scroll_count=50,
scroll_by=500,
wait_after_scroll=1
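For scale, some back-of-envelope arithmetic on these inputs (assuming scroll_by is in pixels and wait_after_scroll is in seconds; this is illustrative only, not crawl4ai internals):

```python
# Rough figures implied by the scroll settings above (assumptions:
# scroll_by is in pixels, wait_after_scroll is in seconds).
scroll_count = 50
scroll_by = 500
wait_after_scroll = 1

total_distance_px = scroll_count * scroll_by   # 25000 px scrolled in total
min_wait_s = scroll_count * wait_after_scroll  # at least 50 s spent waiting
```

So a full scroll pass covers ~25,000 px and spends at least 50 seconds in post-scroll waits alone.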
Steps to Reproduce
1. Run the code below in a Jupyter notebook (it relies on top-level await).
2. The code reproduces the behaviour against the URL above.
Code snippets
```python
import random
from crawl4ai import AsyncWebCrawler, CacheMode, VirtualScrollConfig
from crawl4ai.async_configs import CrawlerRunConfig, BrowserConfig
import asyncio

USER_AGENTS = [
    # Chrome on Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    # Firefox on Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:134.0) Gecko/20100101 Firefox/134.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0",
    # Edge on Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36 Edg/131.0.2903.86",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.2420.81",
    # Opera on Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36 OPR/116.0.0.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 OPR/109.0.0.0",
    # Chrome on macOS
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_7_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    # Firefox on macOS
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14.7; rv:134.0) Gecko/20100101 Firefox/134.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14.4; rv:124.0) Gecko/20100101 Firefox/124.0",
    # Safari on macOS
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_7_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4.1 Safari/605.1.15",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15",
    # Opera on macOS
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_7_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36 OPR/116.0.0.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 OPR/109.0.0.0",
    # Chrome on Linux
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    # Firefox on Linux
    "Mozilla/5.0 (X11; Linux x86_64; rv:134.0) Gecko/20100101 Firefox/134.0",
    "Mozilla/5.0 (X11; Linux i686; rv:124.0) Gecko/20100101 Firefox/124.0",
]


async def scrape_linkedin_page(url: str) -> str:
    user_agent = random.choice(USER_AGENTS)
    headers = {
        "accept-language": random.choice(["en-US,en-GB,en;q=0.9", "en-US,en;q=0.8", "en-GB,en;q=0.7"]),
        "accept-encoding": random.choice(["gzip, deflate, br, zstd", "gzip, deflate, zstd", "deflate, br"]),
        "referer": random.choice(["https://www.google.com/", "https://www.bing.com/"]),
        "cache-control": "no-cache",
        "connection": "keep-alive",
        "sec-ch-ua-arch": '"arm"',
        "sec-ch-ua-bitness": '"64"',
        "sec-ch-ua-mobile": "?0",
        "sec-fetch-dest": "document",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "cross-site",
        "sec-fetch-user": "?1",
        "user-agent": user_agent,
    }
    virtual_config = VirtualScrollConfig(
        container_selector="html, body",
        scroll_count=50,
        scroll_by=500,
        wait_after_scroll=1,
    )
    browser_config = BrowserConfig(headless=False, verbose=False, headers=headers)
    run_config = CrawlerRunConfig(
        delay_before_return_html=0.2,
        excluded_tags=['Cookie Policy', 'Privacy Policy'],
        page_timeout=12000,
        virtual_scroll_config=virtual_config,
        magic=True,
        remove_overlay_elements=True,
        cache_mode=CacheMode.DISABLED,
        verbose=False,
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        results = await crawler.arun(url, config=run_config)
        return results.html


x = await scrape_linkedin_page("https://www.linkedin.com/company/hitachienergy/")
```
OS
macOS
Python version
3.13
Browser
Chrome
Browser version
No response
Error logs & Screenshots (if applicable)
No response
Hello @Olliejp, from my understanding, you want to get all the post contents for this company. I'd advise using JsonCssExtractionStrategy for a structured, reliable approach; it returns JSON output.
You can read more about it here.
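A minimal sketch of that approach. The schema shape (name, baseSelector, fields) follows the crawl4ai docs, but the CSS selectors below are placeholder assumptions, not verified LinkedIn selectors; inspect the live page to find the real ones:

```python
# Sketch of a JsonCssExtractionStrategy schema. The selectors are
# illustrative placeholders -- inspect the rendered page to find real ones.
schema = {
    "name": "Company posts",
    "baseSelector": "div.feed-shared-update-v2",  # assumed per-post container
    "fields": [
        {"name": "text", "selector": "span.break-words", "type": "text"},
        {"name": "posted", "selector": "span.update-time", "type": "text"},
    ],
}

# Usage sketch (pass the strategy through CrawlerRunConfig):
# from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
# strategy = JsonCssExtractionStrategy(schema)
# run_config = CrawlerRunConfig(extraction_strategy=strategy, ...)
# result = await crawler.arun(url, config=run_config)
# posts = json.loads(result.extracted_content)
```

Each matched baseSelector element becomes one JSON object, so you get one record per post instead of scraping them out of raw HTML.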