
[Bug]: When extracting data with scroll_full_page, only the final elements get parsed

Open · Popeyef5 opened this issue 9 months ago · 4 comments

crawl4ai version

0.4.248

Expected Behavior

I'm crawling Twitter, specifically the "following" section of a profile. I have a CSS selector for the relevant data (users' names and bios) and set up a JsonCssExtractionStrategy. If I don't use scroll_full_page, I understandably expect to get only the first N user profiles. But if I do enable scroll_full_page, I expect the returned data to contain the full list, to the same extent that is visible when browsing manually.

Current Behavior

When not using scroll_full_page, I do get the first 16 profiles in this case. However, when setting scroll_full_page, I only get the LAST 12. It's important to note that there are over 40 profiles listed, so none of the 12 profiles intersect with the first 16. I checked the result's html property and it indeed only contains information about the last 12. Strangely, however, the saved screenshot contains all the profiles.

Is this reproducible?

Yes

Inputs Causing the Bug

https://x.com/SomeTwitterProfile/following

Steps to Reproduce

Execute the following snippet with scroll_full_page (scan_full_page in the snippet) both on and off.

Code snippets

site_url = "https://x.com/elonmusk/following"

import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode, BrowserConfig, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
import json


async def main():
    schema = {
        "name": "Followers",
        "baseSelector": "button[data-testid='UserCell']",
        "fields": [
            {
                "name": "name",
                "selector": "span",
                "type": "text",
            },
            {
                "name": "handle",
                "selector": 'a[role="link"] > div > div[dir="ltr"]:only-child > span',
                "type": "text",
            },
            {
                "name": "bio",
                "selector": "div[dir='auto'] > span",
                "type": "html",
            },
        ],
    }
    
    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
    
    browser_conf = BrowserConfig(
        extra_args=['--disable-web-security'],
        cookies=[
            {"name": "auth_token", "value": "YOURAUTHTOKEN", "domain": ".x.com", "path": "/"},
        ],
    )
    
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        screenshot=True,
        wait_for="css:button[data-testid='UserCell']:nth-child(1)",
        page_timeout=5000,
        extraction_strategy=extraction_strategy,
        # scan_full_page=True,
        scroll_delay=1.5,
    )

    try:
        async with AsyncWebCrawler(
            config=browser_conf,
            verbose=True,
        ) as crawler:
            result = await crawler.arun(
                url=site_url,
                config=crawler_config,
            )
                
            from base64 import b64decode
            with open("screenshot.png", "wb") as f:
                f.write(b64decode(result.screenshot))

            with open("index.html", "w") as f:
                f.write(result.html)

            data = json.loads(result.extracted_content)
            print(f"Extracted {len(data)} users")
            print(json.dumps(data, indent=2) if data else "No data found")
                    
    except Exception as e:
        print(f"Something happened: {e}")
        raise e

if __name__ == "__main__":
    asyncio.run(main())

OS

Linux

Python version

3.12.9

Browser

Default

Browser version

No response

Error logs & Screenshots (if applicable)

No response

Popeyef5 · Feb 20 '25

@Popeyef5 I'm trying to troubleshoot this. I'm actually running into some weird CORS errors, even with --disable-web-security passed as extra_args. I'm not sure how you got past that. Anyway, we are trying to figure out a solution. I'll keep you updated.

@unclecode This was the issue I referred to in our browser profiles chat.

aravindkarnam · Mar 04 '25

@Popeyef5 I will be checking your code later, but based on experience, what you're encountering is likely the result of a common technique used by infinite-scrolling websites.

Many modern websites optimize memory usage by keeping only a fixed number of UI elements in the DOM at any time, regardless of how much data exists. This prevents excessive memory consumption.

For example:
  • If a site has thousands of posts, but a user only sees 10 at a time, the page may only render 30 elements (10 currently visible + 10 before + 10 after).
  • As you scroll, these same 30 elements get reused, dynamically replacing old content with new data.

How to Handle It? Since scrolling alone won't load all items into the DOM, you need to (a minimal sketch follows this list):
  1/ Extract the initial HTML using the JSON extraction strategy.
  2/ Execute JavaScript to scroll and trigger the next batch of items.
  3/ Extract the HTML again, accumulating new data at each iteration.
  4/ Repeat the process until you reach the end.
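
A minimal sketch of that loop, reusing the session_id / js_only / js_code pattern that appears later in this thread. The "scroll-session" name, the de-duplication key, the 2-second delay, and the max_rounds cap are illustrative assumptions, and it assumes the extraction strategy runs on every arun call, including js_only ones:

import json
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def scroll_and_accumulate(url, schema, max_rounds=20):
    # Hypothetical helper (not part of crawl4ai): scroll one browser session
    # step by step and accumulate the extracted rows across iterations.
    extraction = JsonCssExtractionStrategy(schema)
    seen, rows = set(), []
    async with AsyncWebCrawler() as crawler:  # pass BrowserConfig/cookies as needed
        first = CrawlerRunConfig(
            session_id="scroll-session",
            cache_mode=CacheMode.BYPASS,
            extraction_strategy=extraction,
        )
        result = await crawler.arun(url=url, config=first)
        for _ in range(max_rounds):
            # Merge this batch, skipping rows already collected.
            new_rows = 0
            for item in json.loads(result.extracted_content or "[]"):
                key = json.dumps(item, sort_keys=True)
                if key not in seen:
                    seen.add(key)
                    rows.append(item)
                    new_rows += 1
            if new_rows == 0 and rows:
                break  # nothing new appeared after the last scroll
            # Scroll further on the same page, then extract again.
            step = CrawlerRunConfig(
                session_id="scroll-session",
                js_code="window.scrollTo(0, document.body.scrollHeight);",
                js_only=True,
                delay_before_return_html=2,
                extraction_strategy=extraction,
            )
            result = await crawler.arun(url=url, config=step)
    return rows

Something like asyncio.run(scroll_and_accumulate(site_url, schema)) with the schema from the original snippet should then return the accumulated profiles, though selectors and timings will need tuning per site.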

Btw I’m planning to introduce built-in scroll-based extraction techniques in Crawl4AI, which will:

  • Extract while scrolling, rather than jumping too far.
  • Not just scroll, but accumulate HTML progressively.

For now, you can verify this behavior using Chrome DevTools—you’ll likely see a fixed number of elements in the DOM that update dynamically as you scroll.
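
If you prefer to check the same thing from a crawl result rather than DevTools, here is a rough illustration using BeautifulSoup (not part of crawl4ai; result is the object returned by arun in the snippet above):

from bs4 import BeautifulSoup

# Count how many UserCell nodes are present in the HTML crawl4ai returned.
# If the site virtualizes its list, this number stays roughly constant no
# matter how far the page was scrolled before capture.
soup = BeautifulSoup(result.html, "html.parser")
cells = soup.select("button[data-testid='UserCell']")
print(f"{len(cells)} UserCell elements in the captured DOM")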

Hope this helps!

unclecode · Mar 04 '25

Any update on this? @unclecode

dchang10 · Apr 29 '25

Hi @dchang10, we're currently working on making infinite scrolling a built-in feature in Crawl4AI. Hopefully, it'll be available by the end of this quarter.

ntohidi · May 08 '25

Hi, I had the same problem and did the following: by counting the elements with the target class after each scroll, you keep scrolling until everything has been loaded.

# Requires: from bs4 import BeautifulSoup
# Assumes the crawler session is already open, the first `result` was fetched
# with session_id=SESSION_ID, and `target_class_selector`, `last_class_count`,
# and `iteration` have been initialized.
while True:
    print(f"\n--- Iteration {iteration} ---")

    # --- Parse the HTML and count the target elements ---
    soup = BeautifulSoup(str(result.html), 'html.parser')
    # find_all() returns a list of all matching elements
    elements = soup.find_all('a', target_class_selector)
    current_class_count = len(elements)

    print(f"Found {current_class_count} elements with selector '{target_class_selector}'.")

    # --- Stop once the count has stabilized ---
    if current_class_count == last_class_count:
        print("Element count has stabilized. Reached the end of the scroll.")
        break

    # Update the count for the next iteration
    last_class_count = current_class_count

    # --- Scroll down to load more content ---
    print("Scrolling down...")
    scroll_config = CrawlerRunConfig(
        session_id=SESSION_ID,
        js_code="window.scrollTo(0, document.body.scrollHeight-100);",
        js_only=True,  # Important: operate on the same page
        delay_before_return_html=5,  # Wait 5s for new content. Adjust as needed.
    )

    result = await crawler.arun(url, config=scroll_config)
    iteration += 1

Then I used a new configuration, as below:

crawler_config = CrawlerRunConfig(
    session_id=SESSION_ID,
    cache_mode=CacheMode.BYPASS,
    extraction_strategy=extraction_strategy,
    scan_full_page=True,
)

where I enforced scan_full_page for the final extraction pass.
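
For reference, a rough sketch of how that final configuration could be invoked once the loop above has finished (same crawler, url, and SESSION_ID as before; whether it captures every row still depends on how aggressively the site recycles DOM nodes, which is exactly the bug discussed in this issue):

# Hypothetical final pass on the same session after the scroll loop ends;
# json is imported at the top of the script.
final_result = await crawler.arun(url, config=crawler_config)
data = json.loads(final_result.extracted_content or "[]")
print(f"Extracted {len(data)} users after scrolling")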

Let me know if this works out.

gnikoloudis · Jun 17 '25

@unclecode @ntohidi Hi all, just wanted to follow up on this: is this feature still planned for this quarter? It would be very helpful!

mkaesler44 · Jun 27 '25