
[Bug]: cannot crawl all content while scan_full_page=True

Open duanyr opened this issue 7 months ago • 1 comment

crawl4ai version

0.5.0.post8

Expected Behavior

Get the content of every note-item while scrolling down to the last page.

Current Behavior

Only the note-item content rendered on the last page is obtained.

Is this reproducible?

Yes

Inputs Causing the Bug

URL: https://www.xiaohongshu.com/search_result/?keyword=%25E6%2597%25A0%25E9%2594%25A1%25E6%25B1%2582%25E7%25A7%259F&source=unknown&type=50

Hello, when I scroll through the list page of the site I want to crawl, the number of .note-item elements inside the .feeds-container stays fixed. During scrolling, the data-index attribute shown in the screenshot keeps increasing, but the count of .note-item nodes in .feeds-container does not change — the list appears to be virtualized, recycling DOM nodes as new content loads. In that situation, how can I capture the content of every .note-item while scrolling?
When I run my code it automatically scrolls all the way to the last page, but in the end it only extracts the .note-item content from the last page. The items from earlier pages are no longer present in the .feeds-container, so I cannot retrieve them.
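
A workaround I have been sketching (untested): turn off scan_full_page, drive the scrolling from js_code myself, and clone each .note-item into an extra hidden element that also carries the feeds-container class, so the extraction schema in the snippet below still matches the clones after the virtualized list recycles the originals. The sink element, the data-index dedup key, and the scroll-step cap are my own assumptions, and the items still visible on the last page would appear twice in the output and need deduplication afterwards.

COLLECT_WHILE_SCROLLING_JS = """
(async () => {
    const seen = new Set();
    // Hidden sink that still matches the ".feeds-container .note-item"
    // baseSelector used by the extraction schema.
    const sink = document.createElement('div');
    sink.className = 'feeds-container';
    sink.style.display = 'none';
    document.body.appendChild(sink);

    const collect = () => {
        document.querySelectorAll('.feeds-container .note-item').forEach(item => {
            // Assumes every card carries a data-index attribute.
            const key = item.getAttribute('data-index');
            if (key !== null && !seen.has(key)) {
                seen.add(key);
                sink.appendChild(item.cloneNode(true));
            }
        });
    };

    let lastHeight = 0;
    for (let i = 0; i < 50; i++) {               // hard cap on scroll steps
        collect();
        window.scrollTo(0, document.body.scrollHeight);
        await new Promise(r => setTimeout(r, 800));
        if (document.body.scrollHeight === lastHeight) break;  // nothing new loaded
        lastHeight = document.body.scrollHeight;
    }
    collect();
})();
"""

# Would replace the scan_full_page settings in the CrawlerRunConfig below:
#   scan_full_page=False,
#   js_code=COLLECT_WHILE_SCROLLING_JS,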

Steps to Reproduce


Code snippets

import asyncio
import json

from playwright.async_api import Page, BrowserContext

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Search URL from the "Inputs Causing the Bug" section above
BASE_URL = "https://www.xiaohongshu.com/search_result/?keyword=%25E6%2597%25A0%25E9%2594%25A1%25E6%25B1%2582%25E7%25A7%259F&source=unknown&type=50"


async def extract_rent_user():
    browser_config = BrowserConfig(
        # browser_type="chromium",
        headless=False,# Set to False if you want to see the browser window
        use_managed_browser=True,  # Required for persistent profiles
        user_data_dir="/root/.crawl4ai/profiles/profile_20250415_120515_1a63da",
    )
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        wait_for_images=True,
        scan_full_page=True,  # Tells the crawler to try scrolling the entire page
        scroll_delay=0.5,  # Delay (seconds) between scroll steps
        # verbose=True,
        js_code="window.scrollTo(0, document.body.scrollHeight);",
        extraction_strategy=JsonCssExtractionStrategy(
            schema={
                "name": "XHS Rent User Search Results",
                "baseSelector": ".feeds-container .note-item",
                "fields": [
                    {
                        "name": "user_name",
                        "selector": ".footer .author .name",
                        "type": "text"
                    },
                    {
                        "name": "card_time",
                        "selector": ".footer .author .time",
                        "type": "text"
                    },
                    {
                        "name": "title",
                        "selector": ".title span",
                        "type": "text"
                    },
                    {
                        "name": "room_detail_url",
                        "selector": ".cover.mask.ld",
                        "type": "attribute",
                        "attribute": "href"
                    },
                    {
                        "name": "user_profile",
                        "selector": ".author-wrapper .author",
                        "type": "attribute",
                        "attribute": "href"
                    },
                    {
                        "name": "user_avatar",
                        "selector": ".author-wrapper .author img",
                        "type": "attribute",
                        "attribute": "src"
                    },
                    {
                        "name": "like_count",
                        "selector": ".like-wrapper.like-active .count",
                        "type": "text"
                    }
                ]
            }
        )
    )

    async def after_goto(page: Page, context: BrowserContext, url: str, response: dict, **kwargs):
        """Hook called after navigating to each URL"""
        print(f"[HOOK] after_goto - Successfully loaded: {url}")

        try:
            # Wait for search box to be available
            search_box = await page.wait_for_selector('#search-input', timeout=5000)
            print("search_box check OK!")

            # Type the search query
            await search_box.fill('hhhh')
            print("search_box fill OK!")

            # Get the search button and prepare for navigation
            search_button = await page.wait_for_selector('header .input-button .search-icon', timeout=10000)
            print("search_button check OK!")

            # time.sleep(60)

            # Click with navigation waiting
            await search_button.click()
            print("search_button click OK!")

            await asyncio.sleep(10)  # non-blocking pause; time.sleep() would block the event loop

            # Wait for search results to load
            await page.wait_for_selector('[class="search-layout__main"]', timeout=20000)
            print("[HOOK] Search completed and results loaded!")

        except Exception as e:
            print(f"[HOOK] Error during search operation: {str(e)}")

        return page


    # Use context manager for proper resource handling
    async with AsyncWebCrawler(config=browser_config) as crawler:

        crawler.crawler_strategy.set_hook("after_goto", after_goto)
        await crawler.start()

        # Extract the data
        result = await crawler.arun(url=BASE_URL, config=crawler_config)

        # Process and print the results
        if result and result.extracted_content:
            # Parse the JSON string into a list of products
            products = json.loads(result.extracted_content)
            # Process each product in the list
            for product in products:
                print("\nProduct Details:")
                print(f"UserName: {product.get('user_name')}")
                print(f"CardTime: {product.get('card_time')}")
                print(f"UserProfile: {product.get('user_profile')}")
                print(f"UserAvatar: {product.get('user_avatar')}")
                print(f"Title: {product.get('title')}")
                print(f"RoomDetailUrl: {product.get('room_detail_url')}")
                print(f"LikeCount: {product.get('like_count')}")

OS

Linux

Python version

3.11.12

Browser

Chrome

Browser version

No response

Error logs & Screenshots (if applicable)

No response

duanyr · May 08 '25 07:05