crawl4ai
[Bug]: cannot crawl all content with scan_full_page=True
crawl4ai version
0.5.0.post8
Expected Behavior
All note-item content should be collected by the time the crawler has scrolled to the end of the page.
Current Behavior
Only the note-item content still present on the last screen is extracted.
Is this reproducible?
Yes
Inputs Causing the Bug
URL: https://www.xiaohongshu.com/search_result/?keyword=%25E6%2597%25A0%25E9%2594%25A1%25E6%25B1%2582%25E7%25A7%259F&source=unknown&type=50
Hello. When I scroll through the list page of the site I want to crawl, the number of note-item elements inside the feeds-container stays fixed: even though the data-index values (see screenshot) keep increasing while scrolling, the count of note-item nodes in the feeds-container does not change. In this situation, how can I crawl the content of every note-item during the scroll?
When I run my code, it automatically scrolls to the end of the page, but it only extracts the note-item content from the last screen. The content from earlier in the feed has already been removed from the feeds-container, so I can't get it.
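For context on the workaround direction: because the feed is virtualized (scrolled-out note-item nodes are unmounted), a single end-of-scroll extraction can only see the last screen, so items have to be snapshotted while scrolling. Below is a minimal plain-Playwright sketch of that idea, deduplicating by the data-index attribute visible in the screenshot; the scroll step size, step count, and 0.5 s render delay are assumptions, not values taken from crawl4ai.

import asyncio

from playwright.async_api import async_playwright

SEARCH_URL = "https://www.xiaohongshu.com/search_result/?keyword=%25E6%2597%25A0%25E9%2594%25A1%25E6%25B1%2582%25E7%25A7%259F&source=unknown&type=50"


async def collect_note_items(url: str, max_scrolls: int = 30) -> dict[str, str]:
    """Scroll the feed step by step, snapshotting note-item HTML after each
    step and keying it by data-index so recycled DOM nodes are not lost."""
    collected: dict[str, str] = {}
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto(url)
        for _ in range(max_scrolls):
            # Snapshot whatever is currently mounted in the virtualized container
            for item in await page.query_selector_all(".feeds-container .note-item"):
                index = await item.get_attribute("data-index")
                if index is not None and index not in collected:
                    collected[index] = await item.evaluate("el => el.outerHTML")
            # Scroll one viewport further and give the feed time to render
            await page.evaluate("window.scrollBy(0, window.innerHeight)")
            await asyncio.sleep(0.5)
        await browser.close()
    return collected


if __name__ == "__main__":
    items = asyncio.run(collect_note_items(SEARCH_URL))
    print(f"collected {len(items)} unique note-item entries")

The parsed fields could then be extracted offline from each saved HTML fragment, for example with the same CSS selectors used in the schema below.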
Steps to Reproduce
Code snippets
import asyncio
import json

from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from playwright.async_api import BrowserContext, Page

# Search URL reported above
BASE_URL = "https://www.xiaohongshu.com/search_result/?keyword=%25E6%2597%25A0%25E9%2594%25A1%25E6%25B1%2582%25E7%25A7%259F&source=unknown&type=50"


async def extract_rent_user():
    browser_config = BrowserConfig(
        # browser_type="chromium",
        headless=False,  # set to False to watch the browser window
        use_managed_browser=True,  # required for persistent profiles
        user_data_dir="/root/.crawl4ai/profiles/profile_20250415_120515_1a63da",
    )
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        wait_for_images=True,
        scan_full_page=True,  # tells the crawler to try scrolling the entire page
        scroll_delay=0.5,  # delay (seconds) between scroll steps
        # verbose=True,
        js_code="window.scrollTo(0, document.body.scrollHeight);",
        extraction_strategy=JsonCssExtractionStrategy(
            schema={
                "name": "XHS Rent User Search Results",
                "baseSelector": ".feeds-container .note-item",
                "fields": [
                    {
                        "name": "user_name",
                        "selector": ".footer .author .name",
                        "type": "text",
                    },
                    {
                        "name": "card_time",
                        "selector": ".footer .author .time",
                        "type": "text",
                    },
                    {
                        "name": "title",
                        "selector": ".title span",
                        "type": "text",
                    },
                    {
                        "name": "room_detail_url",
                        "selector": ".cover.mask.ld",
                        "type": "attribute",
                        "attribute": "href",
                    },
                    {
                        "name": "user_profile",
                        "selector": ".author-wrapper .author",
                        "type": "attribute",
                        "attribute": "href",
                    },
                    {
                        "name": "user_avatar",
                        "selector": ".author-wrapper .author img",
                        "type": "attribute",
                        "attribute": "src",
                    },
                    {
                        "name": "like_count",
                        "selector": ".like-wrapper.like-active .count",
                        "type": "text",
                    },
                ],
            }
        ),
    )

    async def after_goto(page: Page, context: BrowserContext, url: str, response: dict, **kwargs):
        """Hook called after navigating to each URL."""
        print(f"[HOOK] after_goto - Successfully loaded: {url}")
        try:
            # Wait for the search box to be available
            search_box = await page.wait_for_selector("#search-input", timeout=5000)
            print("search_box check OK!")
            # Type the search query
            await search_box.fill("hhhh")
            print("search_box fill OK!")
            # Get the search button and prepare for navigation
            search_button = await page.wait_for_selector("header .input-button .search-icon", timeout=10000)
            print("search_button check OK!")
            # Click, then give the page time to navigate
            await search_button.click()
            print("search_button click OK!")
            # asyncio.sleep instead of time.sleep, so the event loop is not blocked
            await asyncio.sleep(10)
            # Wait for the search results to load
            await page.wait_for_selector('[class="search-layout__main"]', timeout=20000)
            print("[HOOK] Search completed and results loaded!")
        except Exception as e:
            print(f"[HOOK] Error during search operation: {e}")
        return page

    # The context manager handles startup and shutdown, so no explicit
    # crawler.start() / crawler.close() calls are needed.
    async with AsyncWebCrawler(config=browser_config) as crawler:
        crawler.crawler_strategy.set_hook("after_goto", after_goto)
        # Extract the data
        result = await crawler.arun(url=BASE_URL, config=crawler_config)
        # Process and print the results
        if result and result.extracted_content:
            # Parse the JSON string into a list of note items
            products = json.loads(result.extracted_content)
            for product in products:
                print("\nProduct Details:")
                print(f"UserName: {product.get('user_name')}")
                print(f"CardTime: {product.get('card_time')}")
                print(f"UserProfile: {product.get('user_profile')}")
                print(f"UserAvatar: {product.get('user_avatar')}")
                print(f"Title: {product.get('title')}")
                print(f"RoomDetailUrl: {product.get('room_detail_url')}")
                print(f"LikeCount: {product.get('like_count')}")


if __name__ == "__main__":
    asyncio.run(extract_rent_user())
OS
Linux
Python version
3.11.12
Browser
Chrome
Browser version
No response
Error logs & Screenshots (if applicable)
No response