Issue rendering images
Hey @unclecode, when using crawl4ai to scrape a few sites (e.g., Lululemon), I'm not able to extract all the images from the product page. I noticed that these images are dynamically rendered. I tried using the js_code parameter to render all the images related to the product, but not all of them are being rendered.
Can you please explain how you go about figuring out the js_code needed to render the images?
Here is the code I'm currently using:
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode

async def main():
    async with AsyncWebCrawler(
        headless=False,
        verbose=True,
    ) as crawler:
        result = await crawler.arun(
            url="https://shop.lululemon.com/p/mens-jackets-and-outerwear/Wunder-Puff-Jacket-M/_/prod11140197?color=19746",
            cache_mode=CacheMode.BYPASS,
            # Click every carousel dot to force the gallery images to render.
            js_code="""let items = document.querySelectorAll('.dot-V819x'); for (let item of items) { item.click(); }""",
            delay_before_return_html=0.2,
        )
        print(len(result.media['images']))
        for img in result.media['images']:
            print(img['src'])

if __name__ == "__main__":
    asyncio.run(main())
```
When checking the terminal output, not all of the images are present. Could you please help me with this issue?
Thanks in advance!
My General Response to the Question
To figure out the JavaScript needed for specific scenarios, it comes down to experience, trial and error, and an understanding of web development. Familiarity with standard design patterns, especially for common use cases like e-commerce, helps you anticipate how a website is structured. Advanced web development techniques, such as dynamic rendering and lazy loading, are also crucial because they determine when data is actually loaded. Knowing these patterns allows me to craft JavaScript that ensures all the data is fully loaded in the browser.
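For example, many product pages lazy-load their gallery images, so the `<img>` tags only appear once they scroll into view. A minimal sketch of the kind of scrolling snippet I mean — the step size, pause, and delay values are assumptions you would tune per site, and the URL is a placeholder:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode

# Scroll the page in steps so lazy-loaded images enter the viewport
# and the site's loader fetches them before we grab the HTML.
SCROLL_JS = """
(async () => {
    for (let y = 0; y <= document.body.scrollHeight; y += 600) {
        window.scrollTo(0, y);
        await new Promise(r => setTimeout(r, 300));  // let the loader catch up
    }
})();
"""

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://example.com/some-product",  # placeholder URL
            cache_mode=CacheMode.BYPASS,
            js_code=SCROLL_JS,
            delay_before_return_html=2.0,  # wait for the last batch of images
        )
        print(len(result.media["images"]))

asyncio.run(main())
```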
My approach is to first browse the website in a normal browser, use developer tools, and experiment in the console. This provides a playground to test JavaScript and confirm it loads the necessary data. Once everything works, I move the tested JavaScript into my crawler library for use, avoiding trial and error in Python. Chrome developer tools make this process much faster and more efficient.
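Applied to your Lululemon example: once the console experiment confirms that clicking the carousel dots injects the remaining `<img>` tags, you can move that same snippet into `arun()`. A sketch along those lines — the `.dot-V819x` selector is taken from your code, while the per-click pause and the longer `delay_before_return_html` (0.2 s is usually too short for network-loaded images) are assumptions to tune:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode

# Click every carousel dot, pausing between clicks so each image
# request has time to start before the next one is triggered.
CLICK_DOTS_JS = """
(async () => {
    const dots = document.querySelectorAll('.dot-V819x');
    for (const dot of dots) {
        dot.click();
        await new Promise(r => setTimeout(r, 500));
    }
})();
"""

async def main():
    async with AsyncWebCrawler(headless=False, verbose=True) as crawler:
        result = await crawler.arun(
            url="https://shop.lululemon.com/p/mens-jackets-and-outerwear/"
                "Wunder-Puff-Jacket-M/_/prod11140197?color=19746",
            cache_mode=CacheMode.BYPASS,
            js_code=CLICK_DOTS_JS,
            delay_before_return_html=3.0,  # give the last click time to render
        )
        for img in result.media["images"]:
            print(img["src"])

asyncio.run(main())
```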
Our Plan for the Library: LLMs and Community
For Crawl4AI, we plan to fine-tune small language models next year to assist with data extraction challenges. These models will analyze the data you need and the problems you're facing, then generate the JavaScript code required to fix them. They’ll work with already crawled HTML and, in some cases, page images. Instead of extracting all the data (which can be slow and token-heavy), the models will provide targeted JS snippets to solve your specific issues.
We also plan to build a community-driven hub where users can contribute JavaScript snippets for popular websites. This shared library will contain pre-built code snippets to help users with common crawling scenarios. You’ll be able to load these directly into your libraries, making crawling more efficient and collaborative.
This combination of LLMs and community contributions is a key part of our roadmap for next year, and I believe it will significantly improve how we handle crawling challenges. I hope this helps clarify our vision!