
I have some basic questions about using Crawl4AI

Open mozou opened this issue 1 year ago • 9 comments

I'm very sorry, but I still want to ask.

I did some simple learning and understanding at https://crawl4ai.com/mkdocs/.

I tried to crawl some simple pages. I found that because of the built-in interface, it is very easy to get media resources, which is very interesting and great!

But I still have some additional questions to ask.

I used Selenium and Pyppeteer for crawling before, and this time I used Crawl4AI, but I didn't really feel its power (I didn't use the LLM features this time). Maybe that's because I'm a beginner with Crawl4AI.

I found that it simplifies some common crawler problems and provides convenient interfaces, but it doesn't seem much stronger than traditional crawler frameworks when it comes to crawling and CAPTCHA handling. Can you tell me its advantages? Thank you.

mozou avatar Dec 26 '24 07:12 mozou

I think so too. I have encountered many problems during use, and the documentation is not clear, so I don't know how to solve them.

sz0811 avatar Dec 26 '24 08:12 sz0811

@mozou Thanks for trying the library. Could you show me some examples? For instance, define a task you'd like to do and explain how you found it easier with other libraries. I want to clarify that I don't compare Crawl4ai with Selenium or Playwright; those are wrappers around Chromium, and I find Playwright much faster than Selenium. Crawl4ai generates data suitable for large language models, either structured output or high-quality markdown, and that goal has motivated me from the very beginning. I focus on generating markdown quickly while giving developers the ability to intervene at any stage. The library excels at generating markdown efficiently, and I'm also working on making it scalable in the cloud.
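
The basic flow looks roughly like this (a minimal sketch using the public AsyncWebCrawler API; example.com just stands in for whatever page you want to crawl):

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        # Fetch the page and turn it into clean, LLM-ready markdown
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)

asyncio.run(main())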

To provide a good comparison, please share a specific task you find difficult to accomplish with Crawl4ai, and I will create a version for you to compare. @sz0811 is somewhat correct; the documentation currently confuses users due to numerous changes, and it hasn't been updated properly. That's why I've been working heavily on the documentation for the last two weeks, and I will soon update it with more practical examples.

I need help and support from community members because my goal is to ensure this becomes the best available tool for people who need data extraction for AI applications. I appreciate your support. Please share your case study, and I can create a coding snippet for your feedback.

unclecode avatar Dec 26 '24 09:12 unclecode

@unclecode First of all, thank you for your answer.

I found that the most common problem in crawling is anti-crawling measures on websites, especially shopping and social sites: for example, slider CAPTCHAs, or click verification (such as clicking on pictures containing "cats"). These are very troublesome to deal with.

In traditional crawlers, these are usually handled with OCR or third-party solving platforms. Crawl4AI, however, integrates large models, so perhaps they can be broken through with the power of AI.

I used traditional crawler frameworks before, and I happened to find your very interesting project on GitHub, so I tried to learn and understand it. I did not use the LLM features this time, but I will try them later because they are a very attractive part of Crawl4AI.

mozou avatar Dec 26 '24 09:12 mozou

@unclecode Because my question is based on a comparison with traditional crawler frameworks, it may not be entirely fair; after all, Crawl4AI is built specifically for AI and LLM use cases.

But running into various anti-crawling measures is a common problem for all crawler frameworks. Although Crawl4AI can simulate users to reduce the risk of being blocked, it still struggles on sites with strict anti-crawling protections. If Crawl4AI could handle these problems, it might become the greatest crawler framework in history, and even a beginner could become a crawling master.

I really hope that Crawl4AI becomes more powerful, because it is very interesting. And thank you for spending so much time writing documentation for us beginners.

mozou avatar Dec 26 '24 09:12 mozou

@mozou Thank you for your kind words. You mentioned something crucial, and I will spend a lot of time on that. Right now, we have something called a managed browser. With the managed browser, you can do everything you can do with your personal browser using Crawl4ai. I have detailed multiple GitHub issues and included code examples in the new documentation.

The idea is to open a new browser in your terminal using a command line. Then, you assign a new folder called the user data profile directory or user profile directory to that browser. This action opens a fresh new browser. You can then visit all the pages you want to crawl. If you need to log in, you log in. If you need to bypass anti-crawling gates, you do that as well. Essentially, this is your browser and your identity.

After closing the browser, you start using Crawl4ai and pass the folder. This time, Crawl4ai opens the browser attached to that folder, and you magically have everything you created. You can crawl and act because you are using your own identity, and you deserve it since it's your own data and browser. This is just one of the multiple approaches I incorporated, and I can say it works with the majority of websites.
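
A minimal sketch of that flow (the --user-data-dir flag is standard Chromium; the use_managed_browser and user_data_dir option names follow the current docs, so double-check them against the version you have installed):

# 1) In a terminal, launch a fresh browser attached to a new profile folder,
#    log in and clear any anti-bot gates manually, then close it:
#    chromium --user-data-dir=/path/to/my_profile
#
# 2) Point Crawl4ai at the same folder so it reuses that identity.
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def main():
    browser_config = BrowserConfig(
        headless=False,
        use_managed_browser=True,             # reuse a real, persistent browser profile
        user_data_dir="/path/to/my_profile",  # the folder created in step 1
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com/protected-page")
        print(result.markdown[:500])

asyncio.run(main())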

In the last three to four months, I received many requests, and I have already handled many of them, which made Crawl4ai stand out. I tried to find links to those issues to share with you, and they will also be included in the new documentation. So please stay tuned for that.

unclecode avatar Dec 26 '24 12:12 unclecode

@unclecode Thank you for your answer, I hope you can succeed. I will continue to learn Crawl4AI, I hope it can become more powerful.

mozou avatar Dec 26 '24 14:12 mozou

@unclecode Hello and congrats!

I tried several features of the tool: page interaction, advanced session management, and the auth crawler strategy via hooks.

I noticed that the media extraction and analysis capabilities don't detect everything perfectly in real-world conditions. I am mainly referring to background images, which media processing sometimes fails to detect.

So I tried to work around it by pre-executing my custom js_code to collect all background images before the browser returns the HTML. For example, this URL https://akispetretzikis.com/recipe/8615/revithada-me-chwriatiko-loukaniko is a simple recipe page.

I created this js_code:

const backgrounds = new Set(
    [...document.querySelectorAll('*')]
    .map(el => getComputedStyle(el).backgroundImage)
    .filter(img => img && img !== 'none')
);
const result = [...backgrounds];
let url = result[0] || undefined; // get the first bg image to test it out.
if (url) {
    // Extract the first URL with regex
    const matches = url.match(/url\("https:\/\/[^"]+\.(jpg|jpeg|png|apng|webp|avif|gif)"\)/gi);
    if (matches) {
    // If matches are found, loop through them and extract the URLs
        matches.forEach(match => {
            // Extract the URL from each match
            const imageUrl = match.match(/url\("([^"]+)"/i)[1];
            // console.log(imageUrl); // Log each image URL
            // Create an img element and set the src attribute to the extracted image URL
            const img = document.createElement('img');
            img.src = imageUrl; // Set the extracted URL as the img src
            img.classList.add('bg-image-selection');  // Replace with your class name
            document.body.appendChild(img); // Append the img element to the document body
        });
    }  
}

But when I tried to wait_for the condition, it did not work properly, with code like this:

bg_images = ".bg-image-selection"
result = await crawler.arun(
    # session_id=session_id,
    excluded_tags=['header', 'footer', 'nav', 'meta', 'link'],  # Additional tags to remove
    # url="https://akispetretzikis.com/recipe/8733/christougenniatikh-mpolonez-me-kima-galopoulas-kai-kastana",
    url=url,
    js_code=js_code, # where js_code is the previous code content
    process_iframes=True,  # Extract iframe content
    remove_overlay_elements=True,  # Remove popups/modals that might block iframe
    wait_for=f"css:{bg_images}",
    # css_selector=bg_images,
    screenshot=True,

    # Timing
    # delay_before_return_html=3.0,   # Additional wait time
    magic=True,
    scan_full_page=True,   # Enables scrolling
    scroll_delay=0.2, # Waits 200ms between scrolls (optional)
    
    cache_mode=CacheMode.BYPASS,  # New way to handle cache
    wait_for_images=True,  # Add this argument to ensure images are fully loaded
    simulate_user=True,      # Simulate human behavior
    override_navigator=True,  # Override navigator properties
    # adjust_viewport_to_content=True,  # Dynamically adjusts the viewport
    # markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True}),
    js_only=True,  # Only execute JS without reloading page
)

# Access different media types
images = result.media["images"]  # List of image details
....

Call log:

  • waiting for locator(".bg-image-selection") to be visible still waiting .. to be visible

prokopis3 avatar Dec 26 '24 23:12 prokopis3

@prokopis3 I will check this URL and get back to you. The desired outcome is that we should be able to crawl it, and we definitely will. Since day zero, many people have reported different situations, and I keep updating the library. Hopefully, within a few months, we can say we have covered all of them. I will check this one and provide an update.

unclecode avatar Dec 27 '24 12:12 unclecode

After a quick review, I am so excited to see such excellent work from you. Congratulations! I noticed that image scraping now has a success rate of about 97-99%.

I consider this one of the best repositories at the moment in this field.

Additionally, I would like to propose some ideas that could help in automation.

For example, there should be a way to recognize elements or enable element tracking after fetching the HTML, similar to how browser-use automation works.

However, "browser-use" usage consumes a lot of tokens, which is a significant drawback in the browse-less approach.

I would love to contribute to this project, but I must admit I don’t have much experience in this area or with GitHub team workflows.

Best regards. Well done, keep going!

prokopis3 avatar Jan 12 '25 15:01 prokopis3

@prokopis3 Thank you so much for your kind words; they really motivate me to know that what I am trying to build is helping people. First of all, you are most welcome to join us, help us, and contribute. If you share your email address, I will send you the invitation link.

I want to discuss the idea you proposed here. My plan is to reduce the dependency on language models for automation. I also plan to fine-tune a small language model that can run locally and generate the JS script needed to complete a task, rather than having the model process every stage itself. This approach will be much faster and more efficient.

Additionally, I am glad to know that recent changes have improved your experience with scraping images. A few more changes are coming to make things even faster and more efficient in the next version, 0.4.248, which I will release in one or two days. I will also record a video about it.

Thanks again.

unclecode avatar Jan 13 '25 12:01 unclecode