
Support for Direct HTML Parsing in crawl4ai

crelocks opened this issue 1 year ago • 4 comments

I have a specific use case where Cloudflare blocks often prevent successful crawling, and they are hard to get past with crawl4ai alone. To work around this, we tried using flare-bypasser to retrieve the raw HTML content, including scripts and other assets.
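For reference, the retrieval step isn't shown in this thread; a rough sketch of it, assuming flare-bypasser exposes a FlareSolverr-compatible /v1 endpoint (the host, port, and response fields below are assumptions, so check the flare-bypasser docs for your deployment):

import requests

# Assumed FlareSolverr-compatible endpoint exposed by flare-bypasser;
# adjust host, port, and path to match your deployment.
BYPASSER_URL = "http://localhost:8191/v1"

def fetch_html_via_bypasser(url: str, timeout_ms: int = 60000) -> str:
    """Ask the bypasser to solve the Cloudflare challenge and return the raw HTML."""
    payload = {
        "cmd": "request.get",   # FlareSolverr-style command
        "url": url,
        "maxTimeout": timeout_ms,
    }
    resp = requests.post(BYPASSER_URL, json=payload, timeout=timeout_ms / 1000 + 10)
    resp.raise_for_status()
    # Assumed response shape: FlareSolverr-style services return the page body
    # under solution.response.
    return resp.json()["solution"]["response"]

html_data = fetch_html_via_bypasser("https://icvcm.org/news-insights/")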

Feature Request

Rather than providing a URL to crawl4ai, I’d like to directly feed in the raw HTML content retrieved through flare-bypasser. The idea is for crawl4ai to apply the usual schema, strategy, and processing as it would for a URL-based input.

Existing Solution Exploration

I came across a method named aprocess_html in the source code: async_webcrawler.py#L175. However, this method isn't documented, and I'm unsure if it would support this functionality or if it’s meant for a different purpose.

Could you let me know if there's a way to directly pass raw HTML content to crawl4ai? Additionally, if aprocess_html is relevant here, could you provide documentation or guidance on how to use it for this purpose?

Thank you!

crelocks avatar Nov 12 '24 14:11 crelocks

Couldn't wait for a response, but I tried it myself and it worked. Posting it here for anyone else looking for a way to bypass Cloudflare: use the logic above to get the HTML and the function call below to parse it.

from crawl4ai.chunking_strategy import RegexChunking

# aprocess_html is a coroutine, so call it with await inside an async context.
# url, html_data, css_selector, and extraction_strategy come from your own setup.
result = await crawler.aprocess_html(
    url=url,
    html=html_data,
    css_selector=css_selector,
    extracted_content=None,
    word_count_threshold=0,
    chunking_strategy=RegexChunking(),
    screenshot=False,
    verbose=True,
    is_cached=False,
    extraction_strategy=extraction_strategy,
)

Closing this for now

crelocks avatar Nov 12 '24 20:11 crelocks

@crelocks Thanks for using our library. I'm glad you requested offline crawling; it's been suggested by other users as well. You'll be able to pass a local file by using its path as the URL, or pass the HTML as a direct string. We'll implement this feature soon. In the meantime, could you share an example link that caused issues for you and required a flare bypasser? We've made recent changes, including the managed browser ability, which allows crawling with a custom browser using your own Chrome or browser profile. If you provide an example, I'll demonstrate a few ways to do it. We'll also work on implementing your requested feature.
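For reference, once that lands, the interface described above might look roughly like this; the file:// and raw: URL prefixes here are a sketch based on this description, not necessarily the final API:

import asyncio
from crawl4ai import AsyncWebCrawler

async def offline_crawl():
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Hypothetical: point the crawler at a local file by using its path as the URL
        local_result = await crawler.arun(url="file:///tmp/page.html")

        # Hypothetical: pass the HTML itself as a direct string via a raw: prefix
        html_data = "<html><body><h1>Hello</h1></body></html>"
        raw_result = await crawler.arun(url="raw:" + html_data)

        print(local_result.success, raw_result.success)

asyncio.run(offline_crawl())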

unclecode avatar Nov 13 '24 08:11 unclecode

https://icvcm.org/news-insights/ This is one such example.

crelocks avatar Nov 13 '24 13:11 crelocks

@crelocks In version 0.3.74, it crawled beautifully in my test. Stay tuned for the release within a day or two, then test it and let me know.

import os
import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler

async def test_news_crawl():
    # Create a persistent user data directory
    user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
    os.makedirs(user_data_dir, exist_ok=True)

    async with AsyncWebCrawler(
        verbose=True,
        headless=True,
        # Optional: You can switch to persistent mode for much stronger anti-bot protection
        # user_data_dir=user_data_dir,
        # use_persistent_context=True,
        # Optional: You can set custom headers
        # headers={
        #     "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        #     "Accept-Language": "en-US,en;q=0.5",
        #     "Accept-Encoding": "gzip, deflate, br",
        #     "DNT": "1",
        #     "Connection": "keep-alive",
        #     "Upgrade-Insecure-Requests": "1",
        #     "Sec-Fetch-Dest": "document",
        #     "Sec-Fetch-Mode": "navigate",
        #     "Sec-Fetch-Site": "none",
        #     "Sec-Fetch-User": "?1",
        #     "Cache-Control": "max-age=0",
        # }
    ) as crawler:
        url = "https://icvcm.org/news-insights/"
        result = await crawler.arun(
            url,
            bypass_cache=True,
            magic=True, # This is important
        )
        
        assert result.success, f"Failed to crawl {url}: {result.error_message}"
        print(f"Successfully crawled {url}")
        print(f"Content length: {len(result.markdown)}")

unclecode avatar Nov 14 '24 09:11 unclecode

@unclecode It's just weird how some pages work and some don't. For example, from the same website I can't crawl https://icvcm.org/post-sitemap.xml

I tried some other sitemaps from other websites and had the same issue.

crelocks avatar Dec 27 '24 12:12 crelocks

@crelocks Try the code below. This should work, although I am using 0.4.24, which will be out in one or two days. I'll close the issue, but you are welcome to continue if you face any problem.

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    # Configure the browser settings
    browser_config = BrowserConfig(headless=False, verbose=True)

    # Set the run configuration: bypass the cache so the page is fetched fresh
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url='https://icvcm.org/post-sitemap.xml',
            config=crawl_config,
        )
        if result.success:
            print("Raw Markdown Length:", len(result.markdown_v2.raw_markdown))
            print("Citations Markdown Length:", len(result.markdown_v2.markdown_with_citations))


if __name__ == "__main__":
    asyncio.run(main())

Output:

[INIT].... → Crawl4AI 0.4.23
[FETCH]... ↓ https://icvcm.org/post-sitemap.xml... | Status: True | Time: 1.86s
[SCRAPE].. ◆ Processed https://icvcm.org/post-sitemap.xml... | Time: 237ms
[COMPLETE] ● https://icvcm.org/post-sitemap.xml... | Status: True | Total: 2.10s
Raw Markdown Length: 12386
Citations Markdown Length: 12346

unclecode avatar Dec 27 '24 12:12 unclecode

Thanks @unclecode for the example. I guess browser_type="firefox" was the issue; it was not working with Firefox.
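For anyone else hitting this, the engine is chosen through BrowserConfig; a minimal sketch, assuming the Playwright-backed browser_type values ("chromium", "firefox", "webkit"):

from crawl4ai import AsyncWebCrawler, BrowserConfig

# Chromium is the default engine; switching browser_type to "firefox" was what
# failed here, so staying on chromium avoids the issue.
browser_config = BrowserConfig(browser_type="chromium", headless=True, verbose=True)

# async with AsyncWebCrawler(config=browser_config) as crawler:
#     result = await crawler.arun(url="https://icvcm.org/post-sitemap.xml")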

crelocks avatar Dec 27 '24 12:12 crelocks

Perfect, so I'll close the issue, but you are welcome to continue if you need any help.

unclecode avatar Dec 27 '24 13:12 unclecode

In my case, I am getting HTML from a third-party external processor (due to some policy issues), so passing HTML is the only possible option for me.

For now, aprocess_html seems to be the only possible approach.

Thanks @crelocks for the snippet and @unclecode for the awesome library.

kamit-transient avatar Dec 27 '24 16:12 kamit-transient

@kamit-transient You can also pass raw HTML to the arun() function, so you don't need to call aprocess_html(). Please check this issue, where I explained it: https://github.com/unclecode/crawl4ai/issues/381
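A minimal sketch of that approach, assuming the raw: URL prefix described in issue #381; the rest of the configuration works the same as for a normal URL:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def parse_external_html(html_data: str):
    async with AsyncWebCrawler() as crawler:
        # The raw: prefix (assumed here) tells crawl4ai to parse the given string
        # instead of fetching a URL, so HTML from an external processor can be reused.
        result = await crawler.arun(
            url="raw:" + html_data,
            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
        )
        print(result.markdown)

asyncio.run(parse_external_html("<html><body><p>Example</p></body></html>"))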

unclecode avatar Dec 28 '24 11:12 unclecode