Support for Direct HTML Parsing in crawl4ai
I have a specific use case where Cloudflare blocks often prevent successful crawling and are difficult to bypass with crawl4ai alone. To handle this, we tried using flare-bypasser to retrieve the raw HTML content, including scripts and other assets.
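(For illustration only: a minimal sketch of fetching HTML through a FlareSolverr-compatible endpoint, which flare-bypasser advertises; the endpoint URL, payload fields, and response shape below are assumptions, not taken from this thread, so adjust them to your deployment.)

import requests

def fetch_html_via_bypasser(url: str, endpoint: str = "http://localhost:8191/v1") -> str:
    # FlareSolverr-style services accept a JSON command and return the rendered
    # page (after the challenge is solved) under solution.response.
    payload = {"cmd": "request.get", "url": url, "maxTimeout": 60000}
    resp = requests.post(endpoint, json=payload, timeout=90)
    resp.raise_for_status()
    return resp.json()["solution"]["response"]

html_data = fetch_html_via_bypasser("https://example.com/protected-page")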
Feature Request
Rather than providing a URL to crawl4ai, I'd like to directly feed in the raw HTML content retrieved through flare-bypasser. The idea is for crawl4ai to apply the usual schema, strategy, and processing as it would for a URL-based input.
Existing Solution Exploration
I came across a method named aprocess_html in the source code: async_webcrawler.py#L175. However, this method isn't documented, and I'm unsure if it would support this functionality or if it's meant for a different purpose.
Could you let me know if there's a way to directly pass raw HTML content to crawl4ai? Additionally, if aprocess_html is relevant here, could you provide documentation or guidance on how to use it for this purpose?
Thank you!
Couldn't wait for your response, but I tried it myself and it worked. Posting it here for anyone else looking for a solution to bypass Cloudflare. Please use the above logic to get the HTML and the function call below to parse it:
from crawl4ai.chunking_strategy import RegexChunking

# aprocess_html is a coroutine, so it must be awaited inside an async context
result = await crawler.aprocess_html(
    url=url,
    html=html_data,
    css_selector=css_selector,
    extracted_content=None,
    word_count_threshold=0,
    chunking_strategy=RegexChunking(),
    screenshot=False,
    verbose=True,
    is_cached=False,
    extraction_strategy=extraction_strategy,
)
Closing this for now
@crelocks Thanks for using our library. I'm glad you requested offline crawling; it has been suggested by other users as well. We can support passing a local file by using its path as the URL, or passing the HTML as a direct string, and we'll implement this feature soon. In the meantime, could you share an example link that caused issues for you and required a flare bypasser? We've made recent changes, including managed browser support, which allows crawling with a custom browser using your own Chrome or browser profile. If you provide an example, I'll demonstrate a few ways to do it. We'll also work on implementing your requested feature.
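(For later readers: local-file crawling did land in subsequent releases. A minimal sketch assuming the documented file:// URL prefix; saved_page.html is a placeholder path.)

import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler

async def crawl_local_file(path: str):
    # A file:// URL tells crawl4ai to read the page from disk instead of the
    # network, then run the usual scraping/markdown pipeline on it.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=f"file://{Path(path).resolve()}")
        print(result.markdown[:200])

if __name__ == "__main__":
    asyncio.run(crawl_local_file("saved_page.html"))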
https://icvcm.org/news-insights/ This is one such example.
@crelocks In version 0.3.74, it crawled beautifully in my test. Stay tuned for the release within a day or two, then test and let me know.
import os
import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler

async def test_news_crawl():
    # Create a persistent user data directory
    user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
    os.makedirs(user_data_dir, exist_ok=True)

    async with AsyncWebCrawler(
        verbose=True,
        headless=True,
        # Optional: You can switch to persistent mode for much stronger anti-bot protection
        # user_data_dir=user_data_dir,
        # use_persistent_context=True,
        # Optional: You can set custom headers
        # headers={
        #     "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        #     "Accept-Language": "en-US,en;q=0.5",
        #     "Accept-Encoding": "gzip, deflate, br",
        #     "DNT": "1",
        #     "Connection": "keep-alive",
        #     "Upgrade-Insecure-Requests": "1",
        #     "Sec-Fetch-Dest": "document",
        #     "Sec-Fetch-Mode": "navigate",
        #     "Sec-Fetch-Site": "none",
        #     "Sec-Fetch-User": "?1",
        #     "Cache-Control": "max-age=0",
        # }
    ) as crawler:
        url = "https://icvcm.org/news-insights/"
        result = await crawler.arun(
            url,
            bypass_cache=True,
            magic=True,  # This is important
        )
        assert result.success, f"Failed to crawl {url}: {result.error_message}"
        print(f"Successfully crawled {url}")
        print(f"Content length: {len(result.markdown)}")

if __name__ == "__main__":
    asyncio.run(test_news_crawl())
@unclecode It is just weird how some pages work and some don't. For example, from the same website I can't crawl https://icvcm.org/post-sitemap.xml. I tried some other sitemaps from other websites and had the same issue.
@crelocks Try the code below. This should work, although I am using 0.4.24, which will be out in a day or two. I'll close the issue, but you are welcome to continue if you face any problem.
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    # Configure the browser settings
    browser_config = BrowserConfig(headless=False, verbose=True)

    # Set the run configuration, bypassing the cache for a fresh fetch
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url='https://icvcm.org/post-sitemap.xml',
            config=crawl_config
        )
        if result.success:
            print("Raw Markdown Length:", len(result.markdown_v2.raw_markdown))
            print("Citations Markdown Length:", len(result.markdown_v2.markdown_with_citations))

if __name__ == "__main__":
    asyncio.run(main())
Output:
[INIT].... Crawl4AI 0.4.23
[FETCH]... https://icvcm.org/post-sitemap.xml... | Status: True | Time: 1.86s
[SCRAPE].. Processed https://icvcm.org/post-sitemap.xml... | Time: 237ms
[COMPLETE] https://icvcm.org/post-sitemap.xml... | Status: True | Total: 2.10s
Raw Markdown Length: 12386
Citations Markdown Length: 12346
Thanks @unclecode for the example. I guess browser_type="firefox" was the issue; it was not working with Firefox.
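(For reference, a minimal sketch of pinning the engine explicitly through BrowserConfig's browser_type parameter; switching to chromium, the default engine, is assumed to be what resolved it here.)

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    # browser_type accepts "chromium", "firefox", or "webkit"; chromium is the
    # default and the engine that worked on this site.
    browser_config = BrowserConfig(browser_type="chromium", headless=True, verbose=True)
    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://icvcm.org/post-sitemap.xml", config=run_config)
        print("Success:", result.success)

if __name__ == "__main__":
    asyncio.run(main())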
Perfect, so I'll close the issue, but you are welcome to continue if you need any help.
In my case I am getting HTML from a third-party external processor (due to some policy issues), so passing HTML is the only possible option for me.
For now, aprocess_html seems to be the only possible approach.
Thanks @crelocks for the snippet and @unclecode for the awesome library.
@kamit-transient You can also pass raw HTML to the arun() function, so you don't need to call aprocess_html(). Please check this issue where I explained it: https://github.com/unclecode/crawl4ai/issues/381
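(A minimal sketch of that approach, assuming the raw: URL prefix described in issue #381 and the current docs; the sample HTML string is just a placeholder.)

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def crawl_raw_html(html_data: str):
    # Prefixing the "URL" with raw: makes crawl4ai process the string directly,
    # skipping the network fetch while still applying the normal pipeline.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=f"raw:{html_data}",
            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
        )
        print("Markdown length:", len(result.markdown))

if __name__ == "__main__":
    sample_html = "<html><body><h1>Hello</h1><p>Parsed without a network fetch.</p></body></html>"
    asyncio.run(crawl_raw_html(sample_html))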