Slow performance of crawl4AI in Docker compared to pip installation outside Docker environment
I am seeing much slower crawl times when running crawl4AI in a Docker container than with a regular pip installation on the same machine. Could a configuration or environment issue be causing this discrepancy? Please let me know if there are any errors or optimizations I may have overlooked.
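One common culprit for slow headless browsers inside containers (an assumption here, not something confirmed in this thread) is shared memory: Docker caps `/dev/shm` at 64 MB by default, which starves Chromium. A sketch of running the container with a larger shared-memory segment and an explicit memory limit; the image name and port are the crawl4AI defaults, but verify them against your setup:

```shell
# Docker defaults /dev/shm to 64 MB, which can throttle headless Chromium.
# Raise it (or use --ipc=host) and give the container enough RAM.
docker run -d \
  --shm-size=1g \
  --memory=4g \
  -p 11235:11235 \
  unclecode/crawl4ai
```

Comparing crawl times with and without `--shm-size` should quickly show whether shared memory is the bottleneck.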
@QuangTQV Can you share the specs with me? Is it AMD or ARM, and how much memory have you assigned to your Docker container? Also, which hardware are you running it on? I'm curious to know.
I was mistaken, sorry. But now, how can I get markdown back? The response only contains HTML, and the markdown field is empty. My request payload:
{
  "urls": "https://www.dienmayxanh.com/",
  "word_count_threshold": 1,
  "extraction_config": { "type": "basic", "params": {} },
  "chunking_strategy": { "type": "string", "params": {} },
  "content_filter": { "type": "bm25", "params": {} },
  "js_code": [ "string" ],
  "wait_for": "string",
  "css_selector": "string",
  "screenshot": false,
  "magic": false,
  "extra": {},
  "session_id": "string",
  "cache_mode": "enabled",
  "priority": 5,
  "ttl": 3600,
  "crawler_params": {}
}
@QuangTQV The code below shows how to use the new version:
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Configure the browser settings
    browser_config = BrowserConfig(
        headless=True,
        verbose=True,
        user_agent_mode="random",
    )

    # Set run configuration, including cache mode and markdown generator
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=DefaultMarkdownGenerator(
            # content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0),
            # options={"ignore_links": True}
        ),
    )

    async with AsyncWebCrawler(browser_config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.kidocode.com/degrees/technology",
            config=crawl_config
        )
        if result.success:
            print("Raw Markdown Length:", len(result.markdown_v2.raw_markdown))
            print("Citations Markdown Length:", len(result.markdown_v2.markdown_with_citations))
            # fit_markdown exists only if you pass a content filter
            # print("Fit Markdown Length:", len(result.markdown_v2.fit_markdown))

if __name__ == "__main__":
    asyncio.run(main())
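If you also want `fit_markdown`, the commented-out lines above need to be enabled. A sketch of that run configuration, reusing the same classes and parameter values shown in the code above (a configuration fragment, not a full script):

```python
from crawl4ai import CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# Passing a content filter to the markdown generator is what populates
# result.markdown_v2.fit_markdown alongside raw_markdown.
crawl_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(
            threshold=0.48, threshold_type="fixed", min_word_threshold=0
        ),
        options={"ignore_links": True},
    ),
)
```

With this config passed to `crawler.arun(...)`, `result.markdown_v2.fit_markdown` should contain the filtered markdown instead of being empty.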