
Slow performance of crawl4AI in Docker compared to pip installation outside Docker environment

Open QuangTQV opened this issue 1 year ago • 3 comments

I am seeing slow performance when using crawl4AI in a Docker environment, whereas when I test it outside of Docker with a regular pip installation it is significantly faster. Could a configuration or environment issue be causing this discrepancy? Please let me know if there is anything I have overlooked or could optimize.

QuangTQV avatar Dec 09 '24 02:12 QuangTQV

@QuangTQV Can you share the specs with me? Is it AMD or ARM, and how much memory have you assigned to your Docker container? Also, what hardware are you running it on? I'm curious to know.

unclecode avatar Dec 09 '24 12:12 unclecode

I was mistaken, sorry. But now, how can I get markdown back? It only returns HTML, and the markdown field is empty.

(screenshot attached)

{ "urls": "https://www.dienmayxanh.com/", "word_count_threshold": 1, "extraction_config": { "type": "basic", "params": {} }, "chunking_strategy": { "type": "string", "params": {} }, "content_filter": { "type": "bm25", "params": {} }, "js_code": [ "string" ], "wait_for": "string", "css_selector": "string", "screenshot": false, "magic": false, "extra": {}, "session_id": "string", "cache_mode": "enabled", "priority": 5, "ttl": 3600, "crawler_params": {} }

QuangTQV avatar Dec 12 '24 08:12 QuangTQV

@QuangTQV The code below shows how to use the new version:

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Configure the browser settings
    browser_config = BrowserConfig(
        headless=True,
        verbose=True,
        user_agent_mode="random",
    )

    # Set run configurations, including cache mode and markdown generator
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=DefaultMarkdownGenerator(
            # content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0),
            # options={"ignore_links": True}
        )
    )

    async with AsyncWebCrawler(browser_config=browser_config) as crawler:
        result = await crawler.arun(
            url='https://www.kidocode.com/degrees/technology',
            config=crawl_config
        )

        if result.success:
            print("Raw Markdown Length:", len(result.markdown_v2.raw_markdown))
            print("Citations Markdown Length:", len(result.markdown_v2.markdown_with_citations))
            # Fit markdown exists if you pass content filter
            # print("Fit Markdown Length:", len(result.markdown_v2.fit_markdown))

if __name__ == "__main__":
    asyncio.run(main())
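
If you also want fit_markdown (the filtered, condensed markdown), pass a content filter to the markdown generator. A minimal variant of the run config above, using the PruningContentFilter already referenced in the comments (the threshold values are illustrative, not recommendations):

# Same imports as above; enabling the filter populates result.markdown_v2.fit_markdown.
filtered_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(
            threshold=0.48,
            threshold_type="fixed",
            min_word_threshold=0,
        ),
        options={"ignore_links": True},
    ),
)

# After result = await crawler.arun(url=..., config=filtered_config):
# print("Fit Markdown Length:", len(result.markdown_v2.fit_markdown))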

unclecode avatar Dec 13 '24 12:12 unclecode