How to use Ollama correctly
Excuse me. Here is a piece of my code:
extraction_strategy = LLMExtractionStrategy(
    provider='ollama_chat/qwen2.5-coder',
    url_base="http://localhost:11434",
    api_token=os.getenv('OPENAI_API_KEY'),
    schema=KnowledgeGraph.model_json_schema(),
    extraction_type="schema",
    instruction="""Extract entities and relationships from the given text."""
)
result = await crawler.arun(
    url=url,
    bypass_cache=True,
    extraction_strategy=extraction_strategy,
    # magic=True
)
However, it produces the output below:
Warning: Synchronous WebCrawler is not available. Install crawl4ai[sync] for synchronous support. However, please note that the synchronous version will be deprecated soon.
[LOG] 🚀 Crawl4AI 0.3.731
[LOG] 🚀 Crawling done for https://paulgraham.com/love.html, success: True, time taken: 2.72 seconds
[LOG] 🚀 Content extracted for https://paulgraham.com/love.html, success: True, time taken: 0.04 seconds
[LOG] 🔥 Extracting semantic blocks for https://paulgraham.com/love.html, Strategy: AsyncWebCrawler
[LOG] 🔥 Extracting semantic blocks for https://paulgraham.com/love.html, Strategy: AsyncWebCrawler
[LOG] 🔥 Extracting semantic blocks for https://paulgraham.com/love.html, Strategy: AsyncWebCrawler
[LOG] 🔥 Extracting semantic blocks for https://paulgraham.com/love.html, Strategy: AsyncWebCrawler
[LOG] 🔥 Extracting semantic blocks for https://paulgraham.com/love.html, Strategy: AsyncWebCrawler
[LOG] 🔥 Extracting semantic blocks for https://paulgraham.com/love.html, Strategy: AsyncWebCrawler
[LOG] 🔥 Extracting semantic blocks for https://paulgraham.com/love.html, Strategy: AsyncWebCrawler
[LOG] Call LLM for https://paulgraham.com/love.html - block index: 0
[LOG] Call LLM for https://paulgraham.com/love.html - block index: 1
[LOG] Call LLM for https://paulgraham.com/love.html - block index: 2
22:23:17 - LiteLLM:INFO: utils.py:2723 -
LiteLLM completion() model= qwen2.5-coder; provider = ollama_chat
INFO:LiteLLM:
LiteLLM completion() model= qwen2.5-coder; provider = ollama_chat
22:23:17 - LiteLLM:INFO: utils.py:2723 -
LiteLLM completion() model= qwen2.5-coder; provider = ollama_chat
22:23:17 - LiteLLM:INFO: utils.py:2723 -
LiteLLM completion() model= qwen2.5-coder; provider = ollama_chat
INFO:LiteLLM:
LiteLLM completion() model= qwen2.5-coder; provider = ollama_chat
INFO:LiteLLM:
LiteLLM completion() model= qwen2.5-coder; provider = ollama_chat
It then seems to wait for the program to return until I power the machine off. My questions are:
- Ollama and the qwen2.5-coder model are running correctly, and I can interact with them directly without any problem. So how should I use an Ollama LLM within crawl4ai?
- Judging from the logs, each call to crawler.arun seems to create six threads or processes for the LLM step. Can I control the number of threads? (PS: I installed crawl4ai via pip rather than from source.) Thank you very much!
Wow, it took 20 minutes to finish this small job on my laptop. It is probably worthwhile to limit the concurrency on non-GPU machines.
@zlonqi Thanks for trying our library. You can pass chunk_token_threshold to your LLMExtractionStrategy; this parameter controls the size of the chunks sent to the LLM in parallel. The default, which comes from the config, is 2048 characters/tokens. If you want fewer parallel calls, set a larger value: a bigger chunk size means fewer chunks and therefore fewer parallel calls to the LLM. If you set it effectively to infinity, everything goes out in a single call, so you could try that as well. Let me know how it changes things. I will also try running the code on a plain CPU and look for the best hyper-parameters. One more thing: make sure your Ollama instance is running in parallel mode so it supports concurrent calls; that is very important, especially on CPU-only devices.
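For reference, a minimal sketch of what that might look like, reusing the snippet from the question above; the only addition is the chunk_token_threshold keyword, and the particular value is an illustration rather than a recommendation:

extraction_strategy = LLMExtractionStrategy(
    provider='ollama_chat/qwen2.5-coder',
    url_base="http://localhost:11434",
    api_token=os.getenv('OPENAI_API_KEY'),
    schema=KnowledgeGraph.model_json_schema(),
    extraction_type="schema",
    instruction="Extract entities and relationships from the given text.",
    # Bigger chunks -> fewer chunks -> fewer parallel LLM calls.
    # A very large value effectively sends the whole page as one chunk.
    chunk_token_threshold=100_000,
)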
Thanks a lot. I tried setting chunk_token_threshold=100000000 in LLMExtractionStrategy to reduce the thread concurrency, and it works: now only 2 threads start. The log is below:
Warning: Synchronous WebCrawler is not available. Install crawl4ai[sync] for synchronous support. However, please note that the synchronous version will be deprecated soon.
[LOG] 🚀 Crawl4AI 0.3.731
[LOG] 🚀 Crawling done for https://paulgraham.com/love.html, success: True, time taken: 2.78 seconds
[LOG] 🚀 Content extracted for https://paulgraham.com/love.html, success: True, time taken: 0.05 seconds
[LOG] 🔥 Extracting semantic blocks for https://paulgraham.com/love.html, Strategy: AsyncWebCrawler
[LOG] Call LLM for https://paulgraham.com/love.html - block index: 0
22:41:46 - LiteLLM:INFO: utils.py:2723 -
LiteLLM completion() model= qwen2.5-coder; provider = ollama_chat
INFO:LiteLLM:
LiteLLM completion() model= qwen2.5-coder; provider = ollama_chat
However, the job still takes roughly as long as before, about 12 minutes. Meanwhile, sending the request to the Ollama server directly is far more efficient; it took about half a minute. So I settled on a workable solution: use crawl4ai for the crawling only, and run the LLM step through Ollama in its original style. Good night and God bless you!
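In case it helps others, a rough sketch of that split: crawl4ai does only the crawl and hands the page markdown to the local Ollama HTTP API in a single request. This assumes the default Ollama endpoint and the same model as above; the prompt and helper name are illustrative.

import asyncio
import requests
from crawl4ai import AsyncWebCrawler

async def crawl_then_extract(url: str) -> str:
    # 1) Use crawl4ai for the crawling step only.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, bypass_cache=True)
        page_text = result.markdown

    # 2) Send the extracted text to the local Ollama server in one request.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "qwen2.5-coder",
            "prompt": "Extract entities and relationships from the given text:\n\n" + page_text,
            "stream": False,
        },
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(asyncio.run(crawl_then_extract("https://paulgraham.com/love.html")))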
Can you share the specifics of your system? I'm actually very interested to see how far I can push it and make it faster and more efficient on a CPU-only device. Also, let me know exactly which Ollama model you are using (are you still on qwen2.5-coder?). I'll try to simulate your system specs and check whether the slowdown comes from this part of the library or from Ollama itself. Please help me with this; we can definitely make something good out of it.