
crawling with Ollama uses OpenAI and requires API_token to be set

Open Praj-17 opened this issue 1 year ago • 1 comment

Hello, I am trying to use crawl4ai with Ollama as the backend, as listed in config.py and mentioned on Ollama's providers page. Note that I already have an OpenAI key set in my environment, so I am popping it to make sure crawl4ai doesn't use it under the hood.

Here is what I am doing

import os
import json
import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

os.environ['LITELLM_LOG'] = 'DEBUG'
os.environ.pop('OPENAI_API_KEY', None)  # make sure crawl4ai cannot fall back to the OpenAI key

key = os.getenv("OPENAI_API_KEY")
print(key) # Prints None
async def extract_tech_content(url):
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url=url,
            extraction_strategy=LLMExtractionStrategy(
                provider="ollama/llama3.2",
                api_base="https://iris.aihello.com/ollama/api/generate/",
                instruction="Get me the page content without any header/footer tab data",
                api_token=None  # intentionally None; this is what triggers the ValueError below
            ),

            bypass_cache=True,
        )

    tech_content = json.loads(result.extracted_content)
    print(f"Number of tech-related items extracted: {len(tech_content)}")

    with open(r"data/output_2.json", "w+", encoding="utf-8") as f:
        json.dump(tech_content, f, indent=2)
    return tech_content 

if __name__ == "__main__":
    tech_content = asyncio.run(extract_tech_content("https://medium.com/learning-new-stuff/tips-for-writing-medium-articles-df8d7c7b33bf"))

This is the error that I get:

Traceback (most recent call last):
  File "E:\Prajwal\llm-project\llm_scrapping\crawl4ai_scrapper\crawl4ai_with_ollama.py", line 33, in <module>
    tech_content = asyncio.run(extract_tech_content("https://medium.com/learning-new-stuff/tips-for-writing-medium-articles-df8d7c7b33bf"))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\Lib\asyncio\runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\Lib\asyncio\runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\Lib\asyncio\base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "E:\Prajwal\llm-project\llm_scrapping\crawl4ai_scrapper\crawl4ai_with_ollama.py", line 15, in extract_tech_content
    extraction_strategy=LLMExtractionStrategy(
                        ^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Prajwal\llm-project\llm_scrapping\env\Lib\site-packages\crawl4ai\extraction_strategy.py", line 89, in __init__
    raise ValueError("API token must be provided for LLMExtractionStrategy. Update the config.py or set OPENAI_API_KEY environment variable.")
ValueError: API token must be provided for LLMExtractionStrategy. Update the config.py or set OPENAI_API_KEY environment variable.

Am I doing something wrong, or is it an issue with Crawl4AI?

Praj-17 avatar Oct 14 '24 09:10 Praj-17

Hi, I am also using Ollama as the provider. Try it like this; it works fine for me. For the model, give your own model name.

strategy = LLMExtractionStrategy(
    provider="ollama/llama3",
    base_url='http url for your ollama service',
    api_token='ollama',
    apply_chunking=True,
    bypass_cache=True,
)

Mahizha-N-S avatar Oct 14 '24 09:10 Mahizha-N-S

Hi @Praj-17, thanks for using Crawl4ai. You should pass something, because the api_token parameter is compulsory, but it can be any value. Similar to what @Mahizha-N-S illustrated here, just pass the string 'ollama' and it works.
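
Applied to the snippet from the issue, a minimal sketch of that fix might look like this (all arguments carried over from the original code; only api_token changes, and any non-empty placeholder string should do):

from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Same arguments as in the issue; api_token is a placeholder, not a real key.
strategy = LLMExtractionStrategy(
    provider="ollama/llama3.2",
    api_base="https://iris.aihello.com/ollama/api/generate/",
    instruction="Get me the page content without any header/footer tab data",
    api_token="ollama",  # any non-empty string satisfies the compulsory-token check
)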

unclecode avatar Oct 16 '24 06:10 unclecode

Hi, I'm coming back to this issue again. My Ollama server is at the URL shown in the original code, yet crawl4ai still uses the local llama model downloaded on the machine running the code instead of the one behind the given api_base.

However, this code from the Litellm docs, works perfectly.

from litellm import completion

response = completion(
    model="ollama/llama2", 
    messages=[{ "content": "respond in 20 words. who are you?","role": "user"}], 
    api_base="http://localhost:11434" # replacing this localhost URL with my server endpoint
)
print(response)

I tried passing the api_base parameter to the AsyncWebCrawler as well, but it doesn't help.
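
One variant worth sketching, based on @Mahizha-N-S's snippet above, is passing base_url instead of api_base (untested here, and whether the strategy actually forwards it to litellm is exactly what is in question):

from crawl4ai.extraction_strategy import LLMExtractionStrategy

# base_url instead of api_base, mirroring the parameter name from the working example above.
strategy = LLMExtractionStrategy(
    provider="ollama/llama3.2",
    base_url="https://iris.aihello.com/ollama/api/generate/",
    instruction="Get me the page content without any header/footer tab data",
    api_token="ollama",
)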

Any suggestions @unclecode ?

Praj-17 avatar Oct 18 '24 14:10 Praj-17

@Praj-17 This is very interesting. Okay, let me check on this during the week and get back to you soon. I feel something is missing somewhere. What I'll probably do is set up two Ollama servers locally, one on the default port and one on a new port, then switch between them and see how it goes, since it's easier for me to set up Ollama locally than on a cloud server.
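
For reference, a quick way to check which server actually answers is to reuse the litellm pattern from above against both ports (11435 is an assumed port for the hypothetical second instance):

from litellm import completion

# 11434 is Ollama's default port; 11435 is an assumed port for a second local instance.
for api_base in ("http://localhost:11434", "http://localhost:11435"):
    response = completion(
        model="ollama/llama3.2",
        messages=[{"role": "user", "content": "Respond in 20 words. Who are you?"}],
        api_base=api_base,
    )
    print(api_base, "->", response)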

unclecode avatar Oct 20 '24 11:10 unclecode