crawling with Ollama uses OpenAI and requires API_token to be set
Hello, I am trying to use crawl4ai with Ollama as the backend, as listed in config.py and mentioned on the providers page for Ollama. Note that I already have an OpenAI key set in my environment, so I am popping it to make sure crawl4ai doesn't use it under the hood.
Here is what I am doing:
import os
import json
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

os.environ['LITELLM_LOG'] = 'DEBUG'
os.environ.pop('OPENAI_API_KEY', None)
key = os.getenv("OPENAI_API_KEY")
print(key)  # Prints None

async def extract_tech_content(url):
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url=url,
            extraction_strategy=LLMExtractionStrategy(
                provider="ollama/llama3.2",
                api_base="https://iris.aihello.com/ollama/api/generate/",
                instruction="Get me the page content without any header/footer tab data",
                api_token=None
            ),
            bypass_cache=True,
        )
        tech_content = json.loads(result.extracted_content)
        print(f"Number of tech-related items extracted: {len(tech_content)}")
        with open(r"data/output_2.json", "w+", encoding="utf-8") as f:
            json.dump(tech_content, f, indent=2)
        return tech_content

if __name__ == "__main__":
    tech_content = asyncio.run(extract_tech_content("https://medium.com/learning-new-stuff/tips-for-writing-medium-articles-df8d7c7b33bf"))
This is the error that I get!
Traceback (most recent call last):
  File "E:\Prajwal\llm-project\llm_scrapping\crawl4ai_scrapper\crawl4ai_with_ollama.py", line 33, in <module>
    tech_content = asyncio.run(extract_tech_content("https://medium.com/learning-new-stuff/tips-for-writing-medium-articles-df8d7c7b33bf"))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\Lib\asyncio\runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\Lib\asyncio\runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\Lib\asyncio\base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "E:\Prajwal\llm-project\llm_scrapping\crawl4ai_scrapper\crawl4ai_with_ollama.py", line 15, in extract_tech_content
    extraction_strategy=LLMExtractionStrategy(
                        ^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Prajwal\llm-project\llm_scrapping\env\Lib\site-packages\crawl4ai\extraction_strategy.py", line 89, in __init__
    raise ValueError("API token must be provided for LLMExtractionStrategy. Update the config.py or set OPENAI_API_KEY environment variable.")
ValueError: API token must be provided for LLMExtractionStrategy. Update the config.py or set OPENAI_API_KEY environment variable.
Am I doing something wrong, or is it an issue with Crawl4AI?
Hi, I am also using Ollama as the provider. Try it like this; it works fine for me:
strategy = LLMExtractionStrategy(
    provider="ollama/llama3",
    base_url='http url for your ollama service',
    api_token='ollama',
    apply_chunking=True,
    bypass_cache=True,
)
For the model, give your own model name.
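A minimal sketch of how that strategy plugs into the crawler, based on the snippets above (the page URL, model name, and Ollama endpoint are placeholders):

import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def main():
    # Strategy as suggested above: any non-empty api_token works with Ollama
    strategy = LLMExtractionStrategy(
        provider="ollama/llama3",           # your model name
        base_url="http://localhost:11434",  # your Ollama service URL (placeholder)
        api_token="ollama",                 # placeholder token
    )
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://example.com",      # placeholder page to extract from
            extraction_strategy=strategy,
            bypass_cache=True,
        )
        print(result.extracted_content)

asyncio.run(main())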
Hi @Praj-17, thanks for using Crawl4ai. You should pass something, because the api_token parameter is compulsory, but it can be any value. Similar to what @Mahizha-N-S illustrated here, just pass the name 'Ollama' and it works.
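Applied to the snippet from the question, that just means replacing api_token=None with a placeholder, everything else unchanged:

extraction_strategy=LLMExtractionStrategy(
    provider="ollama/llama3.2",
    api_base="https://iris.aihello.com/ollama/api/generate/",
    instruction="Get me the page content without any header/footer tab data",
    api_token="Ollama",  # any non-empty placeholder satisfies the check
),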
Hi, I'm coming back to this issue again. I have my Ollama server at the URL shown in the original code. However, crawl4ai still uses the Llama model downloaded locally on the machine running that code instead of calling Llama at the given API base.
However, this code from the LiteLLM docs works perfectly.
from litellm import completion

response = completion(
    model="ollama/llama2",
    messages=[{"content": "respond in 20 words. who are you?", "role": "user"}],
    api_base="http://localhost:11434"  # replacing this localhost with my server endpoint
)
print(response)
I tried passing the api_base parameter to the AsyncWebCrawler, but it doesn't help.
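One stopgap that might be worth trying (an assumption on my part, not verified against crawl4ai's internals): LiteLLM appears to fall back to the OLLAMA_API_BASE environment variable for ollama/* models when no api_base reaches it, so exporting it before the crawl may route requests to the remote server.

import os

# Assumption: LiteLLM reads OLLAMA_API_BASE as the default endpoint for
# ollama/* models when api_base is not forwarded; untested workaround.
# The value should point at the Ollama server root, not the /api/generate path.
os.environ["OLLAMA_API_BASE"] = "https://iris.aihello.com/ollama"  # placeholder for your server

# ...then run the crawler exactly as in the snippet above.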
Any suggestions @unclecode ?
@Praj-17 This is very interesting. Okay, let me check on this within the week and I will get back to you soon. I feel something is missing somewhere. What I'm going to do is set up two Ollama servers locally, one on the default port and one on a new port, then switch to the second one and see how it goes, because it's easier for me to set up Ollama locally than to set one up in the cloud.