crawl4ai
[Bug]: KeyError: 'provider' in LLMExtractionStrategy when using Ollama (llama3.2:3b)
crawl4ai version
Crawl4AI 0.6.3
Expected Behavior
I expected to be able to use Crawl4AI 0.6.3 with a local Ollama model (llama3.2:3b) to extract structured data (news articles) from websites. Specifically, after configuring LLMExtractionStrategy with LLMConfig (including the provider set to "ollama/llama3.2:3b"), I expected the script to initialize correctly and proceed with crawling and extracting data without errors.
Current Behavior
Instead of the expected behavior, I encounter a KeyError: 'provider' when initializing LLMExtractionStrategy. The error is raised in the __setattr__ method of LLMExtractionStrategy, which looks up all_params[name].default with name set to 'provider'; because 'provider' is not among the inspected parameters, the lookup fails and the script cannot proceed.
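To illustrate the failure mode I believe I am hitting, here is a stdlib-only sketch of the suspected pattern. Base, Strategy, and the signature lookup are hypothetical reconstructions for illustration, not crawl4ai source: a __setattr__ that validates every attribute name against an inspected __init__ signature will raise KeyError for any name, such as 'provider', that is absent from that signature.

```python
import inspect

class Base:
    def __init__(self, instruction="", schema=None):
        self.instruction = instruction
        self.schema = schema

class Strategy(Base):
    def __setattr__(self, name, value):
        # Validate every attribute against the parameters of Base.__init__;
        # names absent from that signature (like 'provider') raise KeyError.
        all_params = inspect.signature(Base.__init__).parameters
        if value != all_params[name].default:
            object.__setattr__(self, name, value)

obj = Strategy.__new__(Strategy)  # skip __init__; we only probe __setattr__
try:
    obj.provider = "ollama/llama3.2:3b"
except KeyError as exc:
    print(f"KeyError: {exc}")  # KeyError: 'provider'
```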
Is this reproducible?
Yes
Inputs Causing the Bug
URL(s):
https://example.com/news
https://example.org/news
(These are placeholder URLs for the purpose of this report; the actual URLs are news websites.)
Settings used:
Crawl4AI version: 0.6.3
Model: llama3.2:3b (via Ollama)
Provider: "ollama/llama3.2:3b"
API base URL: http://localhost:11434/v1
API token: "no-token" (not required for local Ollama setup)
Browser configuration: Headless mode, JavaScript enabled, HTTPS errors ignored
Cache mode: BYPASS
Input data (if applicable):
The script loads configuration from a config.json file (see code snippets below for details).
Steps to Reproduce
Set up a local Ollama server with the llama3.2:3b model running at http://localhost:11434/v1.
Create a Python script (e.g., extract.py) with the following code to use Crawl4AI for extracting data (see code snippets below).
Create a config.json file with the configuration for the script (see code snippets below).
Run the script using python extract.py.
Observe the KeyError: 'provider' error in the console.
Code snippets
extract.py
import json
import os
from crawl4ai.async_webcrawler import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.async_configs import CrawlerRunConfig, BrowserConfig, LLMConfig
from crawl4ai.cache_context import CacheMode
import logging
import asyncio

# Logging configuration
logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s - %(levelname)s - %(message)s',
                    handlers=[logging.StreamHandler()])
logger = logging.getLogger(__name__)

# Custom extraction strategy for Ollama
class CustomExtractionStrategy(LLMExtractionStrategy):
    def __init__(self,
                 model_name: str,
                 categories: list,
                 prompt_template: str,
                 language: str,
                 tone: str,
                 api_base_url: str,
                 llm_config: LLMConfig):
        # Initialize LLMExtractionStrategy
        super().__init__(
            extraction_type="schema",
            schema={
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "title": {"type": "string"},
                        "description": {"type": "string"},
                        "date": {"type": "string"},
                        "category": {"type": "string", "enum": categories}
                    },
                    "required": ["title", "description", "date", "category"]
                }
            },
            instruction="",  # Set later
            base_url=api_base_url,
            api_token=llm_config.api_token,
            model=model_name,
            llm_config=llm_config
        )
        # Set up the prompt for extraction
        self.prompt_template = prompt_template
        self.categories = categories
        self.language = language
        self.tone = tone
        full_prompt = self.prompt_template.format(
            categories=self.categories,
            language=self.language,
            tone=self.tone
        )
        self.instruction = full_prompt

# Main function to extract data
async def extract_data():
    logger.info("Starting data extraction process...")
    # Load configuration
    config_path = os.path.join(os.path.dirname(__file__), 'config.json')
    with open(config_path, 'r', encoding='utf-8') as f:
        config = json.load(f)
    sources = config['sources']
    categories = config['categories']
    llm_config_dict = config['llm']
    prompt_template = llm_config_dict['prompt_template']
    language = config['location']['language']
    tone = config['newsletter']['tone']
    api_base_url = llm_config_dict['api_base_url']
    provider = f"ollama/{llm_config_dict['model']}"  # "ollama/llama3.2:3b"
    # Initialize LLMConfig with provider
    llm_config_obj = LLMConfig(
        provider=provider,
        base_url=api_base_url,
        api_token=llm_config_dict['api_token']
    )
    # Create extraction strategy
    extraction_strategy = CustomExtractionStrategy(
        model_name=llm_config_dict['model'],
        categories=categories,
        prompt_template=prompt_template,
        language=language,
        tone=tone,
        api_base_url=api_base_url,
        llm_config=llm_config_obj
    )
    # Configure browser for crawling
    browser_config = BrowserConfig(
        headless=True,
        ignore_https_errors=True,
        enable_javascript=True
    )
    # Run crawler for each URL
    async with AsyncWebCrawler(config=browser_config) as crawler:
        for url in sources:
            logger.info(f"Processing URL: {url}")
            crawl_config = CrawlerRunConfig(
                browser_config=browser_config,
                llm_config=llm_config_obj,
                llm_extraction_strategy=extraction_strategy,
                cache_mode=CacheMode.BYPASS,
                verbose=True
            )
            result = await crawler.arun(url=url, config=crawl_config)

if __name__ == "__main__":
    asyncio.run(extract_data())
config.json
{
  "sources": [
    "https://example.com/news",
    "https://example.org/news"
  ],
  "categories": [
    "Politics",
    "Culture",
    "Sports"
  ],
  "location": {
    "language": "English"
  },
  "newsletter": {
    "tone": "friendly"
  },
  "llm": {
    "model": "llama3.2:3b",
    "provider": "ollama/llama3.2:3b",
    "api_base_url": "http://localhost:11434/v1",
    "api_token": "no-token",
    "prompt_template": "Extract the title, description, date, and category of each news article from the page. Use the categories: {categories}. Ignore menus, ads, or irrelevant content. Generate descriptions in {language} with a {tone} tone. If no articles are found, return an empty array. Do not invent information."
  }
}
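As a sanity check, the provider string the script derives from this config (the f"ollama/{...}" line in extract.py) can be verified in isolation with plain stdlib code:

```python
import json

# Minimal slice of config.json; only the key the script actually reads.
config = json.loads('{"llm": {"model": "llama3.2:3b"}}')
provider = f"ollama/{config['llm']['model']}"
print(provider)  # ollama/llama3.2:3b
```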
OS
Windows 10
Python version
3.11
Browser
Chromium (used by Crawl4AI in headless mode)
Browser version
The version used by Crawl4AI 0.6.3 (I'm not sure of the exact Chromium version, as it's bundled with Crawl4AI).
Error logs & Screenshots (if applicable)
ERROR: 'provider'
Traceback (most recent call last):
File "D:\Projects\Barcelona\scripts\extract.py", line 218, in
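A possible library-side fix, again a hedged stdlib sketch rather than actual crawl4ai code, would be to guard the signature lookup so attribute names missing from the inspected parameters are assigned normally instead of raising:

```python
import inspect

class Base:
    def __init__(self, instruction="", schema=None):
        self.instruction = instruction
        self.schema = schema

class PatchedStrategy(Base):
    def __setattr__(self, name, value):
        # Use .get() so unknown names (e.g. 'provider') fall through to a
        # plain attribute assignment instead of raising KeyError.
        param = inspect.signature(Base.__init__).parameters.get(name)
        if param is None or value != param.default:
            object.__setattr__(self, name, value)

obj = PatchedStrategy.__new__(PatchedStrategy)
obj.provider = "ollama/llama3.2:3b"  # no KeyError
print(obj.provider)  # ollama/llama3.2:3b
```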