
[Bug]: KeyError: 'provider' in LLMExtractionStrategy when using Ollama (llama3.2:3b)

Open · carla4av opened this issue 6 months ago · 0 comments

crawl4ai version

Crawl4AI 0.6.3

Expected Behavior

I expected to be able to use Crawl4AI 0.6.3 with a local Ollama model (llama3.2:3b) to extract structured data (news articles) from websites. Specifically, after configuring LLMExtractionStrategy with LLMConfig (including the provider set to "ollama/llama3.2:3b"), I expected the script to initialize correctly and proceed with crawling and extracting data without errors.

Current Behavior

Instead of the expected behavior, initializing LLMExtractionStrategy raises KeyError: 'provider'. The error occurs in the __setattr__ method of LLMExtractionStrategy, specifically when it tries to access all_params[name].default with name set to 'provider'. This prevents the script from proceeding further.
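The failing check can be sketched in isolation. Below is a hypothetical minimal reproduction of the pattern visible in the traceback (not crawl4ai's actual source, and the real class names differ): a base-class __setattr__ looks up each deprecated attribute name in the signature of self.__init__, which resolves to the *subclass* __init__. Since CustomExtractionStrategy's __init__ has no provider parameter, the lookup raises KeyError.

```python
# Hypothetical sketch of the failing pattern from the traceback; class names
# and messages are placeholders, not crawl4ai's real code.
import inspect

class Base:
    _UNWANTED_PROPS = {"provider": "use llm_config instead"}

    def __init__(self, provider=None):
        self.provider = provider  # routed through __setattr__ below

    def __setattr__(self, name, value):
        # self.__init__ is the *subclass* __init__ on a subclass instance,
        # so all_params comes from the subclass signature.
        all_params = inspect.signature(self.__init__).parameters
        if name in self._UNWANTED_PROPS and value is not all_params[name].default:
            raise AttributeError(f"{name} is deprecated: {self._UNWANTED_PROPS[name]}")
        super().__setattr__(name, value)

class Sub(Base):
    def __init__(self):  # no `provider` parameter in this signature
        super().__init__(provider="ollama/llama3.2:3b")

try:
    Sub()
except KeyError as e:
    print(f"KeyError: {e}")  # prints: KeyError: 'provider'
```

This matches the traceback: the base class sets self.provider, the deprecation check indexes all_params["provider"], and the key is absent because the subclass signature is the one being inspected.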

Is this reproducible?

Yes

Inputs Causing the Bug

URL(s):
https://example.com/news
https://example.org/news
(These are placeholder URLs for the purpose of this report; the actual URLs are news websites.)

Settings used:  
Crawl4AI version: 0.6.3  

Model: llama3.2:3b (via Ollama)  

Provider: "ollama/llama3.2:3b"  

API base URL: http://localhost:11434/v1  

API token: "no-token" (not required for local Ollama setup)  

Browser configuration: Headless mode, JavaScript enabled, HTTPS errors ignored  

Cache mode: BYPASS

Input data (if applicable):  
The script loads configuration from a config.json file (see code snippets below for details).

Steps to Reproduce

Set up a local Ollama server with the llama3.2:3b model running at http://localhost:11434/v1.  

Create a Python script (e.g., extract.py) with the following code to use Crawl4AI for extracting data (see code snippets below).  

Create a config.json file with the configuration for the script (see code snippets below).  

Run the script using python extract.py.  

Observe the KeyError: 'provider' error in the console.

Code snippets

extract.py
import json
import os
from crawl4ai.async_webcrawler import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.async_configs import CrawlerRunConfig, BrowserConfig, LLMConfig
from crawl4ai.cache_context import CacheMode
import logging
import asyncio

# Logging configuration
logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s - %(levelname)s - %(message)s',
                    handlers=[logging.StreamHandler()])
logger = logging.getLogger(__name__)

# Custom extraction strategy for Ollama
class CustomExtractionStrategy(LLMExtractionStrategy):
    def __init__(self, 
                 model_name: str, 
                 categories: list, 
                 prompt_template: str, 
                 language: str, 
                 tone: str, 
                 api_base_url: str, 
                 llm_config: LLMConfig
                ):
        # Initialize LLMExtractionStrategy
        super().__init__(
            extraction_type="schema",
            schema={
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "title": {"type": "string"},
                        "description": {"type": "string"},
                        "date": {"type": "string"},
                        "category": {"type": "string", "enum": categories}
                    },
                    "required": ["title", "description", "date", "category"]
                }
            },
            instruction="",  # Set later
            base_url=api_base_url,
            api_token=llm_config.api_token,
            model=model_name,
            llm_config=llm_config
        )
        
        # Set up the prompt for extraction
        self.prompt_template = prompt_template
        self.categories = categories
        self.language = language
        self.tone = tone
        full_prompt = self.prompt_template.format(
            categories=self.categories,
            language=self.language,
            tone=self.tone
        )
        self.instruction = full_prompt

# Main function to extract data
async def extract_data():
    logger.info("Starting data extraction process...")
    
    # Load configuration
    config_path = os.path.join(os.path.dirname(__file__), 'config.json')
    with open(config_path, 'r', encoding='utf-8') as f:
        config = json.load(f)

    sources = config['sources']
    categories = config['categories']
    llm_config_dict = config['llm']
    prompt_template = llm_config_dict['prompt_template']
    language = config['location']['language']
    tone = config['newsletter']['tone']
    api_base_url = llm_config_dict['api_base_url']
    provider = f"ollama/{llm_config_dict['model']}"  # "ollama/llama3.2:3b"

    # Initialize LLMConfig with provider
    llm_config_obj = LLMConfig(
        provider=provider,
        base_url=api_base_url,
        api_token=llm_config_dict['api_token']
    )

    # Create extraction strategy
    extraction_strategy = CustomExtractionStrategy(
        model_name=llm_config_dict['model'],
        categories=categories,
        prompt_template=prompt_template,
        language=language,
        tone=tone,
        api_base_url=api_base_url,
        llm_config=llm_config_obj
    )
    
    # Configure browser for crawling
    browser_config = BrowserConfig(
        headless=True,
        ignore_https_errors=True,
        enable_javascript=True
    )

    # Run crawler for each URL
    async with AsyncWebCrawler(config=browser_config) as crawler:
        for url in sources:
            logger.info(f"Processing URL: {url}")
            crawl_config = CrawlerRunConfig(
                browser_config=browser_config,
                llm_config=llm_config_obj,
                llm_extraction_strategy=extraction_strategy,
                cache_mode=CacheMode.BYPASS,
                verbose=True
            )
            result = await crawler.arun(url=url, config=crawl_config)

if __name__ == "__main__":
    asyncio.run(extract_data())

config.json
{
  "sources": [
    "https://example.com/news",
    "https://example.org/news"
  ],
  "categories": [
    "Politics",
    "Culture",
    "Sports"
  ],
  "location": {
    "language": "English"
  },
  "newsletter": {
    "tone": "friendly"
  },
  "llm": {
    "model": "llama3.2:3b",
    "provider": "ollama/llama3.2:3b",
    "api_base_url": "http://localhost:11434/v1",
    "api_token": "no-token",
    "prompt_template": "Extract the title, description, date, and category of each news article from the page. Use the categories: {categories}. Ignore menus, ads, or irrelevant content. Generate descriptions in {language} with a {tone} tone. If no articles are found, return an empty array. Do not invent information."
  }
}

OS

Windows 10

Python version

3.11

Browser

Chromium (used by Crawl4AI in headless mode)

Browser version

The version used by Crawl4AI 0.6.3 (I’m not sure of the exact Chromium version, as it’s bundled with Crawl4AI).

Error logs & Screenshots (if applicable)

ERROR: 'provider'
Traceback (most recent call last):
  File "D:\Projects\Barcelona\scripts\extract.py", line 218, in <module>
    extracted_data = asyncio.run(extract_data())
  File "C:\Users\Carla\AppData\Local\Programs\Python\Python311\Lib\asyncio\runners.py", line 190, in run
    return runner.run(main)
  File "C:\Users\Carla\AppData\Local\Programs\Python\Python311\Lib\asyncio\runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "C:\Users\Carla\AppData\Local\Programs\Python\Python311\Lib\asyncio\base_events.py", line 653, in run_until_complete
    return future.result()
  File "D:\Projects\Barcelona\scripts\extract.py", line 153, in extract_data
    extraction_strategy = CustomExtractionStrategy(...)
  File "D:\Projects\Barcelona\scripts\extract.py", line 34, in __init__
    super().__init__(...)
  File "D:\Projects\Barcelona\venv\Lib\site-packages\crawl4ai\extraction_strategy.py", line 570, in __init__
    self.provider = provider
  File "D:\Projects\Barcelona\venv\Lib\site-packages\crawl4ai\extraction_strategy.py", line 582, in __setattr__
    if name in self._UNWANTED_PROPS and value is not all_params[name].default:
KeyError: 'provider'
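Given that traceback, one possible workaround, sketched here under the assumption that the deprecation check inspects inspect.signature(self.__init__) (so the subclass signature is what gets looked up; this has not been verified against crawl4ai itself, and Base/Sub below are stand-ins, not the library's classes), is to mirror the deprecated provider parameter, with the same default, in the subclass __init__ so the lookup succeeds:

```python
import inspect

# Stand-ins mimicking the pattern in the traceback, not crawl4ai's real code.
class Base:
    _UNWANTED_PROPS = {"provider": "pass llm_config instead"}

    def __init__(self, provider=None):
        self.provider = provider

    def __setattr__(self, name, value):
        all_params = inspect.signature(self.__init__).parameters
        if name in self._UNWANTED_PROPS and value is not all_params[name].default:
            raise AttributeError(f"{name} is deprecated: {self._UNWANTED_PROPS[name]}")
        super().__setattr__(name, value)

class Sub(Base):
    # Workaround: declare `provider` with the same default as the base class,
    # so inspect.signature(self.__init__) can resolve all_params["provider"].
    def __init__(self, provider=None):
        super().__init__(provider=provider)

s = Sub()          # no KeyError raised
print(s.provider)  # prints: None
```

The actual provider would still be supplied through LLMConfig, as in the script above; the mirrored parameter only exists so the signature lookup does not fail.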

(Screenshot attached.)

carla4av · May 28 '25 18:05