
[Bug]: JsonCssExtractionStrategy.generate_schema(schema_type="css", ...) should return a CSS schema, but sometimes returns an XPath schema.

Open catorsu opened this issue 8 months ago • 1 comments

crawl4ai version

0.5.0.post8

Expected Behavior

{ 'name': 'Leiphone Articles', 'baseSelector': 'div.box', 'fields': [ {'name': 'title', 'selector': 'h3 > a', 'type': 'text'}, {'name': 'summary', 'selector': 'div.des', 'type': 'text'}, {'name': 'link', 'selector': 'h3 > a', 'type': 'attribute', 'attribute': 'href'}, {'name': 'date', 'selector': 'div.time', 'type': 'text'} ] }

Current Behavior

{ 'name': 'Leiphone Articles', 'baseSelector': "//div[@class='box']", 'fields': [ {'name': 'title', 'selector': './/h3/a', 'type': 'text'}, {'name': 'summary', 'selector': ".//div[@class='des']", 'type': 'text'}, {'name': 'link', 'selector': './/h3/a', 'type': 'attribute', 'attribute': 'href'}, {'name': 'date', 'selector': ".//div[@class='time']", 'type': 'text'} ] }
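The two outputs above differ only in selector syntax, so a small check can flag when an XPath schema comes back in place of a CSS one. This is a heuristic sketch, not part of crawl4ai: `looks_like_xpath` and `schema_uses_xpath` are hypothetical helper names, and the check inspects only the top-level `baseSelector` and `fields` entries.

```python
def looks_like_xpath(selector: str) -> bool:
    """Heuristic: XPath expressions start with '/', './' or '(';
    a valid CSS selector never starts with any of these."""
    return selector.strip().startswith(("/", "./", "("))


def schema_uses_xpath(schema: dict) -> bool:
    """Return True if the base selector or any top-level field selector looks like XPath."""
    selectors = [schema.get("baseSelector", "")]
    selectors += [field.get("selector", "") for field in schema.get("fields", [])]
    return any(looks_like_xpath(s) for s in selectors)


# Shortened versions of the expected and current schemas from this report.
expected = {
    "baseSelector": "div.box",
    "fields": [{"name": "title", "selector": "h3 > a", "type": "text"}],
}
current = {
    "baseSelector": "//div[@class='box']",
    "fields": [{"name": "title", "selector": ".//h3/a", "type": "text"}],
}

print(schema_uses_xpath(expected))  # False
print(schema_uses_xpath(current))   # True
```

Running such a check right after generate_schema would make the mismatch visible before the schema is passed to JsonCssExtractionStrategy.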

Is this reproducible?

Yes

Inputs Causing the Bug


Steps to Reproduce


Code snippets

import requests

# Import paths as shown in the crawl4ai docs.
from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Fetch the raw HTML that the schema will be generated from.
url = "https://www.leiphone.com/"
response = requests.get(url)
html_content = response.text

query = """
Extract the following information from the webpage:
- Title: The title of the article or news
- Summary/Description: A brief content of the article or news
- Link: The URL pointing to the full article or news
- Date: The publication date of the article or news
"""
target_json_example = """
{
    "title": "Example Article Title",
    "summary": "This is a brief description or summary of the article.",
    "link": "https://example.com/article",
    "date": "2023-10-15"
}
"""

# Ask the LLM to generate a schema; schema_type is lowercase here, as in the report.
css_schema = JsonCssExtractionStrategy.generate_schema(
    html=html_content,
    schema_type="css",
    query=query,
    target_json_example=target_json_example,
    llm_config=LLMConfig(
        provider="deepseek/deepseek-chat",
        api_token="my_api_token",
    ),
)

css_strategy = JsonCssExtractionStrategy(css_schema)

print("CSS Schema:", css_schema)

OS

windows

Python version

3.10.11

Browser

chrome

Browser version

No response

Error logs & Screenshots (if applicable)

No response

catorsu, Apr 01 '25 13:04

I mistakenly wrote CSS as css, and the error has been resolved.

catorsu, Apr 01 '25 13:04

I mistakenly wrote CSS as css, and the error has been resolved.

I kept making the same mistake, because I was following the LLM-free strategies part of the docs:

# Option 1: Using OpenAI (requires API token)
css_schema = JsonCssExtractionStrategy.generate_schema(
    html,
    schema_type="css",
    llm_config=LLMConfig(provider="openai/gpt-4o", api_token="your-openai-token"),
)

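It looks like extraction_strategy.py has a conditional that only matches the exact string "CSS", so any other value, including lowercase "css", falls through to the XPath prompt. That is confusing given the doc example, so it probably just needs either a doc fix or a case-insensitive comparison.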

https://github.com/unclecode/crawl4ai/blob/e1d9e2489cd736d3af9992209268c0f601222c1a/crawl4ai/extraction_strategy.py#L1113

# Use default or custom prompt
prompt_template = JSON_SCHEMA_BUILDER if schema_type == "CSS" else JSON_SCHEMA_BUILDER_XPATH
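A case-insensitive version of that check could look like the sketch below. This is not the library's actual fix: the two template constants are stand-in strings, and `pick_prompt_template` is a hypothetical helper name used only for illustration.

```python
# Stand-ins for the real prompt templates in crawl4ai (assumption: the real
# values are long prompt strings; only their identity matters here).
JSON_SCHEMA_BUILDER = "<css prompt template>"
JSON_SCHEMA_BUILDER_XPATH = "<xpath prompt template>"


def pick_prompt_template(schema_type: str) -> str:
    """Hypothetical helper: choose the prompt template regardless of letter case."""
    normalized = schema_type.strip().upper()
    if normalized == "CSS":
        return JSON_SCHEMA_BUILDER
    if normalized == "XPATH":
        return JSON_SCHEMA_BUILDER_XPATH
    # Fail loudly instead of silently falling back to the XPath template.
    raise ValueError(f"Unsupported schema_type: {schema_type!r}")
```

With this normalization, "css", "Css" and "CSS" all select the CSS template, and a typo raises an error instead of quietly producing an XPath schema.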

The uppercase value is used correctly in this part of the docs:

2-automatic-schema-generation

ztbochanski, Apr 08 '25 03:04

I mistakenly wrote CSS as css, and the error has been resolved.

Glad to see it's fixed, so I closed this issue.

ntohidi, Apr 08 '25 11:04