[Bug]: JsonCssExtractionStrategy.generate_schema(schema_type="css", ...) should return a CSS schema, but sometimes returns an XPath schema.
crawl4ai version
0.5.0.post8
Expected Behavior
{ 'name': 'Leiphone Articles', 'baseSelector': 'div.box', 'fields': [ {'name': 'title', 'selector': 'h3 > a', 'type': 'text'}, {'name': 'summary', 'selector': 'div.des', 'type': 'text'}, {'name': 'link', 'selector': 'h3 > a', 'type': 'attribute', 'attribute': 'href'}, {'name': 'date', 'selector': 'div.time', 'type': 'text'} ] }
Current Behavior
{ 'name': 'Leiphone Articles', 'baseSelector': "//div[@class='box']", 'fields': [ {'name': 'title', 'selector': './/h3/a', 'type': 'text'}, {'name': 'summary', 'selector': ".//div[@class='des']", 'type': 'text'}, {'name': 'link', 'selector': './/h3/a', 'type': 'attribute', 'attribute': 'href'}, {'name': 'date', 'selector': ".//div[@class='time']", 'type': 'text'} ] }
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
Code snippets
import requests

from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Fetch the raw HTML that the schema will be generated from
url = "https://www.leiphone.com/"
response = requests.get(url)
html_content = response.text

# Natural-language description of the fields to extract
query = """
Extract the following information from the webpage:
- Title: The title of the article or news
- Summary/Description: A brief content of the article or news
- Link: The URL pointing to the full article or news
- Date: The publication date of the article or news
"""

# Example of the desired output shape
target_json_example = """
{
    "title": "Example Article Title",
    "summary": "This is a brief description or summary of the article.",
    "link": "https://example.com/article",
    "date": "2023-10-15"
}
"""

# Lowercase "css" is what was passed when the XPath schema came back
css_schema = JsonCssExtractionStrategy.generate_schema(
    html=html_content,
    schema_type="css",
    query=query,
    target_json_example=target_json_example,
    llm_config=LLMConfig(
        provider="deepseek/deepseek-chat",
        api_token="my_api_token",
    ),
)
css_strategy = JsonCssExtractionStrategy(css_schema)
print("CSS Schema:", css_schema)
OS
windows
Python version
3.10.11
Browser
chrome
Browser version
No response
Error logs & Screenshots (if applicable)
No response
I mistakenly wrote "CSS" as "css". Passing schema_type="CSS" resolved the error.
I kept making the same mistake, going off the LLM-free strategies part of the docs:
# Option 1: Using OpenAI (requires API token)
css_schema = JsonCssExtractionStrategy.generate_schema(
    html,
    schema_type="css",
    llm_config=LLMConfig(provider="openai/gpt-4o", api_token="your-openai-token"),
)
Looks like there is a conditional in extraction_strategy.py that only matches uppercase "CSS", so anything else (including "css") falls through to the XPath prompt. That's kind of confusing given the doc example, so it probably just needs a doc fix or a case-insensitive comparison.
https://github.com/unclecode/crawl4ai/blob/e1d9e2489cd736d3af9992209268c0f601222c1a/crawl4ai/extraction_strategy.py#L1113
# Use default or custom prompt
prompt_template = JSON_SCHEMA_BUILDER if schema_type == "CSS" else JSON_SCHEMA_BUILDER_XPATH
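For illustration, here is a minimal sketch of what a case-insensitive version of that check could look like. This is not the library's actual code: `pick_prompt_template` is a hypothetical helper, and the two template constants below are placeholder strings standing in for the real prompts in crawl4ai. Raising on unknown values is an extra assumption on my part, not something the current code does.

```python
# Placeholder values; the real templates are defined inside crawl4ai.
JSON_SCHEMA_BUILDER = "<css prompt template>"
JSON_SCHEMA_BUILDER_XPATH = "<xpath prompt template>"


def pick_prompt_template(schema_type: str) -> str:
    """Hypothetical helper: choose a prompt template regardless of case."""
    normalized = schema_type.strip().upper()
    if normalized == "CSS":
        return JSON_SCHEMA_BUILDER
    if normalized == "XPATH":
        return JSON_SCHEMA_BUILDER_XPATH
    # Assumption: fail loudly instead of silently falling back to XPath.
    raise ValueError(f"Unsupported schema_type: {schema_type!r}")


# The current comparison is case-sensitive, so lowercase "css" does not match:
current_choice = JSON_SCHEMA_BUILDER if "css" == "CSS" else JSON_SCHEMA_BUILDER_XPATH
```

With the current comparison, `current_choice` ends up as the XPath template, which matches the behavior reported above. With the normalized check, "css", "CSS", and "Css" would all select the CSS template.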
looks correct in this part of the docs:
Glad to see it's fixed, so I closed this issue.