[Bug]: JsonCssExtractionStrategy not returning results (even with doc example)
crawl4ai version
0.4.248
Expected Behavior
JsonCssExtractionStrategy should return results, and using the example in "Pattern-Based with JsonCssExtractionStrategy" should not return empty.
Current Behavior
I was trying to configure JsonCssExtractionStrategy for my own use, and I kept getting no results even with a very simple schema. So I went back to the example from the docs, pasted it into a script, and ran it; it returned nothing. See screenshot. (I tried changing baseSelector to "tr.athing submission" because that is what ycombinator shows as the current table row class, but no variation worked.)
See bottom: "Sample extracted items: []"
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
Run the sample script as-is
Code snippets
Exactly as from https://docs.crawl4ai.com/core/content-selection/ section 4.1
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # Minimal schema for repeated items
    schema = {
        "name": "News Items",
        "baseSelector": "tr.athing",
        "fields": [
            {"name": "title", "selector": "a.storylink", "type": "text"},
            {
                "name": "link",
                "selector": "a.storylink",
                "type": "attribute",
                "attribute": "href"
            }
        ]
    }

    config = CrawlerRunConfig(
        # Content filtering
        excluded_tags=["form", "header"],
        exclude_domains=["adsite.com"],
        # CSS selection or entire page
        css_selector="table.itemlist",
        # No caching for demonstration
        cache_mode=CacheMode.BYPASS,
        # Extraction strategy
        extraction_strategy=JsonCssExtractionStrategy(schema)
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com/newest",
            config=config
        )
        data = json.loads(result.extracted_content)
        print("Sample extracted item:", data[:1])  # Show first item

if __name__ == "__main__":
    asyncio.run(main())
OS
MacOS
Python version
3.12.8
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
No response
@encoded-evolution Could you update the issue with your code in the "Code snippets" section? Currently you have only shared a screenshot of your code, and it's hard to investigate the issue from that alone.
@aravindkarnam I just updated above.
It is literally a copy/paste from https://docs.crawl4ai.com/core/content-selection/ section 4.1
That's exactly what the test was.
@encoded-evolution It seems there is an issue with the structure of the page on this link. Please try the schema I've posted below; it's working on my end.
@aravindkarnam The documentation needs to be updated to reflect this.
schema = {
    "name": "News Items",
    "baseSelector": "tr.athing",
    "fields": [
        {"name": "title", "selector": "span.titleline", "type": "text"},
        {
            "name": "link",
            "selector": "span.titleline a",
            "type": "attribute",
            "attribute": "href"
        }
    ]
}
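For anyone hitting the same empty result, here is a minimal sketch of why the old selectors stop matching. It uses only Python's standard library, not crawl4ai, and the HTML sample is a simplified assumption of what a Hacker News row looks like now, not a copy of the live page. Note also that "tr.athing" already matches a row whose class attribute is "athing submission", because a class selector matches any one of the element's classes; "tr.athing submission" instead looks for a <submission> element inside the row, and requiring both classes would be written "tr.athing.submission".

```python
from html.parser import HTMLParser

# Assumed, simplified row markup (hypothetical sample, not fetched from the site).
SAMPLE_ROW = """
<table>
  <tr class="athing submission" id="1">
    <td class="title">
      <span class="titleline"><a href="https://example.com/a">Example A</a></span>
    </td>
  </tr>
</table>
"""


class LinkFinder(HTMLParser):
    """Record which <a> tags would match each of the two selectors."""

    def __init__(self):
        super().__init__()
        self.open_spans = []       # class lists of the <span> tags currently open
        self.storylinks = []       # hrefs matching "a.storylink"
        self.titleline_links = []  # hrefs matching "span.titleline a"

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span":
            self.open_spans.append(classes)
        elif tag == "a":
            if "storylink" in classes:
                self.storylinks.append(attrs.get("href"))
            # Descendant match: any open <span> carrying the "titleline" class.
            if any("titleline" in span_classes for span_classes in self.open_spans):
                self.titleline_links.append(attrs.get("href"))

    def handle_endtag(self, tag):
        if tag == "span" and self.open_spans:
            self.open_spans.pop()


finder = LinkFinder()
finder.feed(SAMPLE_ROW)
print("a.storylink:", finder.storylinks)
print("span.titleline a:", finder.titleline_links)
```

On this sample the first list is empty and the second holds the one link, which mirrors the empty extraction reported above.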
@sufianuddin thanks, I am new to working with webcrawlers in general, so your help is appreciated.
I can confirm this is not a bug and this thread can be closed; with sufi's schema it works as expected.
@aravindkarnam Web page changes happen all the time and if you base documentation on a moving target, your docs will always be out of date. Recommend you show the structure that your examples are designed for instead of relying on a website never going out of date. For instance, you can provide a reference image from a browser's inspect panel as shown below. And n00bs like me will be able to better understand what your software does. (BTW: you are building a great tool here! Awesome job!)
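One possible way to act on this suggestion, sketched with the standard library only: keep a small HTML sample of the structure an example was written for, and flag any schema selector that names a class the sample no longer contains. The helper `stale_selectors` and the sample `DOCUMENTED_HTML` are hypothetical names made up for this illustration; they are not part of crawl4ai, and the check is a crude class-name lookup rather than real CSS matching.

```python
import re
from html.parser import HTMLParser

# Hypothetical pinned sample of the structure the docs example targets.
DOCUMENTED_HTML = (
    '<tr class="athing submission"><td class="title">'
    '<span class="titleline"><a href="https://example.com">Example</a></span>'
    "</td></tr>"
)


class ClassCollector(HTMLParser):
    """Gather every class name that appears in the HTML."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def stale_selectors(schema, html):
    """Return selectors that mention a class absent from the HTML (crude check)."""
    collector = ClassCollector()
    collector.feed(html)
    selectors = [schema["baseSelector"]] + [f["selector"] for f in schema["fields"]]
    stale = []
    for selector in selectors:
        # Pull out ".classname" tokens; tags and combinators are ignored.
        needed = re.findall(r"\.([A-Za-z_][\w-]*)", selector)
        if any(c not in collector.classes for c in needed) and selector not in stale:
            stale.append(selector)
    return stale


old_schema = {
    "baseSelector": "tr.athing",
    "fields": [{"name": "title", "selector": "a.storylink", "type": "text"}],
}
new_schema = {
    "baseSelector": "tr.athing",
    "fields": [{"name": "title", "selector": "span.titleline", "type": "text"}],
}

print(stale_selectors(old_schema, DOCUMENTED_HTML))
print(stale_selectors(new_schema, DOCUMENTED_HTML))
```

Running the same check against freshly fetched HTML would give an early hint that a documented schema has drifted from the live page.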
@sufianuddin Thanks for updating the example. Great job! I've updated the example based on your input.
"Web page changes happen all the time and if you base documentation on a moving target, your docs will always be out of date"
@encoded-evolution That's true. But we want to give real-life examples that are both useful and interesting. We have a vibrant community that's taking care of our documentation, so I'm confident that with time we'll be able to keep up with changes. Thanks for trying Crawl4AI, keep coming back!
Updated documentation is now available at https://docs.crawl4ai.com/core/content-selection/