[Bug]: JsonCssExtractionStrategy not returning results (even with doc example)

Open encoded-evolution opened this issue 10 months ago • 4 comments

crawl4ai version

0.4.248

Expected Behavior

JsonCssExtractionStrategy should return results, and using the example in "Pattern-Based with JsonCssExtractionStrategy" should not return empty.

Current Behavior

I was trying to configure JsonCssExtractionStrategy for my own use, and I kept getting no results, even with a very simple schema. So I went back to the example from the docs, pasted it into a script, and ran it; it also returned nothing. See screenshot. (I tried changing baseSelector to "tr.athing submission" because that is what ycombinator shows as the current table row style, but no variation worked.)

See bottom: "Sample extracted items: []"

Image

Is this reproducible?

Yes

Inputs Causing the Bug


Steps to Reproduce

Run the sample script as-is

Code snippets

Exactly as from https://docs.crawl4ai.com/core/content-selection/ section 4.1


import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # Minimal schema for repeated items
    schema = {
        "name": "News Items",
        "baseSelector": "tr.athing",
        "fields": [
            {"name": "title", "selector": "a.storylink", "type": "text"},
            {
                "name": "link", 
                "selector": "a.storylink", 
                "type": "attribute", 
                "attribute": "href"
            }
        ]
    }

    config = CrawlerRunConfig(
        # Content filtering
        excluded_tags=["form", "header"],
        exclude_domains=["adsite.com"],

        # CSS selection or entire page
        css_selector="table.itemlist",

        # No caching for demonstration
        cache_mode=CacheMode.BYPASS,

        # Extraction strategy
        extraction_strategy=JsonCssExtractionStrategy(schema)
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com/newest", 
            config=config
        )
        data = json.loads(result.extracted_content)
        print("Sample extracted item:", data[:1])  # Show first item

if __name__ == "__main__":
    asyncio.run(main())

OS

MacOS

Python version

3.12.8

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response

encoded-evolution avatar Feb 10 '25 13:02 encoded-evolution

@encoded-evolution Could you update the issue and paste your code into the "Code snippets" section? Currently you have only shared a screenshot of your code, and it's hard to investigate the issue from that alone.

aravindkarnam avatar Feb 11 '25 06:02 aravindkarnam

@aravindkarnam I just updated above.

It is literally a copy/paste from https://docs.crawl4ai.com/core/content-selection/ section 4.1

That's exactly what the test was.

encoded-evolution avatar Feb 12 '25 11:02 encoded-evolution

@encoded-evolution It seems there is an issue with the structure of the page on this link. Please try the schema I've posted below; it's working on my end.

@aravindkarnam The documentation needs to be updated to reflect this.


schema = {
    "name": "News Items",
    "baseSelector": "tr.athing",
    "fields": [
        {"name": "title", "selector": "span.titleline", "type": "text"},
        {
            "name": "link",
            "selector": "span.titleline a",
            "type": "attribute",
            "attribute": "href"
        }
    ]
}

sufianuddin avatar Feb 12 '25 13:02 sufianuddin
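
A quick way to see why the original selectors came back empty is to list which tag/class pairs actually occur in the HTML being scraped. The sketch below uses only the Python standard library. `SAMPLE_ROW` is a hypothetical, simplified stand-in for one listing row, written to match the structure implied by the schema above; it is not a copy of the live page, and `ClassInventory` is an illustrative helper, not part of crawl4ai.

```python
from collections import Counter
from html.parser import HTMLParser


class ClassInventory(HTMLParser):
    """Count (tag, class) pairs seen while parsing an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the value may be None.
        class_value = dict(attrs).get("class") or ""
        for class_name in class_value.split():
            self.counts[(tag, class_name)] += 1


# Hypothetical, simplified markup for one row (an assumption, not the real page).
SAMPLE_ROW = """
<table>
  <tr class="athing submission">
    <td class="title">
      <span class="titleline"><a href="https://example.com/story">Example story</a></span>
    </td>
  </tr>
</table>
"""

inventory = ClassInventory()
inventory.feed(SAMPLE_ROW)

# Target of the old schema: no <a class="storylink"> exists in this markup.
print("a.storylink:", inventory.counts[("a", "storylink")])
# Target of the updated schema: present once per row.
print("span.titleline:", inventory.counts[("span", "titleline")])
print("tr.athing:", inventory.counts[("tr", "athing")])
```

Feeding the same class the HTML of the page you are actually targeting shows at a glance whether a class named in a schema still exists. As a side note on the selector tried earlier in the thread: in CSS, `tr.athing submission` (with a space) is a descendant selector that looks for a `<submission>` element inside `tr.athing`; a row carrying both classes would be written `tr.athing.submission`.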

@sufianuddin thanks, I am new to working with webcrawlers in general, so your help is appreciated.

I can confirm this is not a bug and this thread can be closed, with sufi's schema it works as expected.

@aravindkarnam Web page changes happen all the time and if you base documentation on a moving target, your docs will always be out of date. Recommend you show the structure that your examples are designed for instead of relying on a website never going out of date. For instance, you can provide a reference image from a browser's inspect panel as shown below. And n00bs like me will be able to better understand what your software does. (BTW: you are building a great tool here! Awesome job!)

Image

encoded-evolution avatar Feb 12 '25 21:02 encoded-evolution
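
One way to act on the suggestion above is to keep a small HTML fixture next to each documented schema and check the schema against it. The following is a minimal sketch under stated assumptions: `FIXTURE` is hypothetical markup, `missing_classes` is an illustrative helper (not a crawl4ai API), and the check is deliberately crude, since it only confirms that class names used in the selectors appear somewhere in the fixture and ignores tag names and nesting.

```python
import re
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every class name that appears in an HTML document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        self.classes.update((dict(attrs).get("class") or "").split())


def missing_classes(schema, fixture_html):
    """Return (field_name, class_name) pairs whose class is absent from the fixture."""
    collector = ClassCollector()
    collector.feed(fixture_html)
    selectors = [("baseSelector", schema["baseSelector"])]
    selectors += [(field["name"], field["selector"]) for field in schema["fields"]]
    missing = []
    for name, selector in selectors:
        # Crude extraction of ".classname" tokens; not a full CSS parser.
        for class_name in re.findall(r"\.([A-Za-z_][\w-]*)", selector):
            if class_name not in collector.classes:
                missing.append((name, class_name))
    return missing


# Hypothetical fixture showing the structure the example is written for.
FIXTURE = """
<tr class="athing submission">
  <td class="title">
    <span class="titleline"><a href="https://example.com/story">Example story</a></span>
  </td>
</tr>
"""

OLD_SCHEMA = {
    "baseSelector": "tr.athing",
    "fields": [
        {"name": "title", "selector": "a.storylink", "type": "text"},
        {"name": "link", "selector": "a.storylink", "type": "attribute", "attribute": "href"},
    ],
}

NEW_SCHEMA = {
    "baseSelector": "tr.athing",
    "fields": [
        {"name": "title", "selector": "span.titleline", "type": "text"},
        {"name": "link", "selector": "span.titleline a", "type": "attribute", "attribute": "href"},
    ],
}

print("old schema:", missing_classes(OLD_SCHEMA, FIXTURE))
print("new schema:", missing_classes(NEW_SCHEMA, FIXTURE))
```

A check like this will not catch every breakage, but it makes the structure an example depends on explicit, and it flags a stale class name before a reader hits an empty result.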

@sufianuddin Thanks for updating the example. Great job! I've updated the example based on your input.

"Web page changes happen all the time and if you base documentation on a moving target, your docs will always be out of date"

@encoded-evolution That's true. But we want to give real life examples that are both useful and interesting. We have a vibrant community that's taking care of our documentation, so I'm confident that with time we'll be able to keep up with changes. Thanks for trying Crawl4AI, keep coming back! 🙌🏼

aravindkarnam avatar Feb 14 '25 13:02 aravindkarnam

Updated documentation is now available at https://docs.crawl4ai.com/core/content-selection/

aravindkarnam avatar Mar 04 '25 12:03 aravindkarnam