
[Bug]: Title is null for some websites

Open sunilatrb opened this issue 7 months ago • 3 comments

crawl4ai version

crawl4ai/_version.py version = "0.5.0.post8"

Expected Behavior

The metadata object should include the title, as shown below:

"metadata": {
  "title": "Where's my refund? | Internal Revenue Service",
  "description": "See your personalized refund date as soon as the IRS processes your tax return and approves your refund. See your status starting around 24 hours after you e-file or 4 weeks after you mail a paper return.",
  "keywords": null,
  "author": null,
  "og:image:url": "https://www.irs.gov/pub/image/logo_small.jpg",
  "og:image:type": "image/jpeg",
  "og:image:alt": "IRS logo",
  "twitter:card": "summary",
  "twitter:description": "See your personalized refund date as soon as the IRS processes your tax return and approves your refund. See your status starting around 24 hours after you e-file or 4 weeks after you mail a paper return.",
  "twitter:title": "Where's my refund? | Internal Revenue Service",
  "twitter:image": "https://www.irs.gov/pub/image/logo_small.jpg",
  "twitter:image:alt": "IRS logo"
}

Current Behavior

"metadata": {
  "title": null,
  "description": "See your personalized refund date as soon as the IRS processes your tax return and approves your refund. See your status starting around 24 hours after you e-file or 4 weeks after you mail a paper return.",
  "keywords": null,
  "author": null,
  "og:image:url": "https://www.irs.gov/pub/image/logo_small.jpg",
  "og:image:type": "image/jpeg",
  "og:image:alt": "IRS logo",
  "twitter:card": "summary",
  "twitter:description": "See your personalized refund date as soon as the IRS processes your tax return and approves your refund. See your status starting around 24 hours after you e-file or 4 weeks after you mail a paper return.",
  "twitter:title": "Where's my refund? | Internal Revenue Service",
  "twitter:image": "https://www.irs.gov/pub/image/logo_small.jpg",
  "twitter:image:alt": "IRS logo"
}

Is this reproducible?

Yes

Inputs Causing the Bug

Any URL from https://www.irs.gov/, for example:
https://www.irs.gov/wheres-my-refund

Steps to Reproduce

Run the provided code snippet.

Code snippets

# script.py
import asyncio
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig


async def main():
    browser_config = BrowserConfig(verbose=False)  # Default browser configuration
    run_config = CrawlerRunConfig(verbose=False)   # Default crawl run configuration

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.irs.gov/wheres-my-refund",
            config=run_config
        )

    # Print the extracted page metadata as JSON
    output = {}
    output["metadata"] = result.metadata
    print(json.dumps(output))


if __name__ == "__main__":
    asyncio.run(main())

OS

macOS

Python version

3.13.2

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response

sunilatrb, Apr 17 '25 04:04

RCA

The WebScrapingStrategy uses BeautifulSoup with the 'lxml' parser by default. When processing the irs.gov HTML, this parser fails to include the <title> tag in the parsed soup.head object, so the title lookup returns nothing.

cc @aravindkarnam
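
If the parser leaves <title> out of soup.head, one possible workaround is to look for the element anywhere in the document instead of only under the head. The sketch below illustrates that idea with Python's standard-library html.parser only. It is not crawl4ai's implementation, and TitleFinder and find_title are hypothetical names introduced here.

```python
from html.parser import HTMLParser


class TitleFinder(HTMLParser):
    """Collects the text of the first <title> element, wherever it appears."""

    def __init__(self):
        super().__init__()
        self._in_title = False   # True while we are inside the first <title>
        self._done = False       # True once the first <title> has been closed
        self._parts = []         # Text chunks seen inside the title

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        text = "".join(self._parts).strip()
        return text or None


def find_title(html):
    """Return the first <title> text in the document, or None if absent."""
    finder = TitleFinder()
    finder.feed(html)
    finder.close()
    return finder.title
```

For example, find_title("<html><body><title>Example</title></body></html>") returns "Example" even though the tag is not inside a head element, and a document with no title yields None.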

ntohidi, May 12 '25 09:05

Can we have a config option that allows using a different parser for BeautifulSoup?

huongphamx, May 21 '25 05:05

OK, for anyone who wants to use a different parser for BeautifulSoup, you can pass it like this: crawler.arun(parser='html.parser').

But the original question is about handling sites with invalid HTML, so you need another way to handle that situation.
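
As one possible stopgap on the caller's side, the reported metadata still carries "twitter:title" even when "title" is null, so a small fallback can recover a usable title. This is only a sketch: resolve_title is a hypothetical helper, not part of crawl4ai, and the "og:title" key is an assumption, since it does not appear in the metadata shown in this issue.

```python
def resolve_title(metadata):
    """Return the first non-empty title-like value from a metadata dict, or None."""
    # "title" is preferred; the social-card keys are fallbacks.
    # "og:title" is an assumed key, not present in the output reported above.
    for key in ("title", "og:title", "twitter:title"):
        value = metadata.get(key)
        if isinstance(value, str) and value.strip():
            return value.strip()
    return None


# Example with the relevant fields from the reported output
reported = {
    "title": None,
    "twitter:title": "Where's my refund? | Internal Revenue Service",
}
print(resolve_title(reported))
```

This does not fix the parsing problem itself. It only avoids a null title when the page supplies the same text through its social-card meta tags.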

huongphamx, May 21 '25 06:05

The issue has been fixed in the develop branch and will be released soon.

ntohidi, Aug 04 '25 11:08

It has been released and is now in the main branch.

ntohidi, Aug 10 '25 03:08