[Bug]: Title is returned as null for some websites
crawl4ai version
crawl4ai/_version.py version = "0.5.0.post8"
Expected Behavior
The metadata object should include the title, as below:
"metadata": { "title": "Where's my refund? | Internal Revenue Service", "description": "See your personalized refund date as soon as the IRS processes your tax return and approves your refund. See your status starting around 24 hours after you e-file or 4 weeks after you mail a paper return.", "keywords": null, "author": null, "og:image:url": "https://www.irs.gov/pub/image/logo_small.jpg", "og:image:type": "image/jpeg", "og:image:alt": "IRS logo", "twitter:card": "summary", "twitter:description": "See your personalized refund date as soon as the IRS processes your tax return and approves your refund. See your status starting around 24 hours after you e-file or 4 weeks after you mail a paper return.", "twitter:title": "Where's my refund? | Internal Revenue Service", "twitter:image": "https://www.irs.gov/pub/image/logo_small.jpg", "twitter:image:alt": "IRS logo" }
Current Behavior
"metadata": { "title": null, "description": "See your personalized refund date as soon as the IRS processes your tax return and approves your refund. See your status starting around 24 hours after you e-file or 4 weeks after you mail a paper return.", "keywords": null, "author": null, "og:image:url": "https://www.irs.gov/pub/image/logo_small.jpg", "og:image:type": "image/jpeg", "og:image:alt": "IRS logo", "twitter:card": "summary", "twitter:description": "See your personalized refund date as soon as the IRS processes your tax return and approves your refund. See your status starting around 24 hours after you e-file or 4 weeks after you mail a paper return.", "twitter:title": "Where's my refund? | Internal Revenue Service", "twitter:image": "https://www.irs.gov/pub/image/logo_small.jpg", "twitter:image:alt": "IRS logo" }
Is this reproducible?
Yes
Inputs Causing the Bug
Any URL from https://www.irs.gov/, for example:
https://www.irs.gov/wheres-my-refund
Steps to Reproduce
Run the provided code snippet.
Code snippets
# script.py
import asyncio
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig


async def main():
    browser_config = BrowserConfig(verbose=False)  # Default browser configuration
    run_config = CrawlerRunConfig(verbose=False)  # Default crawl run configuration
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.irs.gov/wheres-my-refund",
            config=run_config,
        )
    # Print the extracted page metadata as JSON
    output = {"metadata": result.metadata}
    print(json.dumps(output))


# `async with` and `await` are only valid inside a coroutine, so run via asyncio
asyncio.run(main())
OS
macOS
Python version
3.13.2
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
No response
RCA
The WebScrapingStrategy uses BeautifulSoup with the 'lxml' parser by default. This specific parser is failing to correctly identify or include the <title> tag in the parsed soup.head object when processing the "irs.gov" HTML.
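To check this kind of parser difference outside the crawler, one option is to parse the same HTML with several BeautifulSoup parsers and compare what each one finds under head against what it finds anywhere in the document. The sketch below is only a diagnostic aid, not crawl4ai code; the helper name title_by_parser is hypothetical, and it assumes beautifulsoup4 is installed (lxml is optional and skipped when missing).

```python
from bs4 import BeautifulSoup, FeatureNotFound


def title_by_parser(html, parsers=("lxml", "html.parser")):
    """Return {parser: (title found under <head>, title found anywhere)}."""
    report = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # The parser library (e.g. lxml) is not installed; skip it.
            continue
        # Title as seen through soup.head, which is what the RCA points at.
        head_title = soup.head.title if soup.head is not None else None
        # Title found by searching the whole tree, regardless of position.
        any_title = soup.find("title")
        report[name] = (
            head_title.get_text(strip=True) if head_title is not None else None,
            any_title.get_text(strip=True) if any_title is not None else None,
        )
    return report
```

Feeding it the raw HTML of an affected page (for example, the response body for https://www.irs.gov/wheres-my-refund) should show whether a given parser loses the title only under head or everywhere.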
cc @aravindkarnam
Can we have a config option that allows using a different parser for BeautifulSoup?
OK, for anyone who wants to use a different parser for BeautifulSoup, you can pass it like this: crawler.arun(parser='html.parser').
But the original question is about handling sites with invalid HTML, so you need another way to handle that situation.
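One possible way to cope with invalid markup, independent of which BeautifulSoup parser is configured, is a fallback that looks for the first title element anywhere in the document rather than only under head. The sketch below uses only the Python standard library; TitleFinder and find_title are hypothetical names, and this is not necessarily how the fix in crawl4ai works.

```python
from html.parser import HTMLParser


class TitleFinder(HTMLParser):
    """Collects the text of the first <title> element, wherever it appears."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._chunks = []
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> counts; later ones (e.g. in inline SVG) are ignored.
        if tag == "title" and self.title is None:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.title = "".join(self._chunks).strip()

    def handle_data(self, data):
        if self._in_title:
            self._chunks.append(data)


def find_title(html):
    """Return the first title text in the document, or None if there is none."""
    finder = TitleFinder()
    finder.feed(html)
    finder.close()
    return finder.title
```

A caller could use find_title(html) only when the primary extraction returns a null title, so that well-formed pages keep their existing behavior.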
The issue has been fixed in the develop branch and will be released soon.
It has been released and is now in the main branch.