crawl4ai
[Bug]: Sitemap URL resolution trivially passes for common URLs that should fail
crawl4ai version
0.7.6
Expected Behavior
AsyncUrlSeeder._resolve_head(url) should return None when url doesn't resolve
Current Behavior
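AsyncUrlSeeder._resolve_head(url) returns a URL instead of None when url doesn't resolve: any 3xx response with a Location header is accepted without checking that the redirect target itself resolves.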
async def _resolve_head(self, url: str) -> Optional[str]:
    """
    HEAD-probe a URL.

    Returns:
        * the same URL if it answers 2xx,
        * the absolute redirect target if it answers 3xx,
        * None on any other status or network error.
    """
    try:
        r = await self.client.head(url, timeout=10, follow_redirects=False)
        # direct hit
        if 200 <= r.status_code < 300:
            return str(r.url)
        # single level redirect
        # passes trivially on http -> https redirects, e.g. "http://www.youtube.com/sitemap.xml" -> "https://www.youtube.com/sitemap.xml"
        # even though "https://www.youtube.com/sitemap.xml" doesn't resolve
        # passes if redirect description is misconfigured, e.g. "https://www.stripe.com/sitemap.xml" -> "https://www.stripe.com/sitemap.xml"
        # even though `response.headers.get("location") == "https://www.stripe.com/sitemap.xml"`
        if r.status_code in (301, 302, 303, 307, 308):
            loc = r.headers.get("location")
            if loc:
                return urljoin(url, loc)
        return None
    except Exception as e:
        self._log("debug", "HEAD {url} failed: {err}",
                  params={"url": url, "err": str(e)}, tag="URL_SEED")
        return None
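One possible direction for a fix, as a minimal sketch rather than a patch against crawl4ai: keep probing the redirect target until it answers 2xx, and give up on loops, dead targets, or too many hops. To keep the example runnable offline, the HTTP call is abstracted behind a hypothetical `head` callable that returns `(status_code, location)`. The names `resolve_head_strict`, `HeadFn`, `fake_head`, and the `.test` hosts are assumptions made for illustration and are not part of the library.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Tuple
from urllib.parse import urljoin

# Hypothetical probe signature: URL in, (status code, Location header or None) out.
HeadFn = Callable[[str], Awaitable[Tuple[int, Optional[str]]]]

REDIRECT_CODES = {301, 302, 303, 307, 308}


async def resolve_head_strict(url: str, head: HeadFn, max_hops: int = 5) -> Optional[str]:
    """Return the final URL only if the redirect chain ends in a 2xx response."""
    seen = set()
    current = url
    for _ in range(max_hops + 1):
        if current in seen:
            return None  # redirect loop, e.g. a URL that redirects to itself
        seen.add(current)
        try:
            status, location = await head(current)
        except Exception:
            return None  # network error
        if 200 <= status < 300:
            return current
        if status in REDIRECT_CODES and location:
            # Location may be relative, so resolve it against the current URL.
            current = urljoin(current, location)
            continue
        return None  # any other status, or a redirect without Location
    return None  # too many hops


# Stand-in responses that mimic the two failure modes described above.
FAKE_RESPONSES = {
    "http://a.test/sitemap.xml": (301, "https://a.test/sitemap.xml"),
    "https://a.test/sitemap.xml": (404, None),   # http -> https, target missing
    "https://b.test/sitemap.xml": (301, "https://b.test/sitemap.xml"),  # self-redirect
    "http://c.test/sitemap.xml": (301, "https://c.test/sitemap.xml"),
    "https://c.test/sitemap.xml": (200, None),   # healthy redirect
}


async def fake_head(url: str) -> Tuple[int, Optional[str]]:
    if url not in FAKE_RESPONSES:
        raise ConnectionError(f"cannot reach {url}")
    return FAKE_RESPONSES[url]
```

With these stand-in responses, the first two URLs resolve to None and only the healthy redirect returns its https target. In the real method, the `head` callable would presumably wrap the existing `self.client.head(url, timeout=10, follow_redirects=False)` call.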
Is this reproducible?
Yes
Inputs Causing the Bug
I tested `AsyncUrlSeeder` with "https://www.stripe.com" and "https://www.youtube.com". I suspect a large fraction of URLs are affected by this behaviour.
Steps to Reproduce
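Construct an `AsyncUrlSeeder` and call `_resolve_head` on the sitemap URLs below. The assertions encode the expected behavior and currently fail.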
Code snippets
import asyncio

from crawl4ai import AsyncUrlSeeder, SeedingConfig


async def crawl() -> list[str]:
    seeding = SeedingConfig(
        source="sitemap",
        extract_head=False,
        filter_nonsense_urls=True,
        max_urls=20,
        verbose=False,
        force=True,
    )
    seeder = AsyncUrlSeeder()
    assert await seeder._resolve_head("http://youtube.com/sitemap.xml") is None
    assert await seeder._resolve_head("https://stripe.com/sitemap.xml") is None


asyncio.run(crawl())
OS
MacOS
Python version
Python 3.11.9
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
No response