
[Bug]: Sitemap URL resolution passes trivially for common URLs that should fail

Open sfrey1 opened this issue 4 weeks ago • 1 comment

crawl4ai version

0.7.6

Expected Behavior

AsyncUrlSeeder._resolve_head(url) should return None when the URL doesn't resolve.

Current Behavior

AsyncUrlSeeder._resolve_head(url) doesn't return None when the URL doesn't resolve.

async def _resolve_head(self, url: str) -> Optional[str]:
        """
        HEAD-probe a URL.

        Returns:
            * the same URL if it answers 2xx,
            * the absolute redirect target if it answers 3xx,
            * None on any other status or network error.
        """
        try:
            r = await self.client.head(url, timeout=10, follow_redirects=False)

            # direct hit
            if 200 <= r.status_code < 300:
                return str(r.url)

            # single-level redirect handling has two holes:

            # 1) it returns the redirect target without probing it, so an
            #    http -> https redirect passes trivially, e.g.
            #    "http://www.youtube.com/sitemap.xml" -> "https://www.youtube.com/sitemap.xml"
            #    even though "https://www.youtube.com/sitemap.xml" doesn't resolve

            # 2) it accepts a self-referential Location when the redirect
            #    destination is misconfigured, e.g. probing "https://www.stripe.com/sitemap.xml"
            #    returns the same URL because
            #    `response.headers.get("location") == "https://www.stripe.com/sitemap.xml"`

            if r.status_code in (301, 302, 303, 307, 308):
                loc = r.headers.get("location")
                if loc:
                    return urljoin(url, loc)

            return None

        except Exception as e:
            self._log("debug", "HEAD {url} failed: {err}",
                      params={"url": url, "err": str(e)}, tag="URL_SEED")
            return None

Is this reproducible?

Yes

Inputs Causing the Bug

I tested the `AsyncUrlSeeder` using "https://www.stripe.com" and "https://www.youtube.com". I'm sure a large fraction of URLs are afflicted by this behaviour.

Steps to Reproduce

Construct an `AsyncUrlSeeder` and call `_resolve_head` on a sitemap URL that redirects but never answers 2xx.

Code snippets

import asyncio

from crawl4ai import AsyncUrlSeeder

async def crawl() -> None:
    seeder = AsyncUrlSeeder()
    # Both assertions fail: _resolve_head returns a URL for redirect
    # targets that never answer 2xx.
    assert await seeder._resolve_head("http://youtube.com/sitemap.xml") is None
    assert await seeder._resolve_head("https://stripe.com/sitemap.xml") is None

asyncio.run(crawl())

OS

MacOS

Python version

Python 3.11.9

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response

sfrey1 • Nov 11 '25