crawl4ai [Bug]: Sitemap url resolution passes trivially in common urls that should fail

Open sfrey1 opened this issue 4 weeks ago • 1 comments

crawl4ai version

0.7.6

Expected Behavior

AsyncUrlSeeder._resolve_head(url) should return None when url doesn't resolve

Current Behavior

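AsyncUrlSeeder._resolve_head(url) doesn't return None when url doesn't resolve. On any 3xx response it returns the `Location` target without probing it, so a URL that redirects to a dead (or identical) URL still counts as resolved.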

async def _resolve_head(self, url: str) -> Optional[str]:
        """
        HEAD-probe a URL.

        Returns:
            * the same URL if it answers 2xx,
            * the absolute redirect target if it answers 3xx,
            * None on any other status or network error.
        """
        try:
            r = await self.client.head(url, timeout=10, follow_redirects=False)

            # direct hit
            if 200 <= r.status_code < 300:
                return str(r.url)

            # single level redirect

            # NOTE (reporter): passes trivially on http -> https redirects, e.g.
            # "http://www.youtube.com/sitemap.xml" -> "https://www.youtube.com/sitemap.xml",
            # even though "https://www.youtube.com/sitemap.xml" doesn't resolve.

            # NOTE (reporter): also passes when the redirect is misconfigured and points
            # back at the requested URL, e.g. "https://www.stripe.com/sitemap.xml" answers
            # with `r.headers.get("location") == "https://www.stripe.com/sitemap.xml"`.

            if r.status_code in (301, 302, 303, 307, 308):
                loc = r.headers.get("location")
                if loc:
                    return urljoin(url, loc)

            return None

        except Exception as e:
            self._log("debug", "HEAD {url} failed: {err}",
                      params={"url": url, "err": str(e)}, tag="URL_SEED")
            return None
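
A possible direction for a fix, as a minimal sketch rather than a patch against crawl4ai: follow redirects hop by hop and only accept a URL once it actually answers 2xx, treating a redirect back to an already-visited URL as a failure. To keep it runnable offline, the HEAD request is injected as a callable instead of using `self.client`. The names `resolve_head_strict`, `HeadFn`, `max_hops`, `FAKE_RESPONSES` and `fake_head` are hypothetical, and the `*.example` hosts are made-up stand-ins for the two failure cases described in the comments above.

```python
import asyncio
from typing import Awaitable, Callable, Mapping, Optional, Tuple
from urllib.parse import urljoin

# Hypothetical shape for this sketch: a HEAD probe yields (status_code, headers).
# Header names are assumed to be lowercase here; a real HTTP client's header
# mapping is usually case-insensitive.
HeadResult = Tuple[int, Mapping[str, str]]
HeadFn = Callable[[str], Awaitable[HeadResult]]

REDIRECT_CODES = (301, 302, 303, 307, 308)


async def resolve_head_strict(url: str, head: HeadFn, max_hops: int = 5) -> Optional[str]:
    """Return the final URL only if it answers 2xx, following up to max_hops redirects."""
    seen = set()
    current = url
    for _ in range(max_hops + 1):
        if current in seen:
            # Redirect loop, including a URL whose Location header points at itself.
            return None
        seen.add(current)
        try:
            status, headers = await head(current)
        except Exception:
            # Network error or unreachable host counts as "does not resolve".
            return None
        if 200 <= status < 300:
            return current
        if status in REDIRECT_CODES:
            loc = headers.get("location")
            if not loc:
                return None
            # Location may be relative, so resolve it against the current URL.
            current = urljoin(current, loc)
            continue
        return None
    # Too many redirects.
    return None


# Canned responses standing in for real servers (made-up hosts).
FAKE_RESPONSES = {
    # http -> https redirect whose target does not exist
    "http://a.example/sitemap.xml": (301, {"location": "https://a.example/sitemap.xml"}),
    "https://a.example/sitemap.xml": (404, {}),
    # misconfigured redirect that points back at itself
    "https://b.example/sitemap.xml": (301, {"location": "https://b.example/sitemap.xml"}),
    # healthy relative redirect to a sitemap that exists
    "http://c.example/sitemap.xml": (302, {"location": "/real/sitemap.xml"}),
    "http://c.example/real/sitemap.xml": (200, {}),
}


async def fake_head(url: str) -> HeadResult:
    if url not in FAKE_RESPONSES:
        raise ConnectionError(f"no route to {url}")
    return FAKE_RESPONSES[url]


if __name__ == "__main__":
    for u in ("http://a.example/sitemap.xml",
              "https://b.example/sitemap.xml",
              "http://c.example/sitemap.xml"):
        print(u, "->", asyncio.run(resolve_head_strict(u, fake_head)))
```

With this approach the first two cases come back as None and only the third yields a URL. The cost is one extra HEAD request per redirect hop, which seems acceptable for sitemap discovery.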

Is this reproducible?

Yes

Inputs Causing the Bug

I tested `AsyncUrlSeeder` against "https://www.stripe.com" and "https://www.youtube.com". I suspect a large fraction of URLs are affected by this behaviour.

Steps to Reproduce

Construct a seeder and call `_resolve_head` on a sitemap URL that redirects but does not resolve.

Code snippets

import asyncio

from crawl4ai import AsyncUrlSeeder, SeedingConfig

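async def crawl() -> None:
    # Seeding config from my setup; `_resolve_head` itself does not use it.
    seeding = SeedingConfig(
        source="sitemap",
        extract_head=False,
        filter_nonsense_urls=True,
        max_urls=20,
        verbose=False,
        force=True,
    )

    seeder = AsyncUrlSeeder()
    # Expected: both calls return None because neither sitemap resolves.
    # Actual (0.7.6): both return a redirect target, so these assertions fail.
    assert await seeder._resolve_head("http://youtube.com/sitemap.xml") is None
    assert await seeder._resolve_head("https://stripe.com/sitemap.xml") is None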

asyncio.run(crawl())

OS

MacOS

Python version

Python 3.11.9

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response

Nov 11 '25 23:11 sfrey1
