🐛 [Bug]: Sitemap ingestion fails when XML contains relative URLs; task disappears from UI without error

Open • thiagomaf opened this issue 2 months ago • 1 comment

Archon Version

v0.1.0

Bug Severity

🟠 High - Blocks important features

Bug Description

When attempting to ingest a sitemap.xml that contains relative URLs (e.g., /docs/apps/ instead of https://example.com/docs/apps/), the ingestion fails silently. The task disappears from the UI without any error message, but the server logs show URL validation errors.

Some sitemap generators output relative URLs to keep the XML portable across domains, even though the sitemaps.org protocol expects each <loc> to be a fully qualified URL, so this scenario does come up in practice.

Steps to Reproduce

  1. Go to Knowledge Base page
  2. Click "Add Knowledge"
  3. Enter sitemap URL: https://waha.devlike.pro/docs/sitemap.xml
  4. Click "Add Source"
  5. Task starts, then disappears from UI
  6. Check Docker logs to see actual error

Expected Behavior

Option 1 (Preferred):

  • Sitemap parser should compose absolute URLs from the base URL + relative paths
  • Example: https://waha.devlike.pro + /docs/apps/ = https://waha.devlike.pro/docs/apps/
  • Crawling should proceed successfully
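
The composition described in Option 1 is what Python's standard `urllib.parse.urljoin` does. A minimal sketch, assuming the sitemap's own URL is used as the base (the issue does not say which base Archon would use):

```python
from urllib.parse import urljoin

# Assumption: the sitemap's own URL serves as the base for resolution.
base = "https://waha.devlike.pro/docs/sitemap.xml"

root_relative = urljoin(base, "/docs/apps/")   # path starting with "/"
doc_relative = urljoin(base, "apps/")          # path relative to the sitemap's directory
already_absolute = urljoin(base, "https://example.com/docs/apps/")

print(root_relative)     # https://waha.devlike.pro/docs/apps/
print(doc_relative)      # https://waha.devlike.pro/docs/apps/
print(already_absolute)  # https://example.com/docs/apps/
```

Absolute URLs pass through unchanged, so the call could be applied to every <loc> value without special-casing relative ones.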

Option 2 (Fallback):

  • If relative URLs can't be handled, show a clear error in the UI
  • Error message: "Sitemap contains relative URLs. Please provide a sitemap with absolute URLs or use the base URL directly."
  • Task should remain visible with error state, not disappear

Actual Behavior

  • Sitemap parser extracts relative URLs as-is
  • Crawler receives /docs/apps/ and rejects it (not a valid URL)
  • Task disappears from UI without any error notification
  • User has no indication of what went wrong

Error Details (if any)

[ERROR]... × /docs/apps/                                        | Error: 

Unexpected error in _crawl_web at line 500 in crawl 
(../venv/lib/python3.12/site-packages/crawl4ai/async_crawler_strategy.py):

Error: URL must start with 'http://', 'https://', 'file://', or 'raw:'

Code context:
 495                   status_code=status_code,
 496                   screenshot=screenshot_data,
 497                   get_delayed_content=None,
 498               )
 499           else:
 500 →             raise ValueError(
 501                   "URL must start with 'http://', 'https://', 'file://', or 'raw:'"
 502               )
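
For anyone reproducing this without a full crawl, the rejection can be approximated with a small pre-flight check. This is a sketch built only from the prefixes listed in the error message; it is not crawl4ai's actual code, and the helper name `is_crawlable` is hypothetical:

```python
# Prefixes copied from the error message above.
ACCEPTED_PREFIXES = ("http://", "https://", "file://", "raw:")


def is_crawlable(url: str) -> bool:
    """Return True if the URL starts with one of the prefixes named in the error."""
    return url.startswith(ACCEPTED_PREFIXES)


print(is_crawlable("/docs/apps/"))                          # False
print(is_crawlable("https://waha.devlike.pro/docs/apps/"))  # True
```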

Affected Component

🔍 Knowledge Base / RAG

Browser & OS

Edge on Windows 10

Additional Context

Example Sitemap Entry

<url>
  <loc>/docs/apps/</loc>
  <lastmod>2020-10-06T08:48:45+00:00</lastmod>
  <changefreq>monthly</changefreq>
  <priority>0.5</priority>
</url>

Affected Components

  • 🔍 Knowledge Base / RAG (sitemap parsing logic)
  • 🖥️ Frontend UI (error handling and display)

Technical Notes

Root Causes:

  1. Sitemap parser (python/src/server/services/crawling/strategies/sitemap.py) likely extracts <loc> content verbatim without checking if it's relative
  2. URL composition logic missing: No code to combine base URL with relative paths
  3. UI error handling: Failed ingestion tasks disappear instead of showing error state

Suggested Fix:

  • Use urllib.parse.urljoin(base_url, relative_url) to compose absolute URLs
  • Validate composed URLs before passing to crawler
  • Surface errors in the UI instead of failing silently
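
The first two points of the suggested fix could look roughly like the following. It is a standalone sketch, not the code in sitemap.py: the function name `resolve_sitemap_urls` and its return shape are hypothetical, and it assumes the sitemap's own URL is available as the base and that only http(s) URLs should reach the crawler.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin, urlparse


def resolve_sitemap_urls(sitemap_xml: str, sitemap_url: str) -> tuple[list[str], list[str]]:
    """Return (valid absolute URLs, rejected raw <loc> values)."""
    root = ET.fromstring(sitemap_xml)
    valid: list[str] = []
    rejected: list[str] = []
    for element in root.iter():
        # Match <loc> with or without an XML namespace, e.g. "{http://...}loc".
        if element.tag.rsplit("}", 1)[-1] != "loc":
            continue
        raw = (element.text or "").strip()
        if not raw:
            continue
        # Relative values are resolved against the sitemap URL;
        # absolute values come back unchanged.
        absolute = urljoin(sitemap_url, raw)
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            valid.append(absolute)
        else:
            rejected.append(raw)
    return valid, rejected


sample = """<urlset>
  <url><loc>/docs/apps/</loc></url>
  <url><loc>https://waha.devlike.pro/docs/overview/</loc></url>
</urlset>"""

urls, bad = resolve_sitemap_urls(sample, "https://waha.devlike.pro/docs/sitemap.xml")
print(urls)
print(bad)
```

Returning the rejected values separately, rather than dropping them, would give the caller something concrete to report in the UI for the third point.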

Related Issues

  • #130 - Surface crawling errors in UI (closed, but gap remains for this case)
  • #607 - Different issue (URLs containing "sitemap" in path)

Service Status (check all that are working)

  • [x] 🖥️ Frontend UI (http://localhost:3737)
  • [x] ⚙️ Main Server (http://localhost:8181)
  • [x] 🔗 MCP Service (localhost:8051)
  • [ ] 🤖 Agents Service (http://localhost:8052)
  • [ ] 💾 Supabase Database (connected)

thiagomaf • Oct 27 '25 09:10

Thanks for reporting. I just checked and can verify the error/problem.

leex279 • Nov 07 '25 20:11