🐛 [Bug]: Sitemap ingestion fails when XML contains relative URLs; task disappears from UI without error

Open thiagomaf opened this issue 2 months ago • 1 comment

Archon Version

v0.1.0

Bug Severity

🟠 High - Blocks important features

Bug Description

When attempting to ingest a sitemap.xml that contains relative URLs (e.g., /docs/apps/ instead of https://example.com/docs/apps/), the ingestion fails silently. The task disappears from the UI without any error message, but the server logs show URL validation errors.

The sitemap protocol expects absolute URLs in <loc>, but some sitemap generators output relative URLs to keep the XML portable across domains, so this scenario does come up in practice.

Steps to Reproduce

1. Go to the Knowledge Base page
2. Click "Add Knowledge"
3. Enter the sitemap URL: https://waha.devlike.pro/docs/sitemap.xml
4. Click "Add Source"
5. The task starts, then disappears from the UI
6. Check the Docker logs to see the actual error

Expected Behavior

Option 1 (Preferred):

Sitemap parser should compose absolute URLs from the base URL + relative paths
Example: https://waha.devlike.pro + /docs/apps/ = https://waha.devlike.pro/docs/apps/
Crawling should proceed successfully
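
For illustration, the composition described in Option 1 could look roughly like the sketch below. It uses the standard library's urllib.parse.urljoin; the helper name to_absolute is hypothetical, and using the sitemap's own URL as the base is an assumption rather than something Archon is known to do.

```python
from urllib.parse import urljoin

# Assumption: the URL the user entered for the sitemap serves as the base.
sitemap_url = "https://waha.devlike.pro/docs/sitemap.xml"


def to_absolute(base_url: str, loc: str) -> str:
    """Resolve a <loc> value against base_url.

    Relative paths are joined onto the base; values that are already
    absolute are returned unchanged by urljoin.
    """
    return urljoin(base_url, loc.strip())


print(to_absolute(sitemap_url, "/docs/apps/"))
# https://waha.devlike.pro/docs/apps/
print(to_absolute(sitemap_url, "https://example.com/docs/apps/"))
# https://example.com/docs/apps/
```

Because urljoin leaves absolute URLs untouched, the same call can be applied to every entry without first checking whether it is relative.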

Option 2 (Fallback):

If relative URLs can't be handled, show clear error in UI
Error message: "Sitemap contains relative URLs. Please provide a sitemap with absolute URLs or use the base URL directly."
Task should remain visible with error state, not disappear
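
A rough sketch of the fallback in Option 2, assuming a simplified task record. IngestionTask, its fields, and check_sitemap are hypothetical names for illustration and are not Archon's actual task model; the point is only that the task keeps a failed state and a message instead of being dropped.

```python
from dataclasses import dataclass
from typing import Optional

RELATIVE_URL_MESSAGE = (
    "Sitemap contains relative URLs. Please provide a sitemap with "
    "absolute URLs or use the base URL directly."
)


@dataclass
class IngestionTask:
    """Hypothetical stand-in for whatever record backs a task in the UI."""
    source_url: str
    status: str = "running"
    error: Optional[str] = None


def has_relative_urls(locs: list[str]) -> bool:
    """True if any <loc> value lacks an http:// or https:// prefix."""
    return any(
        not loc.strip().lower().startswith(("http://", "https://"))
        for loc in locs
    )


def check_sitemap(task: IngestionTask, locs: list[str]) -> IngestionTask:
    """Mark the task as failed (rather than removing it) on relative URLs."""
    if has_relative_urls(locs):
        task.status = "failed"
        task.error = RELATIVE_URL_MESSAGE
    return task


task = check_sitemap(
    IngestionTask("https://waha.devlike.pro/docs/sitemap.xml"),
    ["/docs/apps/"],
)
print(task.status, "-", task.error)
```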

Actual Behavior

Sitemap parser extracts relative URLs as-is
Crawler receives /docs/apps/ and rejects it (not a valid URL)
Task disappears from UI without any error notification
User has no indication of what went wrong

Error Details (if any)

[ERROR]... × /docs/apps/                                        | Error: 

Unexpected error in _crawl_web at line 500 in crawl 
(../venv/lib/python3.12/site-packages/crawl4ai/async_crawler_strategy.py):

Error: URL must start with 'http://', 'https://', 'file://', or 'raw:'

Code context:
 495                   status_code=status_code,
 496                   screenshot=screenshot_data,
 497                   get_delayed_content=None,
 498               )
 499           else:
 500 →             raise ValueError(
 501                   "URL must start with 'http://', 'https://', 'file://', or 'raw:'"
 502               )

Affected Component

🔍 Knowledge Base / RAG

Browser & OS

Edge on Windows 10

Additional Context

Example Sitemap Entry

<url>
  <loc>/docs/apps/</loc>
  <lastmod>2020-10-06T08:48:45+00:00</lastmod>
  <changefreq>monthly</changefreq>
  <priority>0.5</priority>
</url>
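
To make the entry above concrete, here is a minimal sketch of reading <loc> values from such a sitemap and resolving them against the sitemap's URL. It uses only the standard library; extract_sitemap_urls is a hypothetical helper and not the code in Archon's sitemap.py, and the enclosing <urlset> element with the sitemaps.org namespace is assumed since the issue shows only the <url> entry.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin


def extract_sitemap_urls(xml_text: str, sitemap_url: str) -> list[str]:
    """Return absolute URLs for every non-empty <loc> element in the sitemap."""
    root = ET.fromstring(xml_text)
    urls = []
    for element in root.iter():
        # Drop a namespace prefix such as
        # "{http://www.sitemaps.org/schemas/sitemap/0.9}loc" -> "loc".
        # Note: this also matches <loc> elements from extension namespaces.
        tag = element.tag.rsplit("}", 1)[-1]
        if tag == "loc" and element.text and element.text.strip():
            urls.append(urljoin(sitemap_url, element.text.strip()))
    return urls


# Assumed wrapper around the entry from the issue.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>/docs/apps/</loc>
    <lastmod>2020-10-06T08:48:45+00:00</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>"""

print(extract_sitemap_urls(SAMPLE, "https://waha.devlike.pro/docs/sitemap.xml"))
# ['https://waha.devlike.pro/docs/apps/']
```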

Affected Components

🔍 Knowledge Base / RAG (sitemap parsing logic)
🖥️ Frontend UI (error handling and display)

Technical Notes

Root Causes:

Sitemap parser (python/src/server/services/crawling/strategies/sitemap.py) likely extracts <loc> content verbatim without checking if it's relative
URL composition logic missing: No code to combine base URL with relative paths
UI error handling: Failed ingestion tasks disappear instead of showing error state

Suggested Fix:

Use urllib.parse.urljoin(base_url, relative_url) to compose absolute URLs
Validate composed URLs before passing to crawler
Surface errors to UI instead of silent failure
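
The first two points of the suggested fix might be combined along these lines. This is a sketch under assumptions: resolve_and_validate is a hypothetical helper, and restricting sitemap entries to http/https is a choice made here (the crawl4ai error above also mentions 'file://' and 'raw:' prefixes). Returning the rejected entries, rather than raising on the first one, gives the caller something to report to the UI.

```python
from urllib.parse import urljoin, urlparse


def resolve_and_validate(base_url: str, locs: list[str]) -> tuple[list[str], list[str]]:
    """Split <loc> values into crawlable absolute URLs and rejected entries."""
    valid, rejected = [], []
    for loc in locs:
        raw = loc.strip()
        if not raw:
            # An empty <loc> would otherwise resolve to the base URL itself.
            rejected.append(loc)
            continue
        candidate = urljoin(base_url, raw)
        parsed = urlparse(candidate)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            valid.append(candidate)
        else:
            rejected.append(loc)
    return valid, rejected


valid, rejected = resolve_and_validate(
    "https://waha.devlike.pro/docs/sitemap.xml",
    ["/docs/apps/", "https://example.com/a", "mailto:x@example.com", ""],
)
print(valid)     # ['https://waha.devlike.pro/docs/apps/', 'https://example.com/a']
print(rejected)  # ['mailto:x@example.com', '']
```

If rejected is non-empty, the caller could attach those entries to the task's error state so the failure is visible instead of silent.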

Related Issues

#130 - Surface crawling errors in UI (closed, but gap remains for this case)
#607 - Different issue (URLs containing "sitemap" in path)

Service Status (check all that are working)

[x] 🖥️ Frontend UI (http://localhost:3737)
[x] ⚙️ Main Server (http://localhost:8181)
[x] 🔗 MCP Service (localhost:8051)
[ ] 🤖 Agents Service (http://localhost:8052)
[ ] 💾 Supabase Database (connected)

Oct 27 '25 09:10 thiagomaf

Thanks for reporting. I just checked and can verify the error/problem.

Nov 07 '25 20:11 leex279