Archon
🐛 [Bug]: Sitemap ingestion fails when XML contains relative URLs; task disappears from UI without error
Archon Version
v0.1.0
Bug Severity
🟠 High - Blocks important features
Bug Description
When attempting to ingest a sitemap.xml that contains relative URLs (e.g., /docs/apps/ instead of https://example.com/docs/apps/), the ingestion fails silently. The task disappears from the UI without any error message, but the server logs show URL validation errors.
Many sitemap generators output relative URLs to keep the XML portable across domains, making this a common scenario.
Steps to Reproduce
- Go to Knowledge Base page
- Click "Add Knowledge"
- Enter sitemap URL: https://waha.devlike.pro/docs/sitemap.xml
- Click "Add Source"
- Task starts, then disappears from UI
- Check Docker logs to see actual error
Expected Behavior
Option 1 (Preferred):
- Sitemap parser should compose absolute URLs from the base URL + relative paths
- Example: https://waha.devlike.pro + /docs/apps/ = https://waha.devlike.pro/docs/apps/
- Crawling should proceed successfully
Option 2 (Fallback):
- If relative URLs can't be handled, show clear error in UI
- Error message: "Sitemap contains relative URLs. Please provide a sitemap with absolute URLs or use the base URL directly."
- Task should remain visible with error state, not disappear
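A minimal sketch of the Option 2 fallback, assuming the parser hands over a plain list of extracted URL strings. The names find_relative_urls and check_sitemap_urls and the shape of the returned status dict are hypothetical and are not taken from the Archon codebase; only the error message text comes from this report.

```python
from urllib.parse import urlparse

# Error text proposed in this issue for the fallback behavior.
RELATIVE_URL_ERROR = (
    "Sitemap contains relative URLs. Please provide a sitemap with "
    "absolute URLs or use the base URL directly."
)


def find_relative_urls(urls: list[str]) -> list[str]:
    """Return every entry that lacks an http(s) scheme or a host."""
    relative = []
    for url in urls:
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            relative.append(url)
    return relative


def check_sitemap_urls(urls: list[str]) -> dict:
    """Build a status record the UI could display (hypothetical shape)."""
    rejected = find_relative_urls(urls)
    if rejected:
        return {"status": "error", "message": RELATIVE_URL_ERROR, "rejected": rejected}
    return {"status": "ok", "message": "", "rejected": []}
```

Returning a status record rather than raising would let the task stay visible in an error state, which is the behavior requested above.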
Actual Behavior
- Sitemap parser extracts relative URLs as-is
- Crawler receives /docs/apps/ and rejects it (not a valid URL)
- Task disappears from UI without any error notification
- User has no indication of what went wrong
Error Details (if any)
[ERROR]... × /docs/apps/ | Error:
Unexpected error in _crawl_web at line 500 in crawl
(../venv/lib/python3.12/site-packages/crawl4ai/async_crawler_strategy.py):
Error: URL must start with 'http://', 'https://', 'file://', or 'raw:'
Code context:
495 status_code=status_code,
496 screenshot=screenshot_data,
497 get_delayed_content=None,
498 )
499 else:
500 → raise ValueError(
501 "URL must start with 'http://', 'https://', 'file://', or 'raw:'"
502 )
Affected Component
🔍 Knowledge Base / RAG
Browser & OS
Edge on Windows 10
Additional Context
Example Sitemap Entry
<url>
<loc>/docs/apps/</loc>
<lastmod>2020-10-06T08:48:45+00:00</lastmod>
<changefreq>monthly</changefreq>
<priority>0.5</priority>
</url>
Affected Components
- 🔍 Knowledge Base / RAG (sitemap parsing logic)
- 🖥️ Frontend UI (error handling and display)
Technical Notes
Root Causes:
- Sitemap parser (python/src/server/services/crawling/strategies/sitemap.py) likely extracts <loc> content verbatim without checking if it's relative
- URL composition logic missing: No code to combine base URL with relative paths
- UI error handling: Failed ingestion tasks disappear instead of showing error state
Suggested Fix:
- Use urllib.parse.urljoin(base_url, relative_url) to compose absolute URLs
- Validate composed URLs before passing to crawler
- Surface errors to UI instead of silent failure
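The first two points of the suggested fix could look roughly like the sketch below. It assumes the sitemap body is already available as a string and that the sitemap's own URL serves as the base for resolution. The function name resolve_sitemap_urls is hypothetical and does not reflect the actual code in sitemap.py.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin, urlparse


def resolve_sitemap_urls(sitemap_xml: str, sitemap_url: str) -> list[str]:
    """Return absolute http(s) URLs for every <loc> entry in a sitemap."""
    root = ET.fromstring(sitemap_xml)
    resolved = []
    for element in root.iter():
        # Match <loc> whether or not the tag carries the sitemap namespace.
        if element.tag.rsplit("}", 1)[-1] != "loc":
            continue
        text = (element.text or "").strip()
        if not text:
            continue
        # urljoin keeps absolute URLs unchanged and resolves relative
        # ones against the sitemap's own location.
        absolute = urljoin(sitemap_url, text)
        parts = urlparse(absolute)
        # Validate before handing anything to the crawler.
        if parts.scheme in ("http", "https") and parts.netloc:
            resolved.append(absolute)
    return resolved
```

With the sitemap from this report, the entry /docs/apps/ resolved against https://waha.devlike.pro/docs/sitemap.xml becomes https://waha.devlike.pro/docs/apps/, which passes the crawler's scheme check.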
Related Issues
- #130 - Surface crawling errors in UI (closed, but gap remains for this case)
- #607 - Different issue (URLs containing "sitemap" in path)
Service Status (check all that are working)
- [x] 🖥️ Frontend UI (http://localhost:3737)
- [x] ⚙️ Main Server (http://localhost:8181)
- [x] 🔗 MCP Service (localhost:8051)
- [ ] 🤖 Agents Service (http://localhost:8052)
- [ ] 💾 Supabase Database (connected)
Thanks for reporting. I just checked and can verify the error/problem.