Archon
🐛 [Bug]: Crawling gets confused by URLs ending with /sitemap
Archon Version
0.1.0
Bug Severity
🟡 Medium - Affects functionality
Bug Description
I was trying to crawl this URL: https://nx.dev/see-also/sitemap. That page has all the links to the relevant pages I'd like to scrape, but it is an HTML page, not an XML file. The crawler tries to treat it like a sitemap.xml and fails.
Steps to Reproduce
From the knowledge base, try to crawl the URL https://nx.dev/see-also/sitemap
Expected Behavior
The site should have been crawled
Actual Behavior
I got an error: Crawl failed: No content was crawled from the provided URL
Error Details (if any)
archon-server | 2025-09-06 20:31:08 | src.server.services.crawling.strategies.sitemap | INFO | Parsing sitemap: https://nx.dev/see-also/sitemap
archon-server | 2025-09-06 20:31:08 | src.server.services.crawling.strategies.sitemap | ERROR | Error parsing sitemap XML from https://nx.dev/see-also/sitemap
archon-server | Traceback (most recent call last):
archon-server | File "/app/src/server/services/crawling/strategies/sitemap.py", line 45, in parse_sitemap
archon-server | tree = ElementTree.fromstring(resp.content)
archon-server | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
archon-server | File "/usr/local/lib/python3.12/xml/etree/ElementTree.py", line 1335, in XML
archon-server | parser.feed(text)
archon-server | xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 3, column 84
archon-server | 2025-09-06 20:31:08 | src.server.services.crawling.crawling_service | ERROR | Async crawl orchestration failed
archon-server | Traceback (most recent call last):
archon-server | File "/app/src/server/services/crawling/crawling_service.py", line 361, in _async_orchestrate_crawl
archon-server | raise ValueError("No content was crawled from the provided URL")
archon-server | ValueError: No content was crawled from the provided URL
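The traceback suggests the sitemap strategy was chosen from the URL alone and then failed when the response turned out not to be XML. One possible mitigation, shown here only as a minimal sketch and not necessarily what Archon does or will do, is to attempt the XML parse and fall back to collecting ordinary HTML links when it fails. The names `extract_urls` and `_LinkCollector` are hypothetical and do not come from the Archon codebase.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from xml.etree import ElementTree


class _LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_urls(content: bytes, base_url: str) -> list[str]:
    """Return URLs from an XML sitemap, or from HTML links if it is not XML."""
    try:
        root = ElementTree.fromstring(content)
    except ElementTree.ParseError:
        root = None  # not well-formed XML, so probably an HTML page

    if root is not None:
        # Match <loc> elements regardless of the sitemap namespace prefix.
        locs = [
            el.text.strip()
            for el in root.iter()
            if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
        ]
        if locs:
            return locs

    # Fallback: treat the body as HTML and resolve links against base_url.
    collector = _LinkCollector()
    collector.feed(content.decode("utf-8", errors="replace"))
    return [urljoin(base_url, href) for href in collector.hrefs]
```

With this approach a page such as https://nx.dev/see-also/sitemap would yield its anchor links instead of raising a parse error, while a real sitemap.xml would still be read through its `<loc>` entries.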
Affected Component
🔍 Knowledge Base / RAG
Browser & OS
Kubuntu Linux 24.04, Chrome 139.0.7258.66 (Official Build) (64-bit)
Additional Context
No response
Service Status (check all that are working)
- [x] 🖥️ Frontend UI (http://localhost:3737)
- [x] ⚙️ Main Server (http://localhost:8181)
- [x] 🔗 MCP Service (localhost:8051)
- [ ] 🤖 Agents Service (http://localhost:8052)
- [x] 💾 Supabase Database (connected)
Thanks for reporting. I verified the bug and created bugfix PR #611.