
🐛 [Bug]: Crawling gets confused by urls ending with /sitemap

Open • gideoncatz opened this issue 3 months ago • 1 comment

Archon Version

0.1.0

Bug Severity

🟡 Medium - Affects functionality

Bug Description

I was trying to crawl this URL: https://nx.dev/see-also/sitemap. That page does have all the links to the relevant pages I'd like to scrape, but it is an HTML page, not an XML file. The crawler tries to treat it like a sitemap.xml and fails.

Steps to Reproduce

From the knowledge base, try to crawl the URL https://nx.dev/see-also/sitemap

Expected Behavior

The site should have been crawled

Actual Behavior

I got an error: Crawl failed: No content was crawled from the provided URL

Error Details (if any)

archon-server      | 2025-09-06 20:31:08 | src.server.services.crawling.strategies.sitemap | INFO | Parsing sitemap: https://nx.dev/see-also/sitemap
archon-server      | 2025-09-06 20:31:08 | src.server.services.crawling.strategies.sitemap | ERROR | Error parsing sitemap XML from https://nx.dev/see-also/sitemap
archon-server      | Traceback (most recent call last):
archon-server      |   File "/app/src/server/services/crawling/strategies/sitemap.py", line 45, in parse_sitemap
archon-server      |     tree = ElementTree.fromstring(resp.content)
archon-server      |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
archon-server      |   File "/usr/local/lib/python3.12/xml/etree/ElementTree.py", line 1335, in XML
archon-server      |     parser.feed(text)
archon-server      | xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 3, column 84
archon-server      | 2025-09-06 20:31:08 | src.server.services.crawling.crawling_service | ERROR | Async crawl orchestration failed
archon-server      | Traceback (most recent call last):
archon-server      |   File "/app/src/server/services/crawling/crawling_service.py", line 361, in _async_orchestrate_crawl
archon-server      |     raise ValueError("No content was crawled from the provided URL")
archon-server      | ValueError: No content was crawled from the provided URL
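
The traceback above shows `ElementTree.fromstring` raising `ParseError` on the response body. As a rough illustration of one way a crawler could tell a real XML sitemap from an HTML page whose URL merely ends in /sitemap, here is a minimal sketch. The helper name `looks_like_xml_sitemap` and the idea of falling back to a normal page crawl are assumptions made for illustration; this is not Archon's code and not necessarily what the bugfix PR does.

```python
from xml.etree import ElementTree

# Root element names defined by the sitemaps.org protocol.
SITEMAP_ROOT_TAGS = {"urlset", "sitemapindex"}


def looks_like_xml_sitemap(body: bytes) -> bool:
    """Hypothetical helper: True only if body is XML with a sitemap root."""
    try:
        root = ElementTree.fromstring(body)
    except ElementTree.ParseError:
        # Typical HTML is not well-formed XML, so it ends up here.
        return False
    # Drop a namespace prefix such as
    # "{http://www.sitemaps.org/schemas/sitemap/0.9}urlset".
    local_name = root.tag.rsplit("}", 1)[-1]
    return local_name in SITEMAP_ROOT_TAGS


xml_sitemap = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>https://example.com/docs</loc></url>"
    b"</urlset>"
)
html_page = b"<!DOCTYPE html>\n<html><body><a href=/docs>Docs</a></body></html>"

print(looks_like_xml_sitemap(xml_sitemap))  # True
print(looks_like_xml_sitemap(html_page))    # False
```

With a check like this, a URL that fails the test could be handed to the regular page crawler instead of aborting the whole crawl.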

Affected Component

🔍 Knowledge Base / RAG

Browser & OS

Kubuntu Linux 24.04, Chrome 139.0.7258.66 (Official Build) (64-bit)

Additional Context

No response

Service Status (check all that are working)

  • [x] 🖥️ Frontend UI (http://localhost:3737)
  • [x] ⚙️ Main Server (http://localhost:8181)
  • [x] 🔗 MCP Service (localhost:8051)
  • [ ] 🤖 Agents Service (http://localhost:8052)
  • [x] 💾 Supabase Database (connected)

gideoncatz • Sep 06 '25 20:09

Thanks for reporting. I verified the bug and created a bugfix PR: #611

leex279 • Sep 07 '25 12:09