🐛 [Bug/Feature]: Not respecting robots.txt

Open • leex279 opened this issue 4 months ago • 3 comments

Archon Version

0.1.0

Bug Severity

🟡 Medium - Affects functionality

Bug Description

The Crawler is not respecting robots.txt at the moment.

Short Claude Code Analysis: Current State

The Archon project uses Crawl4AI version 0.6.2 for web crawling, which is initialized through the CrawlerManager class. The crawler configuration includes:

  1. No robots.txt checking: There's no code that reads or parses robots.txt files
  2. No user-agent delay: The crawler doesn't implement crawl delays specified in robots.txt
  3. Aggressive crawling settings: The configuration includes options such as (roughly sketched below):
     - --disable-web-security
     - --aggressive-cache-discard
     - multiple performance optimizations that prioritize speed over politeness
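For illustration only, a configuration of this shape might look roughly like the sketch below. The flags and the Chrome user-agent string are stand-ins to show the pattern, not a copy of Archon's code, and the sketch assumes Crawl4AI's BrowserConfig accepts headless, user_agent, and extra_args as in its documented examples.

```python
# Illustrative sketch of an aggressive, browser-impersonating configuration.
# Flags and user-agent string are placeholders, not Archon's actual values.
from crawl4ai import BrowserConfig

browser_config = BrowserConfig(
    headless=True,
    user_agent=(
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36"
    ),
    extra_args=["--disable-web-security", "--aggressive-cache-discard"],
)
```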

Key Issues

  1. Missing robots.txt parser: The codebase doesn't import or use Python's urllib.robotparser or any similar library
  2. No pre-crawl validation: The crawling strategies (single page, batch, recursive) don't check robots.txt before fetching URLs (a minimal check is sketched after this list)
  3. User-Agent spoofing: The crawler uses a Chrome user-agent string to appear as a regular browser rather than identifying itself as a bot
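Below is a minimal sketch of the missing pre-crawl check, using only Python's standard urllib.robotparser. The ArchonBot user-agent string and helper names are placeholders, not existing Archon code.

```python
# Minimal robots.txt pre-check with the standard library; names are placeholders.
from urllib import robotparser
from urllib.parse import urlparse, urljoin

BOT_USER_AGENT = "ArchonBot/0.1"  # hypothetical bot identity

_parsers: dict[str, robotparser.RobotFileParser] = {}  # one parser per origin


def is_allowed(url: str, user_agent: str = BOT_USER_AGENT) -> bool:
    """Return True if the host's robots.txt permits fetching this URL."""
    origin = "{0.scheme}://{0.netloc}".format(urlparse(url))
    parser = _parsers.get(origin)
    if parser is None:
        parser = robotparser.RobotFileParser(urljoin(origin, "/robots.txt"))
        try:
            parser.read()  # fetch and parse robots.txt once per origin
        except OSError:
            parser.allow_all = True  # robots.txt unreachable: fail open (policy choice)
        _parsers[origin] = parser
    return parser.can_fetch(user_agent, url)
```

Caching one parser per origin keeps the overhead to a single extra request per host.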

Crawl4AI Library

The Crawl4AI library (v0.6.2) being used doesn't appear to have built-in robots.txt support either. The CrawlerRunConfig and BrowserConfig classes don't expose any parameters for robots.txt compliance.
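Since the library does not expose this, compliance has to be layered on top of it. A rough sketch, assuming Crawl4AI's documented AsyncWebCrawler / BrowserConfig / arun() usage and reusing the is_allowed() helper and BOT_USER_AGENT placeholder sketched above:

```python
# Sketch only: gate Crawl4AI behind a robots.txt check and a bot user agent.
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def polite_crawl(url: str):
    if not is_allowed(url):
        return None  # skip URLs that robots.txt disallows
    browser_config = BrowserConfig(user_agent=BOT_USER_AGENT)  # identify as a bot
    async with AsyncWebCrawler(config=browser_config) as crawler:
        return await crawler.arun(url=url)
```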

Recommendation

To make the crawler respect robots.txt, you would need to do the following (a rough sketch follows the list):

  1. Add Python's urllib.robotparser to check robots.txt before crawling
  2. Implement crawl delays based on Crawl-delay directives
  3. Use a proper bot user-agent that identifies the crawler
  4. Check each URL against robots.txt rules before adding it to the crawl queue
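A rough sketch of how points 2 and 4 could build on the helpers above; the per-host bookkeeping and the 1-second default delay are assumptions, not existing Archon code. Point 3 is handled by the polite_crawl() sketch earlier, which sets the bot user-agent.

```python
# Sketch: honor Crawl-delay per host and filter URLs before fetching.
import asyncio
import time
from urllib.parse import urlparse

_last_fetch: dict[str, float] = {}  # host -> monotonic time of last request


def crawl_delay_for(url: str, user_agent: str = BOT_USER_AGENT) -> float:
    """Crawl-delay from robots.txt for this host, or a 1-second default."""
    origin = "{0.scheme}://{0.netloc}".format(urlparse(url))
    parser = _parsers.get(origin)
    delay = parser.crawl_delay(user_agent) if parser else None
    return float(delay) if delay else 1.0


async def throttled_fetch(url: str):
    """Check robots.txt rules and wait out Crawl-delay before fetching."""
    if not is_allowed(url):  # point 4: validate before queueing/fetching
        return None
    host = urlparse(url).netloc
    wait = crawl_delay_for(url) - (time.monotonic() - _last_fetch.get(host, 0.0))
    if wait > 0:  # point 2: honor Crawl-delay between requests to the same host
        await asyncio.sleep(wait)
    _last_fetch[host] = time.monotonic()
    return await polite_crawl(url)  # sketched above; uses the bot user-agent
```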

This is an important ethical and legal consideration for web crawling that should be addressed in the alpha version.

Steps to Reproduce

Crawl a site whose robots.txt disallows some paths or sets a crawl delay.

Expected Behavior

The crawler respects the rules in robots.txt (disallow directives and Crawl-delay).

Actual Behavior

See the bug description above.

Error Details (if any)


Affected Component

🔍 Knowledge Base / RAG

Browser & OS

Doesn't matter

Additional Context

No response

Service Status (check all that are working)

  • [ ] 🖥️ Frontend UI (http://localhost:3737)
  • [x] ⚙️ Main Server (http://localhost:8181)
  • [ ] 🔗 MCP Service (localhost:8051)
  • [ ] 🤖 Agents Service (http://localhost:8052)
  • [ ] 💾 Supabase Database (connected)

leex279 • Aug 17 '25 18:08

Thanks Thomas! This will be important to add; adding it to our board.

coleam00 • Aug 18 '25 14:08

I tried crawling https://baserow.io/user-docs and the crawl didn't proceed due to robots.txt. So I take it it's working? @coleam00, what's the best approach in this case for adding the docs? (I know they also have an MCP, but let's assume I prefer docs :) )

gvago • Oct 23 '25 06:10

@gvago The crawler does not really respect robots.txt at the moment, but you are right that it's not crawling your site; that is a bug in the automatic llm-txt/sitemap discovery. I'm fixing that together with robots.txt support here.

So the answer to your question is: you can just crawl it like any other page then :)

leex279 • Nov 07 '25 22:11