🐛 [Bug/Feature]: Not respecting robots.txt

Open • leex279 opened this issue 4 months ago • 3 comments

Archon Version

0.1.0

Bug Severity

🟡 Medium - Affects functionality

Bug Description

The Crawler is not respecting robots.txt at the moment.

Short Claude Code Analysis: Current State

The Archon project uses Crawl4AI version 0.6.2 for web crawling, which is initialized through the CrawlerManager class. The crawler configuration includes:

  1. No robots.txt checking: There's no code that reads or parses robots.txt files
  2. No user-agent delay: The crawler doesn't implement crawl delays specified in robots.txt
  3. Aggressive crawling settings: The configuration includes options such as (roughly sketched below):
     - --disable-web-security
     - --aggressive-cache-discard
     - multiple performance optimizations that prioritize speed over politeness
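For illustration only, a configuration of this shape might look roughly like the sketch below. The flags and the Chrome user-agent string are stand-ins to show the pattern, not a copy of Archon's code, and the sketch assumes Crawl4AI's BrowserConfig accepts headless, user_agent, and extra_args as in its documented examples.

```python
# Illustrative sketch of an aggressive, browser-impersonating configuration.
# Flags and user-agent string are placeholders, not Archon's actual values.
from crawl4ai import BrowserConfig

browser_config = BrowserConfig(
    headless=True,
    user_agent=(
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36"
    ),
    extra_args=["--disable-web-security", "--aggressive-cache-discard"],
)
```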

Key Issues

  1. Missing robots.txt parser: The codebase doesn't import or use Python's urllib.robotparser or any similar library
  2. No pre-crawl validation: The crawling strategies (single page, batch, recursive) don't check robots.txt before fetching URLs (a minimal check is sketched after this list)
  3. User-Agent spoofing: The crawler uses a Chrome user-agent string to appear as a regular browser rather than identifying itself as a bot
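Below is a minimal sketch of the missing pre-crawl check, using only Python's standard urllib.robotparser. The ArchonBot user-agent string and helper names are placeholders, not existing Archon code.

```python
# Minimal robots.txt pre-check with the standard library; names are placeholders.
from urllib import robotparser
from urllib.parse import urlparse, urljoin

BOT_USER_AGENT = "ArchonBot/0.1"  # hypothetical bot identity

_parsers: dict[str, robotparser.RobotFileParser] = {}  # one parser per origin


def is_allowed(url: str, user_agent: str = BOT_USER_AGENT) -> bool:
    """Return True if the host's robots.txt permits fetching this URL."""
    origin = "{0.scheme}://{0.netloc}".format(urlparse(url))
    parser = _parsers.get(origin)
    if parser is None:
        parser = robotparser.RobotFileParser(urljoin(origin, "/robots.txt"))
        try:
            parser.read()  # fetch and parse robots.txt once per origin
        except OSError:
            parser.allow_all = True  # robots.txt unreachable: fail open (policy choice)
        _parsers[origin] = parser
    return parser.can_fetch(user_agent, url)
```

Caching one parser per origin keeps the overhead to a single extra request per host.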

Crawl4AI Library

The Crawl4AI library (v0.6.2) being used doesn't appear to have built-in robots.txt support either. The CrawlerRunConfig and BrowserConfig classes don't expose any parameters for robots.txt compliance.
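Since the library does not expose this, compliance has to be layered on top of it. A rough sketch, assuming Crawl4AI's documented AsyncWebCrawler / BrowserConfig / arun() usage and reusing the is_allowed() helper and BOT_USER_AGENT placeholder sketched above:

```python
# Sketch only: gate Crawl4AI behind a robots.txt check and a bot user agent.
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def polite_crawl(url: str):
    if not is_allowed(url):
        return None  # skip URLs that robots.txt disallows
    browser_config = BrowserConfig(user_agent=BOT_USER_AGENT)  # identify as a bot
    async with AsyncWebCrawler(config=browser_config) as crawler:
        return await crawler.arun(url=url)
```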

Recommendation

To make the crawler respect robots.txt, you would need to do the following (a rough sketch follows the list):

  1. Add Python's urllib.robotparser to check robots.txt before crawling
  2. Implement crawl delays based on Crawl-delay directives
  3. Use a proper bot user-agent that identifies the crawler
  4. Check each URL against robots.txt rules before adding it to the crawl queue
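A rough sketch of how points 2 and 4 could build on the helpers above; the per-host bookkeeping and the 1-second default delay are assumptions, not existing Archon code. Point 3 is handled by the polite_crawl() sketch earlier, which sets the bot user-agent.

```python
# Sketch: honor Crawl-delay per host and filter URLs before fetching.
import asyncio
import time
from urllib.parse import urlparse

_last_fetch: dict[str, float] = {}  # host -> monotonic time of last request


def crawl_delay_for(url: str, user_agent: str = BOT_USER_AGENT) -> float:
    """Crawl-delay from robots.txt for this host, or a 1-second default."""
    origin = "{0.scheme}://{0.netloc}".format(urlparse(url))
    parser = _parsers.get(origin)
    delay = parser.crawl_delay(user_agent) if parser else None
    return float(delay) if delay else 1.0


async def throttled_fetch(url: str):
    """Check robots.txt rules and wait out Crawl-delay before fetching."""
    if not is_allowed(url):  # point 4: validate before queueing/fetching
        return None
    host = urlparse(url).netloc
    wait = crawl_delay_for(url) - (time.monotonic() - _last_fetch.get(host, 0.0))
    if wait > 0:  # point 2: honor Crawl-delay between requests to the same host
        await asyncio.sleep(wait)
    _last_fetch[host] = time.monotonic()
    return await polite_crawl(url)  # sketched above; uses the bot user-agent
```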

This is an important ethical and legal consideration for web crawling that should be addressed in the alpha version.

Steps to Reproduce

Crawl a site whose robots.txt disallows some paths or sets a crawl delay.

Expected Behavior

The crawler respects the rules in robots.txt (disallow directives and Crawl-delay).

Actual Behavior

See the bug description above.

Error Details (if any)


Affected Component

🔍 Knowledge Base / RAG

Browser & OS

Doesn't matter

Additional Context

No response

Service Status (check all that are working)

  • [ ] 🖥️ Frontend UI (http://localhost:3737)
  • [x] ⚙️ Main Server (http://localhost:8181)
  • [ ] 🔗 MCP Service (localhost:8051)
  • [ ] 🤖 Agents Service (http://localhost:8052)
  • [ ] 💾 Supabase Database (connected)

leex279 • Aug 17 '25 18:08

Thanks Thomas! This will be important to add; adding it to our board.

coleam00 • Aug 18 '25 14:08

I tried crawling https://baserow.io/user-docs and the crawl didn't proceed due to robots.txt. So I take it it's working? @coleam00, what's the best approach in this case for adding the docs? (I know they also have an MCP, but let's assume I prefer docs :) )

gvago • Oct 23 '25 06:10

@gvago The crawler does not really respect robots.txt at the moment, but you are right that it's not crawling your site; that is a bug in the automatic llm-txt/sitemap discovery. I'm fixing that together with robots.txt support here.

So the answer to your question is: you can just crawl it like any other page then :)

leex279 • Nov 07 '25 22:11