🐛 [Bug/Feature]: Not respecting robots.txt
Archon Version
0.1.0
Bug Severity
🟡 Medium - Affects functionality
Bug Description
The crawler is not respecting robots.txt at the moment.
Short Claude Code Analysis: Current State
The Archon project uses Crawl4AI version 0.6.2 for web crawling, which is initialized through the CrawlerManager class. The crawler configuration includes:
- No robots.txt checking: There's no code that reads or parses robots.txt files
- No user-agent delay: The crawler doesn't implement crawl delays specified in robots.txt
- Aggressive crawling settings: The configuration includes options like:
  - `--disable-web-security`
  - `--aggressive-cache-discard`
  - Multiple performance optimizations that prioritize speed over politeness
Key Issues
- Missing robots.txt parser: The codebase doesn't import or use Python's urllib.robotparser or any similar library
- No pre-crawl validation: The crawling strategies (single page, batch, recursive) don't check robots.txt before fetching URLs (see the sketch after this list)
- User-Agent spoofing: The crawler uses a Chrome user-agent string to appear as a regular browser rather than identifying itself as a bot
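For illustration, a minimal pre-crawl check with Python's standard-library `urllib.robotparser` could look like this. It's only a sketch: the `BOT_UA` value and the allow-on-error fallback are assumptions, not Archon code.

```python
from urllib import robotparser
from urllib.parse import urlsplit

# Hypothetical bot identity; Archon would choose its own UA string.
BOT_UA = "ArchonCrawler/0.1 (+https://github.com/coleam00/Archon)"

def is_allowed(url: str, user_agent: str = BOT_UA) -> bool:
    """Return True if the host's robots.txt permits fetching this URL."""
    parts = urlsplit(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()  # fetch and parse /robots.txt
    except OSError:
        return True  # assumed policy: unreachable robots.txt means allow
    return rp.can_fetch(user_agent, url)
```

A check like this could run in each crawling strategy before a URL is fetched or queued.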
Crawl4AI Library
The Crawl4AI library (v0.6.2) being used doesn't appear to have built-in robots.txt support either. The CrawlerRunConfig and BrowserConfig classes don't expose any parameters for robots.txt compliance.
Recommendation
To make the crawler respect robots.txt, you would need to:
- Add Python's urllib.robotparser to check robots.txt before crawling
- Implement crawl delays based on Crawl-delay directives
- Use a proper bot user-agent that identifies the crawler
- Check each URL against robots.txt rules before adding it to the crawl queue (see the sketch below)
This is an important ethical and legal consideration for web crawling that should be addressed in the alpha version.
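As a rough sketch of how the pieces listed above could fit together, here is one possible shape. Everything in it is hypothetical: `RobotsGate`, `crawl_politely`, the `BOT_UA` string, the `fetch` callback, and the allow-on-error fallback are assumptions for illustration, not Archon or Crawl4AI code.

```python
import asyncio
from urllib import robotparser
from urllib.parse import urlsplit

# Hypothetical bot identity; not Archon's actual user-agent.
BOT_UA = "ArchonCrawler/0.1 (+https://github.com/coleam00/Archon)"

class RobotsGate:
    """Caches one RobotFileParser per host and exposes allow/delay checks."""

    def __init__(self, user_agent: str = BOT_UA):
        self.user_agent = user_agent
        self._parsers: dict[str, robotparser.RobotFileParser] = {}

    def _parser_for(self, url: str) -> robotparser.RobotFileParser:
        parts = urlsplit(url)
        host = f"{parts.scheme}://{parts.netloc}"
        if host not in self._parsers:
            rp = robotparser.RobotFileParser()
            rp.set_url(f"{host}/robots.txt")
            try:
                rp.read()
            except OSError:
                rp.allow_all = True  # assumed fallback: allow if robots.txt is unreachable
            self._parsers[host] = rp
        return self._parsers[host]

    def allowed(self, url: str) -> bool:
        return self._parser_for(url).can_fetch(self.user_agent, url)

    def delay(self, url: str) -> float:
        # Crawl-delay is optional; fall back to no extra delay.
        d = self._parser_for(url).crawl_delay(self.user_agent)
        return float(d) if d else 0.0

async def crawl_politely(urls: list[str], fetch) -> None:
    """Filter a URL queue through robots.txt and honour Crawl-delay.

    `fetch` is a placeholder for whatever coroutine the crawling strategy
    actually calls per URL.
    """
    gate = RobotsGate()
    for url in urls:
        if not gate.allowed(url):
            continue  # disallowed by robots.txt: never fetch or queue it
        await fetch(url)
        await asyncio.sleep(gate.delay(url))
```

The same `allowed()` check would also apply when the recursive strategy discovers new links, so disallowed URLs never enter the queue in the first place.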
Steps to Reproduce
Crawl a site with a robots.txt.
Expected Behavior
The crawler respects the rules in robots.txt (disallow directives and crawl delays).
Actual Behavior
See the description above.
Error Details (if any)
Affected Component
🔍 Knowledge Base / RAG
Browser & OS
Doesn't matter
Additional Context
No response
Service Status (check all that are working)
- [ ] 🖥️ Frontend UI (http://localhost:3737)
- [x] ⚙️ Main Server (http://localhost:8181)
- [ ] 🔗 MCP Service (localhost:8051)
- [ ] 🤖 Agents Service (http://localhost:8052)
- [ ] 💾 Supabase Database (connected)
Thanks Thomas! This will be important to add - adding it to our board.
I tried crawling https://baserow.io/user-docs - the crawl didn't proceed due to robots.txt, so I take it it's working? @coleam00 what's the best approach in this case to add the docs? (I know they also have an MCP, but let's assume I prefer docs :) )
@gvago the crawler does not really respect robots.txt at the moment, but you are right, it's not crawling your site - that is a bug in the automatic llm-txt/sitemap discovery. I'll fix that together with respecting robots.txt here.
So the answer to your question is: you can just crawl it like other pages then :)