🐛 [Bug]: Llms.txt not crawling fully.
Archon Version
current 8/22
Bug Severity
🟡 Medium - Affects functionality
Bug Description
I tried crawling https://langfuse.com/llms.txt and it didn't crawl the URLs listed in the llms.txt. Has this happened to anyone else? I also tried other llms.txt files to make sure it wasn't specific to this one, and I tested on the most recent version. I'm currently working on a fix and am glad to submit a PR if it's not just me, but I want to confirm I'm not the only one having this issue.
Docker Archon-Server Logs:
2025-08-22 05:25:20 | src.server.api_routes.socketio_handlers | INFO | ✅ [SOCKETIO] Broadcasted crawl progress for de3f8c32-60de-40c6-9460-0e134047fc07
2025-08-22 05:25:20 | src.server.services.crawling.strategies.single_page | INFO | Crawling markdown file: https://langfuse.com/llms.txt
[FETCH]... ↓ https://langfuse.com/llms.txt | ✓ | ⏱: 0.02s
[COMPLETE] ● https://langfuse.com/llms.txt | ✓ | ⏱: 0.03s
2025-08-22 05:25:20 | src.server.services.crawling.strategies.single_page | INFO | Successfully crawled markdown file: https://langfuse.com/llms.txt
2025-08-22 05:25:21 | httpx | INFO | HTTP Request: GET https://zdpwxgixdlmlypbwjauj.supabase.co/rest/v1/archon_settings?select=%2A&category=eq.rag_strategy "HTTP/2 200 OK"
2025-08-22 05:25:21 | search | INFO | Generating summary for langfuse.com using model: gpt-4.1-nano
Steps to Reproduce
1. Go to Knowledge Base.
2. Click +Knowledge at the top and try to crawl an llms.txt file.
Expected Behavior
Crawls all the URLs listed in llms.txt.
Actual Behavior
Crawls just the llms.txt page itself.
Error Details (if any)
Affected Component
🔍 Knowledge Base / RAG
Browser & OS
chrome 139
Additional Context
No response
Service Status (check all that are working)
- [x] 🖥️ Frontend UI (http://localhost:3737)
- [x] ⚙️ Main Server (http://localhost:8181)
- [x] 🔗 MCP Service (localhost:8051)
- [x] 🤖 Agents Service (http://localhost:8052)
- [x] 💾 Supabase Database (connected)
I have fixed the issue and will submit the PR shortly. I re-downloaded the main branch to confirm the problem, and sure enough it's not crawling llms.txt / llms-full.txt properly. This will fix that!
@Chillbruhhh Thanks for this! By design, when it receives just an llms.txt it only crawls that single page, because it assumes it's an llms-full.txt and doesn't need to crawl URLs recursively.
I'm open to a PR to change this though! It isn't really a bug, but this would make a nice enhancement.
Does your PR treat llms.txt differently than llms-full.txt? Or no need to?
@coleam00 I tested both llms-full.txt and llms.txt and couldn't get llms.txt to work. I downloaded the newest version twice to make sure I wasn't crazy here. I have submitted PR #437 for it and have it fully working! The PR does exactly what we need: it detects when an llms.txt file contains links and automatically extracts and crawls them.
- llms.txt with embedded links → detects it's a link collection, extracts all URLs, and batch crawls them
Smart detection:
- Filename-based detection for files like llms.txt, full-llms.txt, links.md, etc.
- Content-based detection using link density analysis (more than 2% of the content is links, with at least 3 links in total)
- Supports all URL formats: Markdown links, bare URLs, autolinks, etc.
No distinction needed between llms.txt vs llms-full.txt - it automatically handles both cases. If someone uploads an llms.txt that's actually a link collection it will crawl all the links. If it's just descriptive text, it crawls as a single page. It treats llms-full properly.
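For anyone following along, here is a rough sketch of how the content-based check described above could work. This is not the code from the PR: the function names, the regex, and the choice to measure link density as the share of characters that belong to URLs are all assumptions made for illustration. Filename-based detection is left out.

```python
import re

# Hypothetical pattern: one regex that picks up URLs inside Markdown links,
# autolinks (<https://...>), and bare URLs by stopping at brackets and spaces.
URL_RE = re.compile(r"https?://[^\s<>()\[\]]+")


def extract_links(text):
    """Return unique http(s) URLs in order of first appearance."""
    urls = []
    for match in URL_RE.finditer(text):
        # Drop sentence punctuation that often trails a bare URL.
        url = match.group(0).rstrip(".,;:!?")
        if url not in urls:
            urls.append(url)
    return urls


def is_link_collection(text, min_links=3, min_density=0.02):
    """Assumed heuristic: at least 3 links and more than 2% of characters are URLs."""
    links = extract_links(text)
    if not text or len(links) < min_links:
        return False
    link_chars = sum(len(url) for url in links)
    return link_chars / len(text) > min_density


def urls_to_crawl(source_url, text):
    """Crawl the listed links for a link collection, else just the page itself."""
    if is_link_collection(text):
        return extract_links(text)
    return [source_url]
```

With this shape, a file that is mostly prose falls back to the single-page behavior, while an index-style llms.txt expands into the list of URLs to batch crawl. The thresholds are taken from the numbers quoted above and are exposed as parameters so they can be tuned.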
My bad, Cole. I thought Archon was supposed to crawl llms.txt and all the links in it for RAG.
@coleam00 I suppose https://llmstxt.org/ describes proper behavior of llms.txt which the pull-request referenced above has seemingly implemented.
What's the verdict here? llms-full.txt is working nice and clean. For an llms.txt that contains URLs, does Archon recursively expand the URLs to emulate llms-full.txt (everything in one text file)? If not, how do you use URLs like the one the OP posted, https://langfuse.com/llms.txt, to effectively build code examples so we can actually search in depth using Archon?
@GioPetro I've opened PR #437 that fixes this. From the sounds of it, it'll be merged any day now.
Closing this since we have the PR in!