
🐛 [Bug]: Llms.txt not crawling fully.

Open Chillbruhhh opened this issue 4 months ago • 7 comments

Archon Version

current 8/22

Bug Severity

🟡 Medium - Affects functionality

Bug Description

I tried crawling https://langfuse.com/llms.txt, and for some reason it didn't crawl the actual URLs listed in the llms.txt. Has this happened to anyone else? I also tried other llms.txt files to make sure I wasn't tripping, and I'm on the most recent version. I'm currently fixing this and am glad to submit a PR if it's not just me, but I want to confirm I'm not the only one having this issue.


Docker Archon-Server Logs:

2025-08-22 05:25:20 | src.server.api_routes.socketio_handlers | INFO | ✅ [SOCKETIO] Broadcasted crawl progress for de3f8c32-60de-40c6-9460-0e134047fc07

2025-08-22 05:25:20 | src.server.services.crawling.strategies.single_page | INFO | Crawling markdown file: https://langfuse.com/llms.txt

[FETCH]... ↓ https://langfuse.com/llms.txt | ✓ | ⏱: 0.02s

[COMPLETE] ● https://langfuse.com/llms.txt | ✓ | ⏱: 0.03s

2025-08-22 05:25:20 | src.server.services.crawling.strategies.single_page | INFO | Successfully crawled markdown file: https://langfuse.com/llms.txt

2025-08-22 05:25:21 | httpx | INFO | HTTP Request: GET https://zdpwxgixdlmlypbwjauj.supabase.co/rest/v1/archon_settings?select=%2A&category=eq.rag_strategy "HTTP/2 200 OK"

2025-08-22 05:25:21 | search | INFO | Generating summary for langfuse.com using model: gpt-4.1-nano

Steps to Reproduce

1. Go to Knowledge Base.
2. Click +Knowledge at the top and try to crawl an llms.txt file.

Expected Behavior

Crawls all the URLs listed in llms.txt.

Actual Behavior

Crawls just the llms.txt page itself.

Error Details (if any)


Affected Component

🔍 Knowledge Base / RAG

Browser & OS

chrome 139

Additional Context

No response

Service Status (check all that are working)

  • [x] 🖥️ Frontend UI (http://localhost:3737)
  • [x] ⚙️ Main Server (http://localhost:8181)
  • [x] 🔗 MCP Service (localhost:8051)
  • [x] 🤖 Agents Service (http://localhost:8052)
  • [x] 💾 Supabase Database (connected)

Chillbruhhh avatar Aug 22 '25 05:08 Chillbruhhh


I have fixed the issue and will submit the PR shortly. I redownloaded the main branch to confirm it's not working, and sure enough, it's not crawling llms.txt / llms-full.txt properly. This will fix that!

Chillbruhhh avatar Aug 22 '25 07:08 Chillbruhhh

@Chillbruhhh Thanks for this! By design, when it receives just an llms.txt it only crawls that single page, because it assumes it's an llms-full.txt and doesn't need to crawl URLs recursively.

I'm open for a PR to change this though! It isn't really a bug but this will make a nice enhancement.

Does your PR treat llms.txt differently than llms-full.txt? Or no need to?

coleam00 avatar Aug 22 '25 12:08 coleam00

@coleam00 I tested both llms-full.txt and llms.txt and couldn't get llms.txt to work; I downloaded the newest version twice to make sure I wasn't crazy here. I have submitted PR #437 for it and have it fully working! The PR does exactly what we need: it detects when an llms.txt file contains links and automatically extracts and crawls them.

  • llms.txt with embedded links → detects it's a link collection, extracts all URLs, and batch crawls them

Smart detection:

  • Filename-based detection for files like llms.txt, full-llms.txt, links.md, etc.
  • Content-based detection using link density analysis (if >2% of content is links with 3+ total links)
  • Supports all URL formats: markdown links `[text](url)`, bare URLs, autolinks `<url>`, etc.

No distinction needed between llms.txt vs llms-full.txt - it automatically handles both cases. If someone uploads an llms.txt that's actually a link collection it will crawl all the links. If it's just descriptive text, it crawls as a single page. It treats llms-full properly.
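The content-based check described above could look roughly like the following. This is only a minimal sketch, not the code from the PR: the function names `extract_links` and `is_link_collection` are hypothetical, "link density" is assumed to mean the share of characters that belong to URLs, and the filename-based hint is left out. Only the 2% and 3-link thresholds come from the comment.

```python
import re

# One pattern covers markdown links, autolinks, and bare URLs, because the URL
# text itself appears in all three. URLs containing parentheses are truncated.
URL_RE = re.compile(r"https?://[^\s<>()\[\]]+")


def extract_links(content: str) -> list[str]:
    """Return unique http(s) URLs in order of first appearance."""
    seen: dict[str, None] = {}
    for match in URL_RE.findall(content):
        url = match.rstrip(".,;:!?")  # drop trailing sentence punctuation
        seen.setdefault(url, None)
    return list(seen)


def is_link_collection(content: str, min_links: int = 3,
                       min_density: float = 0.02) -> bool:
    """Heuristic: enough links, and URLs make up enough of the text.

    Density here is (characters in URLs) / (characters in content); this
    definition is an assumption made for illustration.
    """
    links = extract_links(content)
    if len(links) < min_links:
        return False
    density = sum(len(url) for url in links) / max(len(content), 1)
    return density > min_density


sample = (
    "# Docs\n\n"
    "- [Intro](https://example.com/intro): start here\n"
    "- <https://example.com/api>\n"
    "- https://example.com/faq.\n"
)
prose = "Langfuse is an observability platform. " * 50 + "See https://example.com"
```

With these definitions, `is_link_collection(sample)` is true and `is_link_collection(prose)` is false, so a file that is mostly descriptive text would still be crawled as a single page.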

Chillbruhhh avatar Aug 22 '25 12:08 Chillbruhhh

My bad, Cole, I thought Archon was supposed to crawl llms.txt and all of its links for RAG.

Chillbruhhh avatar Aug 22 '25 12:08 Chillbruhhh

@coleam00 I suppose https://llmstxt.org/ describes proper behavior of llms.txt which the pull-request referenced above has seemingly implemented.
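For reference, the format described on llmstxt.org puts the links in H2 sections ("file lists"), where each list item is a markdown link optionally followed by a colon and notes. A rough sketch of reading that structure is below; `parse_llms_txt` is a hypothetical helper written for illustration, not something taken from Archon or the pull request, and it ignores everything outside the H2 sections.

```python
import re

# A file-list item: "- [name](url)" with optional ": notes" after it.
ITEM_RE = re.compile(
    r"^\s*[-*]\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?::\s*(?P<notes>.*))?$"
)


def parse_llms_txt(text: str) -> dict[str, list[tuple[str, str, str]]]:
    """Map each H2 section name to its (title, url, notes) entries."""
    sections: dict[str, list[tuple[str, str, str]]] = {}
    current = None
    for line in text.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            m = ITEM_RE.match(line)
            if m:
                notes = (m["notes"] or "").strip()
                sections[current].append((m["title"], m["url"], notes))
    return sections


doc = (
    "# Example\n\n> Short summary\n\n"
    "## Docs\n\n"
    "- [Get started](https://example.com/start): Intro\n"
    "- [API](https://example.com/api)\n\n"
    "## Optional\n\n"
    "- [Blog](https://example.com/blog)\n"
)
parsed = parse_llms_txt(doc)
```

A crawler could then queue every URL in `parsed`, and might choose to skip the section named `Optional`, which the format reserves for secondary material.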

OlegZee avatar Sep 02 '25 17:09 OlegZee

What's the verdict here? llms-full.txt is working nice and clean. For an llms.txt that includes URLs, does Archon recursively expand the URLs to emulate llms-full.txt (all in one txt)? If not, how do you use URLs like the one the OP posted, https://langfuse.com/llms.txt, to effectively get code examples, so we can actually search in depth using Archon?

GioPetro avatar Sep 06 '25 09:09 GioPetro

> What's the verdict here? llms-full.txt is working nice and clean. For an llms.txt that includes URLs, does Archon recursively expand the URLs to emulate llms-full.txt (all in one txt)? If not, how do you use URLs like the one the OP posted, https://langfuse.com/llms.txt, to effectively get code examples, so we can actually search in depth using Archon?

@GioPetro I've opened PR #437 that fixes this; from the sounds of it, it'll be merged any day now.

Chillbruhhh avatar Sep 06 '25 09:09 Chillbruhhh

Closing this since we have the PR in!

coleam00 avatar Sep 20 '25 18:09 coleam00