🐛 [Bug]: Crawler inserts spaces after/around slashes in code examples
Archon Version
v0.1.0 with reranking RAG strategy enabled
Bug Severity
🟢 Low - Minor inconvenience
Bug Description
After crawling https://nextjs.org/docs/llms-full.txt, I noticed that all import paths and URLs in the code examples have been 'enriched' with unexpected spaces after and around subsequent slashes (/). I'm not sure if it also modifies other parts of the code.
Examples:
- import db from '@/lib/db' is stored as import db from '@/ lib / db' (it now has an additional space after the first slash and around the second one).
- const res = await fetch('https://.../item/1') is stored as const res = await fetch('https://.../ item /1') (additional space before and after item).
What I've discovered so far:
- The unmodified code block is sent to the LLM when asked for 'example names and summaries'.
- The space-enriched code block is stored in the archon_code_examples table, which seems to indicate that the code is being modified somewhere between the LLM analysis and database insertion.
Steps to Reproduce
- Add https://nextjs.org/docs/llms-full.txt (or any other file containing code examples with slashes) to the Knowledge Base.
- After completing the indexation, open the Code Browser.
- Search for import or URL and check the path string.
Expected Behavior
It should not modify the code examples.
Actual Behavior
It injects unexpected spaces, rendering the code examples invalid.
Error Details (if any)
Affected Component
🔍 Knowledge Base / RAG
Browser & OS
Firefox 142, macOS 15.6, Docker Engine 28.3.2
Additional Context
Service Status (check all that are working)
- [x] 🖥️ Frontend UI (http://localhost:3737)
- [x] ⚙️ Main Server (http://localhost:8181)
- [x] 🔗 MCP Service (localhost:8051)
- [x] 🤖 Agents Service (http://localhost:8052)
- [x] 💾 Supabase Database (connected)
Thx for reporting. I tested it myself and could reproduce it.
@Team Quick Code Analysis + possible solutions
Bug Analysis Report: Crawler Space Injection in Code Examples
Issue Summary
The crawler is inserting unwanted spaces around slashes (/) in code examples, making import paths and URLs invalid. For example:
- import db from '@/lib/db' becomes import db from '@/ lib / db'
- fetch('https://.../item/1') becomes fetch('https://.../ item /1')
Root Cause Analysis
The issue occurs in the _decode_html_entities method in python/src/server/services/crawling/code_extraction_service.py (lines 1053-1105).
Specific Problem Location:
Lines 1059-1068
if "</span><span" in text:
    # This indicates syntax highlighting - preserve the structure
    text = re.sub(r"</span>", "", text)
    text = re.sub(r"<span[^>]*>", "", text)
else:
    # Normal span usage - might need spacing
    # Only add space if there isn't already whitespace
    text = re.sub(r"</span>(?=[A-Za-z0-9])", " ", text)  # ← PROBLEM LINE
    text = re.sub(r"<span[^>]*>", "", text)
Why This Happens
- HTML Structure: Many documentation sites use syntax highlighting that wraps individual tokens in <span> tags. For paths like /lib/db, parts of the path (for example the slashes) might be wrapped in their own <span> elements.
- Condition Check Failure: The current code only checks for the </span><span pattern to detect syntax highlighting. It doesn't account for patterns like </span>/ or /</span>.
- Space Injection: When the condition fails, the regex r"</span>(?=[A-Za-z0-9])" replaces each </span> that is followed by an alphanumeric character with a space, causing:
  - /lib → / lib (space added after slash)
  - lib/db → lib / db (spaces around slash)
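The mechanism can be reproduced in isolation with Python's standard re module. This is a minimal sketch: strip_spans_current is a hypothetical standalone copy of the branch described above, and the sample markup (only the slashes wrapped in spans) is an assumption, since the exact HTML of the crawled page is not shown in this report.

```python
import re


def strip_spans_current(text: str) -> str:
    """Hypothetical standalone copy of the span-handling branch described above."""
    if "</span><span" in text:
        # Adjacent spans: treated as syntax highlighting, tags removed as-is
        text = re.sub(r"</span>", "", text)
        text = re.sub(r"<span[^>]*>", "", text)
    else:
        # Otherwise a closing tag followed by a letter or digit becomes a space
        text = re.sub(r"</span>(?=[A-Za-z0-9])", " ", text)
        text = re.sub(r"<span[^>]*>", "", text)
    return text


# Assumed markup: only the slashes are wrapped in spans
sample = "'@<span>/</span>lib<span>/</span>db'"
print(strip_spans_current(sample))  # '@/ lib/ db'
```

With this input the sketch reproduces the space after each slash. The space before the second slash seen in the stored examples would have to come from different markup or from a step not covered here.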
Fix Options
Option 1: Enhanced Pattern Detection (Recommended)
Improve the condition to detect more syntax highlighting patterns:
def _decode_html_entities(self, text: str) -> str:
    """Decode common HTML entities and clean HTML tags from code."""
    import re

    # Check for various syntax highlighting patterns
    syntax_highlight_patterns = [
        "</span><span",  # Adjacent spans
        "</span>/",      # Span before slash
        "/</span>",      # Span after slash
        "</span>.",      # Span before dot (for chained methods)
        ".</span>",      # Span after dot
        "</span>:",      # Span before colon
        ":</span>",      # Span after colon
    ]
    has_syntax_highlighting = any(pattern in text for pattern in syntax_highlight_patterns)

    if has_syntax_highlighting:
        # Remove spans without adding spaces
        text = re.sub(r"</span>", "", text)
        text = re.sub(r"<span[^>]*>", "", text)
    else:
        # Normal span usage - add space only when truly needed
        text = re.sub(r"</span>(?=[A-Za-z0-9])", " ", text)
        text = re.sub(r"<span[^>]*>", "", text)
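To get a feel for how far the pattern list reaches, here is a small standalone sketch of the same branch logic. strip_spans_option1 is a hypothetical helper, not the real method, and both sample inputs are assumed markup.

```python
import re

# Substrings taken as a sign of token-level syntax highlighting (list from Option 1)
PATTERNS = ["</span><span", "</span>/", "/</span>", "</span>.", ".</span>", "</span>:", ":</span>"]


def strip_spans_option1(text: str) -> str:
    """Hypothetical standalone version of the Option 1 branch logic."""
    if any(p in text for p in PATTERNS):
        # Highlighted code: drop closing tags without adding anything
        text = re.sub(r"</span>", "", text)
    else:
        # Fallback: closing tag before a letter or digit becomes a space
        text = re.sub(r"</span>(?=[A-Za-z0-9])", " ", text)
    return re.sub(r"<span[^>]*>", "", text)


print(strip_spans_option1("'@<span>/</span>lib<span>/</span>db'"))  # '@/lib/db'
print(strip_spans_option1("<span>my_</span>var"))  # my_ var
```

The second call shows the trade-off: a token boundary that matches none of the listed patterns (here an underscore) still falls through to the space-inserting branch, so the list would need to grow as new cases turn up.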
Option 2: Smarter Space Insertion
Only add spaces where they make semantic sense:
def _decode_html_entities(self, text: str) -> str:
    """Decode common HTML entities and clean HTML tags from code."""
    import re

    # Remove spans but preserve the content structure
    # Don't add spaces around programming punctuation
    # (class body only, without outer brackets, so it can be embedded below)
    programming_chars = r"/\.\:\-\>\<\=\+\*\&\|\^\%\!\@\#\$\(\)\[\]\{\}\\"

    # Only add space if the closing tag is not preceded by programming punctuation
    text = re.sub(
        rf"(?<![{programming_chars}\s])</span>(?!<span|$)(?=[A-Za-z0-9])",
        " ",
        text,
    )
    text = re.sub(r"<span[^>]*>", "", text)
    text = re.sub(r"</span>", "", text)  # Remove remaining </span> tags
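One way to express a punctuation-aware rule is a lookbehind on the character just before the closing tag. The sketch below is only an illustration: strip_spans_punct_aware and the PUNCT list are hypothetical, and the inputs are assumed markup.

```python
import re

# Characters after which a closing tag should not turn into a space (assumed list)
PUNCT = r"/\.:\-><=+*&|^%!@#$()\[\]{}\\"


def strip_spans_punct_aware(text: str) -> str:
    """Hypothetical variant: insert a space only between two word-like tokens."""
    # Space only if the character before </span> is neither punctuation nor whitespace
    text = re.sub(rf"(?<![{PUNCT}\s])</span>(?=[A-Za-z0-9])", " ", text)
    # Drop every remaining opening and closing tag
    return re.sub(r"<span[^>]*>|</span>", "", text)


print(strip_spans_punct_aware("'@<span>/</span>lib<span>/</span>db'"))  # '@/lib/db'
print(strip_spans_punct_aware("<span>const</span>res"))  # const res
```

A lookahead alone cannot do this job, because a character that is alphanumeric is never punctuation. The punctuation that matters sits before the closing tag, which is why the check looks backwards.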
Option 3: Post-Processing Fix
Clean up known problematic patterns after span removal:
def _decode_html_entities(self, text: str) -> str:
    """Decode common HTML entities and clean HTML tags from code."""
    import re

    # Current span removal logic
    if "</span><span" in text:
        text = re.sub(r"</span>", "", text)
        text = re.sub(r"<span[^>]*>", "", text)
    else:
        text = re.sub(r"</span>(?=[A-Za-z0-9])", " ", text)
        text = re.sub(r"<span[^>]*>", "", text)

    # Fix common space injection issues
    text = re.sub(r"\s*/\s*", "/", text)  # Remove spaces around slashes
    text = re.sub(r"\s*\.\s*", ".", text)  # Remove spaces around dots
    text = re.sub(r"\s*:\s*", ":", text)  # Remove spaces around colons (careful with Python)

    # Fix specific patterns
    text = re.sub(r"@\s*/", "@/", text)  # Fix Next.js import aliases
    text = re.sub(r"https?\s*:\s*/\s*/", lambda m: m.group(0).replace(" ", ""), text)  # Fix URLs
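Because the cleanup regexes run over the whole snippet, they also touch spaces that were intentional. The sketch below isolates the three general substitutions in a hypothetical helper, collapse_punctuation_spaces, to show both the repair and the side effects.

```python
import re


def collapse_punctuation_spaces(text: str) -> str:
    """Hypothetical helper applying the three general cleanup regexes from Option 3."""
    text = re.sub(r"\s*/\s*", "/", text)   # spaces around slashes
    text = re.sub(r"\s*\.\s*", ".", text)  # spaces around dots
    text = re.sub(r"\s*:\s*", ":", text)   # spaces around colons
    return text


# Repairs the reported corruption
print(collapse_punctuation_spaces("import db from '@/ lib / db'"))  # import db from '@/lib/db'
# Also rewrites legitimate code
print(collapse_punctuation_spaces("ratio = total / count"))  # ratio = total/count
print(collapse_punctuation_spaces("from . import x"))  # from.import x
```

The last line is no longer valid Python, which suggests that post-processing would need to be restricted to string literals or to known path patterns.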
Option 4: Complete Rewrite with Better Logic
Use a more sophisticated approach that understands code structure:
def _decode_html_entities(self, text: str) -> str:
    """Decode common HTML entities and clean HTML tags from code."""
    import re

    # First pass: Mark positions where spans were removed
    # This helps us understand the original structure
    markers = []

    def mark_span_removal(match):
        markers.append(match.span())
        return ""

    # Remove all span tags and track their positions
    text = re.sub(r"<span[^>]*>|</span>", mark_span_removal, text)

    # No need to add spaces - the original text structure is preserved
    # The spans were just wrapping existing characters
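A compact sketch of the strip-only idea, combined with entity decoding through the standard library's html.unescape. clean_code_html is a hypothetical name, the inputs are assumed markup, and how the real method decodes entities is not shown in this report.

```python
import html
import re


def clean_code_html(text: str) -> str:
    """Hypothetical helper: remove span tags, then decode HTML entities."""
    # Strip tags first so that escaped markup inside the code survives as text
    text = re.sub(r"<span[^>]*>|</span>", "", text)
    return html.unescape(text)


print(clean_code_html("<span>if</span> (a &lt; b) fetch(<span>'/item/1'</span>)"))
# if (a < b) fetch('/item/1')
print(clean_code_html("&lt;span&gt;hi&lt;/span&gt;"))
# <span>hi</span>
```

The trade-off is that nothing is ever inserted: markup such as `<span>Hello</span>world` comes out as Helloworld, so any spacing that the page conveyed only through layout is lost.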
Recommended Solution
Option 1 is the most balanced approach because:
- It's a minimal change to existing code
- It handles the specific cases causing issues
- It preserves the existing logic for non-syntax-highlighted content
- It's easy to extend with more patterns if needed
Testing Recommendations
After implementing the fix, test with:
- Next.js documentation (https://nextjs.org/docs/llms-full.txt)
- URLs with various protocols
- Import statements with different alias patterns (@/, ~/, ../)
- Method chaining (object.method1().method2())
- File paths on different OS (C:\path\file, /usr/bin/)
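The cases above could be captured as a small table-driven check. This is a sketch under assumptions: strip_spans is a placeholder for whichever fix is chosen, the highlighted inputs are made-up markup, and example.com stands in for a real URL.

```python
import re


def strip_spans(text: str) -> str:
    """Placeholder for the function under test: removes span tags, adds nothing."""
    return re.sub(r"<span[^>]*>|</span>", "", text)


# (highlighted input, expected plain code); the markup is assumed for illustration
CASES = [
    ("import db from '@<span>/</span>lib<span>/</span>db'", "import db from '@/lib/db'"),
    ("import x from '~<span>/</span>utils'", "import x from '~/utils'"),
    ("import y from '..<span>/</span>lib'", "import y from '../lib'"),
    ("fetch('https:<span>/</span><span>/</span>example.com<span>/</span>item<span>/</span>1')",
     "fetch('https://example.com/item/1')"),
    ("<span>object</span>.<span>method1</span>().<span>method2</span>()",
     "object.method1().method2()"),
    (r"C:<span>\</span>path<span>\</span>file", r"C:\path\file"),
    ("<span>/</span>usr<span>/</span>bin<span>/</span>", "/usr/bin/"),
]


def failing_cases(clean) -> list[tuple[str, str, str]]:
    """Return (input, expected, actual) for every case the cleaner gets wrong."""
    return [(src, want, clean(src)) for src, want in CASES if clean(src) != want]


print(failing_cases(strip_spans))  # []
```

Passing a different cleaner to failing_cases lists exactly which inputs it still corrupts, which makes it easy to compare the fix options against each other.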
Impact Assessment
- Severity: Medium - Corrupts stored code examples making them syntactically invalid
- Scope: Affects all crawled documentation with syntax-highlighted code
- Data Recovery: Existing corrupted entries need re-crawling after fix
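To decide which stored entries need re-crawling, a rough heuristic could flag string literals that contain whitespace next to a slash. The sketch below works on plain strings only. suspicious_literals is a hypothetical helper, and how the rows are read from the database is left out because the table schema is not described here.

```python
import re

# Quoted string literals on a single line (simple heuristic, no escape handling)
_STRING_LITERAL = re.compile(r"""(['"])([^'"\n]*)\1""")
# A slash with whitespace directly before or after it
_SPACED_SLASH = re.compile(r"\s/|/\s")


def suspicious_literals(code: str) -> list[str]:
    """Return string literals whose content has whitespace next to a slash."""
    return [
        m.group(2)
        for m in _STRING_LITERAL.finditer(code)
        if _SPACED_SLASH.search(m.group(2))
    ]


print(suspicious_literals("import db from '@/ lib / db'"))  # ['@/ lib / db']
print(suspicious_literals("import db from '@/lib/db'"))  # []
```

False positives are possible, for example a prose string such as 'yes / no', so the result is better treated as a candidate list for review than as a definitive filter.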
This is a sneaky one! Actually seen this bug before in several RAG and agent platforms, especially where code or path snippets are preprocessed for chunking or “safe” display. It’s usually not the LLM or markdown renderer itself, but a normalization step that tries to “escape” or format slashes, and accidentally injects whitespace on line/paragraph boundaries.
If you want a root-cause checklist (we’ve catalogued this as ProblemMap No.10: “token drift / chunk boundary corruption”), I’m happy to share it—it includes quick pattern checks and safe preprocessing fixes. This pattern comes up often when bridging between code/documentation and retrieval layers.
If you’re interested, I can drop the checklist here (MIT, free, no infra changes—just patch your pipeline and done). Let me know—this one’s fully solvable, and lots of teams have hit the same gotcha!
Thanks for reporting this @ewildee! Adding this to the board
will review this, i might have fixed it in #514 but needs manual checking
Current State => not fixed