
🐛 [Bug]: Crawler inserts spaces after/around slashes in code examples

Open ewildee opened this issue 4 months ago • 5 comments

Archon Version

v0.1.0 with reranking RAG strategy enabled

Bug Severity

🟢 Low - Minor inconvenience

Bug Description

After crawling https://nextjs.org/docs/llms-full.txt, I noticed that all import paths and URLs in the code examples have been 'enriched' with unexpected spaces after, and sometimes around, the slashes (/). I'm not sure whether other parts of the code are modified as well.

Examples:

  • import db from '@/lib/db' is stored as import db from '@/ lib / db' (it now has an additional space after the first slash and around the second one).
  • const res = await fetch('https://.../item/1') is stored as const res = await fetch('https://.../ item /1') (additional space before and after item).

What I've discovered so far:

  • The unmodified code block is sent to the LLM when it is asked for 'example names and summaries'.
  • The space-enriched code block is stored in the archon_code_examples table, which seems to indicate that the code is being modified somewhere between the LLM analysis and database insertion.

Steps to Reproduce

  1. Add https://nextjs.org/docs/llms-full.txt (or any other file containing code examples with slashes) to the Knowledge Base.
  2. After indexing completes, open the Code Browser.
  3. Search for an import or URL and check the path string.

Expected Behavior

It should not modify the code examples.

Actual Behavior

It injects unexpected spaces, rendering the code examples invalid.

Error Details (if any)


Affected Component

🔍 Knowledge Base / RAG

Browser & OS

Firefox 142, macOS 15.6, Docker Engine 28.3.2

Additional Context


Service Status (check all that are working)

  • [x] 🖥️ Frontend UI (http://localhost:3737)
  • [x] ⚙️ Main Server (http://localhost:8181)
  • [x] 🔗 MCP Service (localhost:8051)
  • [x] 🤖 Agents Service (http://localhost:8052)
  • [x] 💾 Supabase Database (connected)

ewildee avatar Aug 17 '25 11:08 ewildee

Thanks for reporting. I tested it myself and could reproduce it.


@Team Quick Code Analysis + possible solutions

Bug Analysis Report: Crawler Space Injection in Code Examples

Issue Summary

The crawler is inserting unwanted spaces around slashes (/) in code examples, making import paths and URLs invalid. For example:

  • import db from '@/lib/db' becomes import db from '@/ lib / db'
  • fetch('https://.../item/1') becomes fetch('https://.../ item /1')

Root Cause Analysis

The issue occurs in the _decode_html_entities method in python/src/server/services/crawling/code_extraction_service.py (lines 1053-1105).

Specific Problem Location:

Lines 1059-1068

if "<span" in text: # This indicates syntax highlighting - preserve the structure text = re.sub(r"", "", text) text = re.sub(r"<span[^>]>", "", text) else: # Normal span usage - might need spacing # Only add space if there isn't already whitespace text = re.sub(r"(?=[A-Za-z0-9])", " ", text) # ← PROBLEM LINE text = re.sub(r"<span[^>]>", "", text)

Why This Happens

  1. HTML Structure: Many documentation sites use syntax highlighting that wraps individual tokens in <span> tags. For a path like /lib/db, the HTML might look like <span>/</span>lib<span>/</span>db, with each slash sitting in its own span.
  2. Condition Check Failure: The current code only checks for the </span><span pattern to detect syntax highlighting. It doesn't account for patterns like </span>/ or /</span>.
  3. Space Injection: When the condition fails, the regex r"</span>(?=[A-Za-z0-9])" adds a space wherever </span> is immediately followed by an alphanumeric character, causing (see the reproduction sketch after this list):
     • /lib → / lib (space added after the slash)
     • lib/db → lib / db (spaces around the slash)
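
To make the mechanism concrete, here is a minimal reproduction sketch (not Archon code). The HTML fragment is hypothetical - any highlighter output that puts each slash in its own <span> directly followed by bare identifier text will do - and the two re.sub calls mirror the else branch quoted above.

  import re

  # Hypothetical highlighter output: each slash sits in its own <span>,
  # immediately followed by bare identifier text
  html = "'@<span>/</span>lib<span>/</span>db'"

  # "</span><span" never occurs here, so the current code takes the else branch
  text = re.sub(r"</span>(?=[A-Za-z0-9])", " ", html)  # injects a space before lib and db
  text = re.sub(r"<span[^>]*>", "", text)

  print(text)  # '@/ lib/ db'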

Fix Options

Option 1: Enhanced Pattern Detection (Recommended)

Improve the condition to detect more syntax highlighting patterns:

def _decode_html_entities(self, text: str) -> str:
  """Decode common HTML entities and clean HTML tags from code."""
  import re

  # Check for various syntax highlighting patterns
  syntax_highlight_patterns = [
      "</span><span",  # Adjacent spans
      "</span>/",      # Span before slash
      "/</span>",      # Span after slash
      "</span>.",      # Span before dot (for chained methods)
      ".</span>",      # Span after dot
      "</span>:",      # Span before colon
      ":</span>",      # Span after colon
  ]

  has_syntax_highlighting = any(pattern in text for pattern in syntax_highlight_patterns)

  if has_syntax_highlighting:
      # Remove spans without adding spaces
      text = re.sub(r"</span>", "", text)
      text = re.sub(r"<span[^>]*>", "", text)
  else:
      # Normal span usage - add space only when truly needed
      text = re.sub(r"</span>(?=[A-Za-z0-9])", " ", text)
      text = re.sub(r"<span[^>]*>", "", text)

Option 2: Smarter Space Insertion

Only add spaces where they make semantic sense:

def _decode_html_entities(self, text: str) -> str:
  """Decode common HTML entities and clean HTML tags from code."""
  import re

  # Remove spans but preserve the content structure
  # Don't add spaces around programming punctuation
  programming_chars = r"[/\.\:\-\>\<\=\+\*\&\|\^\%\!\@\#\$\(\)\[\]\{\}\\]"

  # Only add space if not adjacent to programming punctuation
  text = re.sub(
      rf"</span>(?![{programming_chars}\s]|<span|$)(?=[A-Za-z0-9])",
      " ",
      text
  )
  text = re.sub(r"<span[^>]*>", "", text)
  text = re.sub(r"</span>", "", text)  # Remove remaining </span> tags

Option 3: Post-Processing Fix

Clean up known problematic patterns after span removal:

def _decode_html_entities(self, text: str) -> str:
  """Decode common HTML entities and clean HTML tags from code."""
  import re

  # Current span removal logic
  if "</span><span" in text:
      text = re.sub(r"</span>", "", text)
      text = re.sub(r"<span[^>]*>", "", text)
  else:
      text = re.sub(r"</span>(?=[A-Za-z0-9])", " ", text)
      text = re.sub(r"<span[^>]*>", "", text)

  # Fix common space injection issues
  text = re.sub(r"\s*/\s*", "/", text)  # Remove spaces around slashes
  text = re.sub(r"\s*\.\s*", ".", text)  # Remove spaces around dots
  text = re.sub(r"\s*:\s*", ":", text)  # Remove spaces around colons (careful with Python)

  # Fix specific patterns
  text = re.sub(r"@\s*/", "@/", text)  # Fix Next.js import aliases
  text = re.sub(r"https?\s*:\s*/\s*/", lambda m: m.group(0).replace(" ", ""), text)  # Fix URLs

Option 4: Complete Rewrite with Better Logic

Use a more sophisticated approach that understands code structure:

def _decode_html_entities(self, text: str) -> str:
  """Decode common HTML entities and clean HTML tags from code."""
  import re

  # First pass: Mark positions where spans were removed
  # This helps us understand the original structure
  markers = []

  def mark_span_removal(match):
      markers.append(match.span())
      return ""

  # Remove all span tags and track their positions
  text = re.sub(r"<span[^>]*>|</span>", "", text)

  # No need to add spaces - the original text structure is preserved
  # The spans were just wrapping existing characters

Recommended Solution

Option 1 is the most balanced approach because:

  1. It's a minimal change to existing code
  2. It handles the specific cases causing issues
  3. It preserves the existing logic for non-syntax-highlighted content
  4. It's easy to extend with more patterns if needed

Testing Recommendations

After implementing the fix, test with the following inputs; a regression-test sketch is included after the list:

  1. Next.js documentation (https://nextjs.org/docs/llms-full.txt)
  2. URLs with various protocols
  3. Import statements with different alias patterns (@/, ~/, ../)
  4. Method chaining (object.method1().method2())
  5. File paths on different OS (C:\path\file, /usr/bin/)
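
A regression-test sketch for these cases might look like the following (pytest style). The highlighted-HTML inputs are invented, and the import path, class name, and no-argument constructor are assumptions that need checking against the actual module:

  import pytest

  # Assumed import path and class name - adjust to the real module layout
  from src.server.services.crawling.code_extraction_service import CodeExtractionService

  CASES = [
      # (invented highlighter output, expected cleaned code)
      ("import db from '@<span>/</span>lib<span>/</span>db'",
       "import db from '@/lib/db'"),
      ("fetch('<span>https</span>:<span>//example.com</span>/<span>item</span>/<span>1</span>')",
       "fetch('https://example.com/item/1')"),
      ("obj<span>.</span>method1()<span>.</span>method2()",
       "obj.method1().method2()"),
      ("<span>/usr</span>/<span>bin</span>/",
       "/usr/bin/"),
  ]

  @pytest.mark.parametrize("raw,expected", CASES)
  def test_decode_html_entities_keeps_paths_intact(raw, expected):
      service = CodeExtractionService()  # assumed no-arg constructor
      assert service._decode_html_entities(raw) == expected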

Impact Assessment

  • Severity: Medium - corrupts stored code examples, making them syntactically invalid
  • Scope: Affects all crawled documentation with syntax-highlighted code
  • Data Recovery: Existing corrupted entries need re-crawling after the fix (a query sketch for locating them follows below)
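
If it helps with the cleanup, here is a rough sketch for spotting likely-corrupted rows via the Supabase Python client. The column names (content, url) are assumptions about the archon_code_examples schema, the environment variable names are placeholders, and the "%/ %" pattern is only a heuristic, so expect some false positives:

  import os
  from supabase import create_client

  # Placeholder env var names - use whatever the deployment already provides
  supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_KEY"])

  # Rough heuristic: code whose content contains "/ " probably had spaces
  # injected around slashes (column names are assumptions)
  rows = (
      supabase.table("archon_code_examples")
      .select("id,url,content")
      .like("content", "%/ %")
      .execute()
  )

  for row in rows.data:
      print(row["id"], row["url"])  # candidates for re-crawling their source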

leex279 avatar Aug 17 '25 20:08 leex279

This is a sneaky one! I've actually seen this bug before in several RAG and agent platforms, especially where code or path snippets are preprocessed for chunking or “safe” display. It's usually not the LLM or markdown renderer itself, but a normalization step that tries to “escape” or format slashes and accidentally injects whitespace at line/paragraph boundaries.

If you want a root-cause checklist (we’ve catalogued this as ProblemMap No.10: “token drift / chunk boundary corruption”), I’m happy to share it—it includes quick pattern checks and safe preprocessing fixes. This pattern comes up often when bridging between code/documentation and retrieval layers.

If you’re interested, I can drop the checklist here (MIT, free, no infra changes—just patch your pipeline and done). Let me know—this one’s fully solvable, and lots of teams have hit the same gotcha!

onestardao avatar Aug 18 '25 05:08 onestardao

Thanks for reporting this @ewildee! Adding this to the board

coleam00 avatar Aug 18 '25 14:08 coleam00

Will review this; I might have fixed it in #514, but it needs manual checking.

Wirasm avatar Sep 04 '25 10:09 Wirasm

Current State => not fixed


leex279 avatar Nov 05 '25 22:11 leex279