🐛 [Bug]: Contextual embeddings batch processing corrupts chunks by mixing multiple contexts into single database entries

Open mikenfly opened this issue 4 months ago • 1 comment

Archon Version

0.0.0

Bug Severity

🟡 Medium - Affects functionality

Bug Description

When using the OpenAI provider with USE_CONTEXTUAL_EMBEDDINGS enabled, batch processing corrupts document chunks by writing the contextual descriptions of several chunks into a single database entry. As a result, RAG search returns irrelevant results, because each affected chunk's embedding represents several unrelated topics at once.

Steps to Reproduce

1. Configure OpenAI as the LLM provider in Settings
2. Enable USE_CONTEXTUAL_EMBEDDINGS in Settings
3. Crawl a website (e.g., docs.anthropic.com)
4. Check the content field in the archon_crawled_pages table in Supabase
5. Observe that single chunk entries contain multiple "CHUNK X:" labels with mixed contextual descriptions

Expected Behavior

Each database row should contain only one chunk with its own contextual description and original content.

Actual Behavior

Database rows contain mixed content from multiple chunks. Example from actual database:

Introduction and navigation overview of the Anthropic documentation site, including links to key sections such as research, login, support, and developer resources related to Claude Code.

CHUNK 2: Detailed navigation and content outline of the Anthropic Claude Code documentation, covering getting started guides, SDKs, workflows, deployment, and troubleshooting resources to help developers build with Claude Code.\n\n[Anthropic home page![light logo](https://mintlify.s3.us-west-1.amazonaws.com/anthropic/logo/light.svg)... [rest of content continues with actual page content]

The contextual description for CHUNK 2 is incorrectly included in what should be CHUNK 1's database entry.
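To check how many rows are affected, a heuristic like the following could be run over the content values exported from archon_crawled_pages. This is only a sketch: the function name `looks_corrupted` is hypothetical, and the pattern assumes that leftover batch labels always take the form "CHUNK <number>:". A page whose own text contains such a label would be flagged as a false positive.

```python
import re

# Assumption: leaked batch labels look like "CHUNK 2:" somewhere in the stored content.
_BATCH_LABEL = re.compile(r"\bCHUNK\s+\d+\s*:")


def looks_corrupted(content: str) -> bool:
    """Return True if a stored row's content still carries a batch label."""
    return bool(_BATCH_LABEL.search(content))


if __name__ == "__main__":
    row = "Introduction and navigation overview.\n\nCHUNK 2: Detailed navigation outline."
    print(looks_corrupted(row))
    print(looks_corrupted("A chunk with a single contextual description."))
```

Rows flagged this way would still need a manual look, since the check cannot tell a leaked label from legitimate page text.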

Error Details (if any)

No error messages - the corruption happens silently during batch processing of contextual embeddings.

Affected Component

🔍 Knowledge Base / RAG

Browser & OS

Chrome on macOS

Additional Context

The issue appears to be in the batch processing logic in contextual_embedding_service.py. When the LLM returns multiple chunk contexts formatted as "CHUNK 1: [context]\nCHUNK 2: [context]", the parsing logic fails to properly separate them, resulting in multiple contexts being stored in single database entries. This completely breaks RAG functionality as vector searches match against corrupted embeddings.
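For illustration, here is a minimal sketch of how a batched reply in the "CHUNK N: [context]" format could be split into one context per chunk. It is not the code in contextual_embedding_service.py, which I have not traced in full. The name `split_chunk_contexts` and the `expected` parameter are hypothetical, and the sketch assumes each label starts on its own line. Because the example row above begins without a "CHUNK 1:" label, any unlabeled leading text is treated as chunk 1.

```python
import re

# Assumption: each context starts on its own line with a label such as "CHUNK 2:".
_CHUNK_LABEL = re.compile(r"^\s*CHUNK\s+(\d+)\s*:\s*", re.MULTILINE)


def split_chunk_contexts(response_text: str, expected: int) -> list[str]:
    """Split a batched LLM reply into exactly `expected` context strings."""
    matches = list(_CHUNK_LABEL.finditer(response_text))
    contexts: dict[int, str] = {}

    # Text before the first label is taken as chunk 1 (the model may omit that label).
    leading_end = matches[0].start() if matches else len(response_text)
    leading = response_text[:leading_end].strip()
    if leading:
        contexts[1] = leading

    for i, match in enumerate(matches):
        # A context runs from the end of its label to the start of the next label.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(response_text)
        contexts[int(match.group(1))] = response_text[match.end():end].strip()

    # Chunks the model skipped get an empty context so the caller can fall back
    # to embedding the raw chunk instead of storing someone else's context.
    return [contexts.get(n, "") for n in range(1, expected + 1)]


if __name__ == "__main__":
    reply = "CHUNK 1: Site introduction.\nCHUNK 2: Navigation outline."
    print(split_chunk_contexts(reply, expected=2))
```

Returning a fixed-length list keeps the contexts aligned with the input chunks by index, so a missing or extra label cannot shift one chunk's context onto another row.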

Service Status (check all that are working)

[x] 🖥️ Frontend UI (http://localhost:3737)
[x] ⚙️ Main Server (http://localhost:8181)
[x] 🔗 MCP Service (localhost:8051)
[x] 🤖 Agents Service (http://localhost:8052)
[x] 💾 Supabase Database (connected)

Aug 21 '25 16:08 mikenfly

@coleam00 Do you have capacity to look into this one?

Thank you for reporting this @mikenfly

Sep 04 '25 13:09 Wirasm