🐛 [Bug]: Contextual embeddings batch processing corrupts chunks by mixing multiple contexts into single database entries
Archon Version
0.0.0
Bug Severity
🟡 Medium - Affects functionality
Bug Description
When the OpenAI provider is used with USE_CONTEXTUAL_EMBEDDINGS enabled, batch processing corrupts document chunks by mixing the contextual descriptions of several chunks into a single database entry. RAG search then returns irrelevant results, because each affected chunk's embedding represents several unrelated topics at once.
Steps to Reproduce
- Configure OpenAI as the LLM provider in Settings
- Enable USE_CONTEXTUAL_EMBEDDINGS in Settings
- Crawl a website (e.g., docs.anthropic.com)
- Check the content field in archon_crawled_pages table in Supabase
- Observe that single chunk entries contain multiple "CHUNK X:" labels with mixed contextual descriptions
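For anyone who wants to check their own data, the last step can be sketched roughly as below. This is only a rough check, not part of Archon: it assumes the rows have already been fetched as dicts with a `content` key (the column named above) and a hypothetical `id` key, and it flags any row whose content has a "CHUNK N:" label at the start of a line.

```python
import re

# A leaked batch label such as "CHUNK 2:" at the start of a line.
LEAKED_LABEL = re.compile(r"^\s*CHUNK\s+\d+\s*:", re.MULTILINE)


def find_corrupted_rows(rows):
    """Return the ids of rows whose content contains a leaked chunk label.

    `rows` is assumed to be an iterable of dicts with "id" and "content"
    keys; the "id" key is a placeholder for whatever identifies a row.
    """
    return [row["id"] for row in rows if LEAKED_LABEL.search(row["content"])]
```

Pages whose real text happens to contain a line starting with "CHUNK 3:" would be flagged as well, so treat any hits as candidates to inspect rather than as proof of corruption.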
Expected Behavior
Each database row should contain only one chunk with its own contextual description and original content.
Actual Behavior
Database rows contain mixed content from multiple chunks. Example from actual database:
Introduction and navigation overview of the Anthropic documentation site, including links to key sections such as research, login, support, and developer resources related to Claude Code.
CHUNK 2: Detailed navigation and content outline of the Anthropic Claude Code documentation, covering getting started guides, SDKs, workflows, deployment, and troubleshooting resources to help developers build with Claude Code.\n\n[Anthropic home page... [rest of content continues with actual page content]
The contextual description for CHUNK 2 is incorrectly included in what should be CHUNK 1's database entry.
Error Details (if any)
No error messages - the corruption happens silently during batch processing of contextual embeddings.
Affected Component
🔍 Knowledge Base / RAG
Browser & OS
Chrome on macOS
Additional Context
The issue appears to be in the batch processing logic in contextual_embedding_service.py. When the LLM returns multiple chunk contexts formatted as "CHUNK 1: [context]\nCHUNK 2: [context]", the parsing logic fails to properly separate them, resulting in multiple contexts being stored in single database entries. This completely breaks RAG functionality as vector searches match against corrupted embeddings.
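One possible direction for a fix, as a minimal sketch only: I have not checked how contextual_embedding_service.py actually parses the response, so the function name `split_batch_contexts` and its signature are hypothetical. The idea is to split the reply on the "CHUNK N:" labels and refuse to continue unless exactly one context per chunk comes back, so a malformed reply raises an error instead of being stored silently.

```python
import re

# Label format taken from the example in this report: "CHUNK <n>: <context>".
_CHUNK_LABEL = re.compile(r"^\s*CHUNK\s+(\d+)\s*:\s*", re.IGNORECASE | re.MULTILINE)


def split_batch_contexts(response_text: str, expected: int) -> list[str]:
    """Split one batched LLM reply into exactly `expected` per-chunk contexts.

    Raises ValueError when labels are missing, duplicated, or out of range,
    so the caller can retry or fall back to one request per chunk.
    """
    matches = list(_CHUNK_LABEL.finditer(response_text))
    contexts: dict[int, str] = {}
    for i, match in enumerate(matches):
        # A context runs from the end of its label to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(response_text)
        index = int(match.group(1))
        if index in contexts:
            raise ValueError(f"duplicate label CHUNK {index}")
        contexts[index] = response_text[match.end():end].strip()

    if sorted(contexts) != list(range(1, expected + 1)):
        raise ValueError(
            f"expected labels 1..{expected}, got {sorted(contexts)}"
        )
    return [contexts[i] for i in range(1, expected + 1)]
```

With this shape, each returned context can be paired with its own chunk by position, and a reply with too few or too many labels never reaches the database.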
Service Status (check all that are working)
- [x] 🖥️ Frontend UI (http://localhost:3737)
- [x] ⚙️ Main Server (http://localhost:8181)
- [x] 🔗 MCP Service (localhost:8051)
- [x] 🤖 Agents Service (http://localhost:8052)
- [x] 💾 Supabase Database (connected)
@coleam00 Do you have capacity to look into this one?
Thank you for reporting this @mikenfly