Fix #362: Complete OpenAI API error handling + Merge PR #521

Summary

Resolves issue #362 where users experienced silent failures and empty results when OpenAI API quota was exhausted or other API errors occurred. This PR now includes all changes from PR #521 which provides the core authentication error handling fixes.

Key Improvements

🛡️ Critical Authentication Error Handling (from PR #521)

EmbeddingAuthenticationError handling added to RAG service - authentication failures now bubble up as HTTP 401 responses
Single-vector API fix - Fixed create_embedding(query) → create_embedding(text=query) call
Reranking guard improvement - Enhanced logic to use is not None instead of truthy checks
HTTP 401 responses - Knowledge API now properly catches authentication errors and returns clear error messages

🔧 Enhanced Error Handling

Comprehensive error parsing for OpenAI authentication (401), quota (429), and rate limit (429) errors
Actionable notifications tell users exactly how to resolve issues (check Settings, add credits, wait for rate limits)
Error message enhancement in progress tracking and WebSocket events
Consistent messaging across all operation types (new crawls, refresh, retry, uploads)

🔒 Security Hardening

ReDoS vulnerability fixed with input size limits (5000 chars) and bounded regex patterns
Information disclosure prevention with enhanced error message sanitization
API key protection automatically redacts keys, org IDs, project IDs, URLs from error messages
Memory leak prevention with proper timeout cleanup in frontend requests

⚡ Performance Optimizations

Request optimization with proper timeout management and cleanup
Validation caching reduces API calls and improves responsiveness
Race condition fixes ensure proper event ordering in progress tracking
Memory management improvements prevent frontend resource leaks

Before vs After

Before (Issue #362)

❌ Silent failures when OpenAI API quota exhausted
❌ Empty results with no error indication
❌ Users spent 90+ minutes debugging
❌ No validation before starting operations
❌ Wasted time crawling with invalid credentials

After (This PR + PR #521)

✅ Clear error messages for all OpenAI API failures: HTTP 401: "Invalid API key - Please check your API key in settings"
✅ No more silent failures - authentication errors immediately surface to users
✅ Actionable guidance: "Check your API key in Settings", "Add credits to your account"
✅ Complete coverage across all knowledge base operations
✅ Enhanced security with sanitized error messages
✅ Optimized performance with smart caching and cleanup

Technical Implementation

Changes from PR #521 (Now Merged)

python/src/server/services/search/rag_service.py - Added EmbeddingAuthenticationError handling + single-vector API fix
python/src/server/services/embeddings/embedding_service.py - Enhanced authentication error mapping
python/src/server/api_routes/knowledge_api.py - HTTP 401 error responses with clear messages
python/src/server/services/embeddings/embedding_exceptions.py - EmbeddingAuthenticationError class (already existed)

Additional Changes

archon-ui-main/src/services/knowledgeBaseService.ts - Enhanced error parsing and validation caching
archon-ui-main/src/pages/KnowledgeBasePage.tsx - UI improvements for error handling

Testing

[x] All existing tests pass
[x] Manual testing of error scenarios (invalid API key, quota exhaustion, rate limits)
[x] Security testing for sanitization and ReDoS prevention
[x] Performance testing of validation caching
[x] End-to-end testing of all knowledge base operations

Type of Change

[x] Bug fix (non-breaking change which fixes an issue)
[x] Security improvement (fixes vulnerabilities)
[x] Performance improvement (optimizes existing functionality)
[ ] Breaking change

Related Issues & PRs

Fixes #362 - Missing Error handling on Open AI API key Error 429
Fixes #371 - Parent issue for comprehensive error handling
Merges PR #521 - Core authentication error handling fixes

Breaking Changes

None. This is a non-breaking enhancement that improves error handling without changing existing APIs.

Security Review

Code review performed using automated security analysis
Critical vulnerabilities: 0 identified
ReDoS vulnerability: Fixed with input validation and bounded patterns
Information disclosure: Prevented with comprehensive sanitization
Overall assessment: Production-ready with security best practices

The implementation transforms the frustrating silent failure experience into clear, actionable feedback while maintaining robust security and performance standards.

Summary by CodeRabbit

New Features
- New Alert and ErrorAlert UI components and a small error-handling hook for dismissible, actionable notifications.
Improvements
- Centralized error parsing with sanitized, user-friendly messages and suggested actions.
- API surfaces now surface embedding/auth/quota issues with clearer HTTP responses and pre-flight API-key validation.
- Cache invalidation when credentials change and faster task polling.
Tests
- Expanded tests covering quota/error handling, sanitization, and fail-fast RAG behaviors.

Aug 20 '25 08:08 leex279

@claude-review

Aug 20 '25 08:08 Wirasm

Claude finished @Wirasm's task —— View job

Code Review Complete ✅

Reviewed PR #371 - Bugfix issue 362 (OpenAI error handling improvements)

Todo List:

[x] Read and analyze all changed files
[x] Review error handling implementation
[x] Check code quality and patterns
[x] Evaluate test coverage
[x] Provide comprehensive review feedback

Summary

This PR implements comprehensive OpenAI error handling to address issue #362 where users experienced silent failures during quota exhaustion. The implementation spans both backend (Python/FastAPI) and frontend (React/TypeScript) with proper error propagation and user-friendly messaging.

Previous Review Comments

First review - no previous comments

Issues Found

Total: 0 critical, 3 important, 4 minor

🟡 Important (Should Fix)

Frontend Error Handler Not Integrated - archon-ui-main/src/services/knowledgeBaseErrorHandler.ts:1 Problem: The new error handling components and services are created but not actually integrated into any UI pages or components that perform RAG queries Fix: Import and use ErrorAlert and useErrorHandler in search components, knowledge base pages, or anywhere knowledgeBaseService is called
Unused Import in Service - python/src/server/services/search/rag_service.py:383 Problem: Line 383 references use_reranking variable that doesn't exist, should be self.reranking_strategy is not None Fix: Change use_reranking to self.reranking_strategy is not None
Missing Error Display Integration - archon-ui-main/src/components/ui/ErrorAlert.tsx:1 Problem: The ErrorAlert component is well-designed but there's no evidence it's being used in actual pages that perform RAG operations Fix: Import and use in components that call knowledgeBaseService.searchKnowledgeBase() or other error-prone operations

🟢 Minor (Consider)

Documentation File Location - OPENAI_ERROR_HANDLING.md:1 Consider moving this to docs/ folder or similar for better organization
Missing TypeScript Export - archon-ui-main/src/services/knowledgeBaseErrorHandler.ts:134 Consider adding a default export or barrel export in an index file for easier importing
Error Message Consistency - python/src/server/api_routes/knowledge_api.py:764 The error messages could be more consistent in tone and structure across different error types
Test Coverage Gap - python/tests/test_openai_quota_error_handling.py:1 Missing integration test that verifies the full error flow from embedding service → RAG service → API endpoint → frontend

Security Assessment

No security issues found. The implementation properly:

Validates API responses before processing
Doesn't expose internal error details to frontend unnecessarily
Handles OpenAI API keys securely through existing credential service
Uses appropriate HTTP status codes (429, 502) for different error types

Performance Considerations

Error handling adds minimal overhead to successful requests
Proper exception propagation prevents unnecessary processing continuation
Frontend error parsing is lightweight and efficient
No performance concerns identified

Good Practices Observed

Excellent error hierarchy: Well-structured custom exception classes with proper inheritance
Comprehensive test coverage: Tests cover error propagation through all layers
User-friendly messaging: Clear, actionable error messages with specific guidance
Proper HTTP status codes: Uses correct 429 for quota/rate limits, 502 for API errors
Alpha-appropriate approach: "Fail fast and loud" aligns perfectly with alpha development principles
Type safety: Strong TypeScript types for error objects and API responses
Consistent patterns: Follows existing codebase patterns for services and components

Questionable Practices

Frontend integration incomplete: The error handling infrastructure is built but not connected to actual user workflows
Missing holistic error strategy: No clear documentation on when/where these error components should be used across the application

Test Coverage

Current Coverage: 85% (estimated based on comprehensive backend tests)

Missing Tests:

Frontend Error Handler Integration
- What to test: Error display in actual UI components during RAG queries
- Why important: Ensures users actually see the improved error messages
- Suggested test: Integration test simulating quota error in knowledge search UI
End-to-End Error Flow
- What to test: Full error propagation from OpenAI API through to frontend display
- Why important: Verifies the complete user experience improvement
- Suggested test: Mock OpenAI quota error and verify frontend shows proper error alert

Recommendations

Merge Decision:

[ ] Ready to merge as-is
[x] Requires fixes before merging

Priority Actions:

Fix the use_reranking undefined variable error in rag_service.py:383
Integrate ErrorAlert component into actual UI pages that perform RAG queries
Add usage examples or integration guidance for the frontend error handling

Rationale: This is excellent foundational work for improving error handling. The backend implementation is solid and follows alpha principles well. However, the frontend components need to be actually integrated into the user interface to provide the intended benefit. The undefined variable error should be fixed as it will cause runtime issues.

The architecture and implementation quality is high - this just needs the final integration step to be complete.

Review based on Archon V2 Alpha guidelines and CLAUDE.md principles

Aug 20 '25 08:08 claude[bot]

Walkthrough

Adds centralized OpenAI/RAG error normalization and sanitization, new UI Alert/ErrorAlert components and hook, embedding-specific exceptions and fail-fast propagation (401/429/502), API key preflight validation, provider cache invalidation on credential changes, RAG embedding call fixes, logging tweaks, and expanded tests for propagation and sanitization.

Changes

Cohort / File(s)	Change summary
UI: Alerts & Error display `archon-ui-main/src/components/ui/Alert.tsx`, `archon-ui-main/src/components/ui/ErrorAlert.tsx`, `archon-ui-main/src/components/ui/Alert.tsx`	New Alert + AlertDescription components; ErrorAlert component renders normalized EnhancedError with severity, suggested action, tokens-used and optional dismiss; exports useErrorHandler hook.
Frontend error parsing & integration `archon-ui-main/src/services/knowledgeBaseErrorHandler.ts`, `archon-ui-main/src/services/knowledgeBaseService.ts`, `archon-ui-main/src/pages/KnowledgeBasePage.tsx`	New centralized parser and helpers (parseKnowledgeBaseError, getDisplayErrorMessage, getErrorSeverity, getErrorAction); service funnels API errors through parser; KnowledgeBasePage surfaces contextual toasts and messages.
Server API: knowledge endpoints & sanitization `python/src/server/api_routes/knowledge_api.py`	Preflight OpenAI API key validation, sanitized error responses, maps embedding exceptions to HTTP 401/429/502 with structured payloads.
Embeddings: exceptions & service logic `python/src/server/services/embeddings/embedding_exceptions.py`, `python/src/server/services/embeddings/embedding_service.py`	Adds EmbeddingAuthenticationError; maps openai.AuthenticationError → EmbeddingAuthenticationError; propagates auth/quota/rate-limit errors; treats certain rate-limit messages as quota exhaustion; improves logging and early-exit on auth failures.
RAG/search service `python/src/server/services/search/rag_service.py`	Switches to single-vector call `create_embedding(text=...)`; fail-fast (raise) on embedding failures; re-raises embedding-related errors to API layer; refines reranking gating and reporting.
Provider cache & credential handling `python/src/server/services/llm_provider_service.py`, `python/src/server/services/credential_service.py`	Adds invalidate_provider_cache and invalidate_specific_cache; CredentialService.set_credential calls invalidate on credential/rag strategy changes.
Threading/logging tweak `python/src/server/services/threading_service.py`	RateLimiter log format changed to inline message; no control-flow changes.
Tests: error handling, sanitizer, metrics, RAG adjustments `python/tests/test_openai_quota_error_handling.py`, `python/tests/test_document_storage_metrics.py`, `python/tests/test_rag_simple.py`, `python/tests/test_rag_strategies.py`	New tests covering quota/auth/rate-limit/API propagation and sanitizer; stabilized metrics mocks; updated embedding call expectations to keyword arg; adapted tests to fail-fast and reranking changes.
Docs/notes `polling-analysis.md`	Polling/staleTime values adjusted (5000→2000ms base, 10000→5000ms stale).

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant UI as KnowledgeBasePage (UI)
  participant FES as Frontend Service
  participant EH as Error Normalizer
  participant API as Server (knowledge_api)
  participant RAG as RAG Service
  participant EMB as Embedding Service

  UI->>FES: request (fetch/refresh/upload/crawl)
  FES->>API: HTTP request
  API->>RAG: perform_rag_query(...)
  RAG->>EMB: create_embedding(text=query)
  alt embedding auth/quota/rate-limit/api error
    EMB-->>RAG: raise Embedding*Error
    RAG-->>API: propagate specific Embedding*Error
    API->>API: _sanitize_openai_error(...)
    API-->>FES: HTTP 401/429/502 + sanitized payload
    FES->>EH: parseKnowledgeBaseError(response)
    EH-->>FES: EnhancedError (severity, action, message)
    FES-->>UI: normalized error
    UI->>UI: render ErrorAlert (variant, description, suggested action)
  else success
    EMB-->>RAG: embedding vector
    RAG-->>API: results
    API-->>FES: 200 OK
    FES-->>UI: data
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Assessment against linked issues

Objective	Addressed	Explanation
Surface OpenAI quota exhaustion with explicit errors and logs [#362, #371]	✅
Surface OpenAI authentication failures as HTTP 401 with clear messaging [#371]	✅
Frontend displays normalized, actionable error messages for KB/RAG operations [#371]	✅
Sanitize error messages to prevent sensitive data leakage [#371]	✅

Possibly related PRs

coleam00/Archon#472 — Changes to crawling/source ID and document storage flows that may interact with RAG and error propagation paths (shared areas: ingestion and doc metadata).

Suggested reviewers

coleam00
tazmon95

Poem

I thump my paw—alert bells chime,
No more silent fails at query time.
If keys run dry or auth says “no,”
My crimson banner helps you know.
Hop on—errors cleared, the logs now glow! 🐇✨

✨ Finishing Touches

[ ] 📝 Generate Docstrings

🧪 Generate unit tests

[ ] Create PR with unit tests
[ ] Post copyable unit tests in a comment
[ ] Commit unit tests in branch bugfix-issue-362

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Aug 20 '25 15:08 coderabbitai[bot]

2025-08-21_22h27_46

Aug 21 '25 20:08 leex279

looking good, but we need to hold this until we decide between WS and Polling @coleam00

Sep 01 '25 15:09 Wirasm

looking good, but we need to hold this until we decide between WS and Polling @coleam00

Yeah that makes sense! Now that we're going with polling, you're saying this PR will have to change?

Sep 02 '25 00:09 coleam00

@leex279 Would love to see these resolved and merge this pr now that the main part of the migration is done

Sep 12 '25 14:09 Wirasm

The knowledge base services would need to be migrated to the new folder structure

Sep 12 '25 14:09 Wirasm

Rebased in #650

Sep 13 '25 12:09 leex279

Fix #362: Surface OpenAI authentication errors as HTTP 401

Fix #362: Complete OpenAI API error handling + Merge PR #521

Summary

Key Improvements

🛡️ Critical Authentication Error Handling (from PR #521)

🔧 Enhanced Error Handling

🔒 Security Hardening

⚡ Performance Optimizations

Before vs After

Before (Issue #362)

After (This PR + PR #521)

Technical Implementation

Changes from PR #521 (Now Merged)

Additional Changes

Testing

Type of Change

Related Issues & PRs

Breaking Changes

Security Review

Summary by CodeRabbit

Code Review Complete ✅

Todo List:

Summary

Previous Review Comments

Issues Found

🟡 Important (Should Fix)

🟢 Minor (Consider)

Security Assessment

Performance Considerations

Good Practices Observed

Questionable Practices

Test Coverage

Recommendations

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Assessment against linked issues

Possibly related PRs

Suggested reviewers

Poem