Fix #362: Surface OpenAI authentication errors as HTTP 401
Fix #362: Complete OpenAI API error handling + Merge PR #521
Summary
Resolves issue #362 where users experienced silent failures and empty results when OpenAI API quota was exhausted or other API errors occurred. This PR now includes all changes from PR #521 which provides the core authentication error handling fixes.
Key Improvements
π‘οΈ Critical Authentication Error Handling (from PR #521)
- EmbeddingAuthenticationError handling added to RAG service - authentication failures now bubble up as HTTP 401 responses
- Single-vector API fix - Fixed
create_embedding(query)βcreate_embedding(text=query)call - Reranking guard improvement - Enhanced logic to use
is not Noneinstead of truthy checks - HTTP 401 responses - Knowledge API now properly catches authentication errors and returns clear error messages
π§ Enhanced Error Handling
- Comprehensive error parsing for OpenAI authentication (401), quota (429), and rate limit (429) errors
- Actionable notifications tell users exactly how to resolve issues (check Settings, add credits, wait for rate limits)
- Error message enhancement in progress tracking and WebSocket events
- Consistent messaging across all operation types (new crawls, refresh, retry, uploads)
π Security Hardening
- ReDoS vulnerability fixed with input size limits (5000 chars) and bounded regex patterns
- Information disclosure prevention with enhanced error message sanitization
- API key protection automatically redacts keys, org IDs, project IDs, URLs from error messages
- Memory leak prevention with proper timeout cleanup in frontend requests
β‘ Performance Optimizations
- Request optimization with proper timeout management and cleanup
- Validation caching reduces API calls and improves responsiveness
- Race condition fixes ensure proper event ordering in progress tracking
- Memory management improvements prevent frontend resource leaks
Before vs After
Before (Issue #362)
- β Silent failures when OpenAI API quota exhausted
- β Empty results with no error indication
- β Users spent 90+ minutes debugging
- β No validation before starting operations
- β Wasted time crawling with invalid credentials
After (This PR + PR #521)
- β
Clear error messages for all OpenAI API failures:
HTTP 401: "Invalid API key - Please check your API key in settings" - β No more silent failures - authentication errors immediately surface to users
- β Actionable guidance: "Check your API key in Settings", "Add credits to your account"
- β Complete coverage across all knowledge base operations
- β Enhanced security with sanitized error messages
- β Optimized performance with smart caching and cleanup
Technical Implementation
Changes from PR #521 (Now Merged)
python/src/server/services/search/rag_service.py- Added EmbeddingAuthenticationError handling + single-vector API fixpython/src/server/services/embeddings/embedding_service.py- Enhanced authentication error mappingpython/src/server/api_routes/knowledge_api.py- HTTP 401 error responses with clear messagespython/src/server/services/embeddings/embedding_exceptions.py- EmbeddingAuthenticationError class (already existed)
Additional Changes
archon-ui-main/src/services/knowledgeBaseService.ts- Enhanced error parsing and validation cachingarchon-ui-main/src/pages/KnowledgeBasePage.tsx- UI improvements for error handling
Testing
- [x] All existing tests pass
- [x] Manual testing of error scenarios (invalid API key, quota exhaustion, rate limits)
- [x] Security testing for sanitization and ReDoS prevention
- [x] Performance testing of validation caching
- [x] End-to-end testing of all knowledge base operations
Type of Change
- [x] Bug fix (non-breaking change which fixes an issue)
- [x] Security improvement (fixes vulnerabilities)
- [x] Performance improvement (optimizes existing functionality)
- [ ] Breaking change
Related Issues & PRs
- Fixes #362 - Missing Error handling on Open AI API key Error 429
- Fixes #371 - Parent issue for comprehensive error handling
- Merges PR #521 - Core authentication error handling fixes
Breaking Changes
None. This is a non-breaking enhancement that improves error handling without changing existing APIs.
Security Review
- Code review performed using automated security analysis
- Critical vulnerabilities: 0 identified
- ReDoS vulnerability: Fixed with input validation and bounded patterns
- Information disclosure: Prevented with comprehensive sanitization
- Overall assessment: Production-ready with security best practices
The implementation transforms the frustrating silent failure experience into clear, actionable feedback while maintaining robust security and performance standards.
Summary by CodeRabbit
-
New Features
- New Alert and ErrorAlert UI components and a small error-handling hook for dismissible, actionable notifications.
-
Improvements
- Centralized error parsing with sanitized, user-friendly messages and suggested actions.
- API surfaces now surface embedding/auth/quota issues with clearer HTTP responses and pre-flight API-key validation.
- Cache invalidation when credentials change and faster task polling.
-
Tests
- Expanded tests covering quota/error handling, sanitization, and fail-fast RAG behaviors.
@claude-review
Claude finished @Wirasm's task ββ View job
Code Review Complete β
Reviewed PR #371 - Bugfix issue 362 (OpenAI error handling improvements)
Todo List:
- [x] Read and analyze all changed files
- [x] Review error handling implementation
- [x] Check code quality and patterns
- [x] Evaluate test coverage
- [x] Provide comprehensive review feedback
Summary
This PR implements comprehensive OpenAI error handling to address issue #362 where users experienced silent failures during quota exhaustion. The implementation spans both backend (Python/FastAPI) and frontend (React/TypeScript) with proper error propagation and user-friendly messaging.
Previous Review Comments
First review - no previous comments
Issues Found
Total: 0 critical, 3 important, 4 minor
π‘ Important (Should Fix)
-
Frontend Error Handler Not Integrated -
archon-ui-main/src/services/knowledgeBaseErrorHandler.ts:1Problem: The new error handling components and services are created but not actually integrated into any UI pages or components that perform RAG queries Fix: Import and useErrorAlertanduseErrorHandlerin search components, knowledge base pages, or anywhereknowledgeBaseServiceis called -
Unused Import in Service -
python/src/server/services/search/rag_service.py:383Problem: Line 383 referencesuse_rerankingvariable that doesn't exist, should beself.reranking_strategy is not NoneFix: Changeuse_rerankingtoself.reranking_strategy is not None -
Missing Error Display Integration -
archon-ui-main/src/components/ui/ErrorAlert.tsx:1Problem: The ErrorAlert component is well-designed but there's no evidence it's being used in actual pages that perform RAG operations Fix: Import and use in components that callknowledgeBaseService.searchKnowledgeBase()or other error-prone operations
π’ Minor (Consider)
-
Documentation File Location -
OPENAI_ERROR_HANDLING.md:1Consider moving this todocs/folder or similar for better organization -
Missing TypeScript Export -
archon-ui-main/src/services/knowledgeBaseErrorHandler.ts:134Consider adding a default export or barrel export in an index file for easier importing -
Error Message Consistency -
python/src/server/api_routes/knowledge_api.py:764The error messages could be more consistent in tone and structure across different error types -
Test Coverage Gap -
python/tests/test_openai_quota_error_handling.py:1Missing integration test that verifies the full error flow from embedding service β RAG service β API endpoint β frontend
Security Assessment
No security issues found. The implementation properly:
- Validates API responses before processing
- Doesn't expose internal error details to frontend unnecessarily
- Handles OpenAI API keys securely through existing credential service
- Uses appropriate HTTP status codes (429, 502) for different error types
Performance Considerations
- Error handling adds minimal overhead to successful requests
- Proper exception propagation prevents unnecessary processing continuation
- Frontend error parsing is lightweight and efficient
- No performance concerns identified
Good Practices Observed
- Excellent error hierarchy: Well-structured custom exception classes with proper inheritance
- Comprehensive test coverage: Tests cover error propagation through all layers
- User-friendly messaging: Clear, actionable error messages with specific guidance
- Proper HTTP status codes: Uses correct 429 for quota/rate limits, 502 for API errors
- Alpha-appropriate approach: "Fail fast and loud" aligns perfectly with alpha development principles
- Type safety: Strong TypeScript types for error objects and API responses
- Consistent patterns: Follows existing codebase patterns for services and components
Questionable Practices
- Frontend integration incomplete: The error handling infrastructure is built but not connected to actual user workflows
- Missing holistic error strategy: No clear documentation on when/where these error components should be used across the application
Test Coverage
Current Coverage: 85% (estimated based on comprehensive backend tests)
Missing Tests:
-
Frontend Error Handler Integration
- What to test: Error display in actual UI components during RAG queries
- Why important: Ensures users actually see the improved error messages
- Suggested test: Integration test simulating quota error in knowledge search UI
-
End-to-End Error Flow
- What to test: Full error propagation from OpenAI API through to frontend display
- Why important: Verifies the complete user experience improvement
- Suggested test: Mock OpenAI quota error and verify frontend shows proper error alert
Recommendations
Merge Decision:
- [ ] Ready to merge as-is
- [x] Requires fixes before merging
Priority Actions:
- Fix the
use_rerankingundefined variable error in rag_service.py:383 - Integrate ErrorAlert component into actual UI pages that perform RAG queries
- Add usage examples or integration guidance for the frontend error handling
Rationale: This is excellent foundational work for improving error handling. The backend implementation is solid and follows alpha principles well. However, the frontend components need to be actually integrated into the user interface to provide the intended benefit. The undefined variable error should be fixed as it will cause runtime issues.
The architecture and implementation quality is high - this just needs the final integration step to be complete.
Review based on Archon V2 Alpha guidelines and CLAUDE.md principles
Walkthrough
Adds centralized OpenAI/RAG error normalization and sanitization, new UI Alert/ErrorAlert components and hook, embedding-specific exceptions and fail-fast propagation (401/429/502), API key preflight validation, provider cache invalidation on credential changes, RAG embedding call fixes, logging tweaks, and expanded tests for propagation and sanitization.
Changes
| Cohort / File(s) | Change summary |
|---|---|
UI: Alerts & Error displayarchon-ui-main/src/components/ui/Alert.tsx, archon-ui-main/src/components/ui/ErrorAlert.tsx, archon-ui-main/src/components/ui/Alert.tsx |
New Alert + AlertDescription components; ErrorAlert component renders normalized EnhancedError with severity, suggested action, tokens-used and optional dismiss; exports useErrorHandler hook. |
Frontend error parsing & integrationarchon-ui-main/src/services/knowledgeBaseErrorHandler.ts, archon-ui-main/src/services/knowledgeBaseService.ts, archon-ui-main/src/pages/KnowledgeBasePage.tsx |
New centralized parser and helpers (parseKnowledgeBaseError, getDisplayErrorMessage, getErrorSeverity, getErrorAction); service funnels API errors through parser; KnowledgeBasePage surfaces contextual toasts and messages. |
Server API: knowledge endpoints & sanitizationpython/src/server/api_routes/knowledge_api.py |
Preflight OpenAI API key validation, sanitized error responses, maps embedding exceptions to HTTP 401/429/502 with structured payloads. |
Embeddings: exceptions & service logicpython/src/server/services/embeddings/embedding_exceptions.py, python/src/server/services/embeddings/embedding_service.py |
Adds EmbeddingAuthenticationError; maps openai.AuthenticationError β EmbeddingAuthenticationError; propagates auth/quota/rate-limit errors; treats certain rate-limit messages as quota exhaustion; improves logging and early-exit on auth failures. |
RAG/search servicepython/src/server/services/search/rag_service.py |
Switches to single-vector call create_embedding(text=...); fail-fast (raise) on embedding failures; re-raises embedding-related errors to API layer; refines reranking gating and reporting. |
Provider cache & credential handlingpython/src/server/services/llm_provider_service.py, python/src/server/services/credential_service.py |
Adds invalidate_provider_cache and invalidate_specific_cache; CredentialService.set_credential calls invalidate on credential/rag strategy changes. |
Threading/logging tweakpython/src/server/services/threading_service.py |
RateLimiter log format changed to inline message; no control-flow changes. |
Tests: error handling, sanitizer, metrics, RAG adjustmentspython/tests/test_openai_quota_error_handling.py, python/tests/test_document_storage_metrics.py, python/tests/test_rag_simple.py, python/tests/test_rag_strategies.py |
New tests covering quota/auth/rate-limit/API propagation and sanitizer; stabilized metrics mocks; updated embedding call expectations to keyword arg; adapted tests to fail-fast and reranking changes. |
Docs/notespolling-analysis.md |
Polling/staleTime values adjusted (5000β2000ms base, 10000β5000ms stale). |
Sequence Diagram(s)
sequenceDiagram
autonumber
participant UI as KnowledgeBasePage (UI)
participant FES as Frontend Service
participant EH as Error Normalizer
participant API as Server (knowledge_api)
participant RAG as RAG Service
participant EMB as Embedding Service
UI->>FES: request (fetch/refresh/upload/crawl)
FES->>API: HTTP request
API->>RAG: perform_rag_query(...)
RAG->>EMB: create_embedding(text=query)
alt embedding auth/quota/rate-limit/api error
EMB-->>RAG: raise Embedding*Error
RAG-->>API: propagate specific Embedding*Error
API->>API: _sanitize_openai_error(...)
API-->>FES: HTTP 401/429/502 + sanitized payload
FES->>EH: parseKnowledgeBaseError(response)
EH-->>FES: EnhancedError (severity, action, message)
FES-->>UI: normalized error
UI->>UI: render ErrorAlert (variant, description, suggested action)
else success
EMB-->>RAG: embedding vector
RAG-->>API: results
API-->>FES: 200 OK
FES-->>UI: data
end
Estimated code review effort
π― 4 (Complex) | β±οΈ ~75 minutes
Assessment against linked issues
| Objective | Addressed | Explanation |
|---|---|---|
| Surface OpenAI quota exhaustion with explicit errors and logs [#362, #371] | β | |
| Surface OpenAI authentication failures as HTTP 401 with clear messaging [#371] | β | |
| Frontend displays normalized, actionable error messages for KB/RAG operations [#371] | β | |
| Sanitize error messages to prevent sensitive data leakage [#371] | β |
Possibly related PRs
- coleam00/Archon#472 β Changes to crawling/source ID and document storage flows that may interact with RAG and error propagation paths (shared areas: ingestion and doc metadata).
Suggested reviewers
- coleam00
- tazmon95
Poem
I thump my pawβalert bells chime,
No more silent fails at query time.
If keys run dry or auth says βno,β
My crimson banner helps you know.
Hop onβerrors cleared, the logs now glow! πβ¨
β¨ Finishing Touches
- [ ] π Generate Docstrings
π§ͺ Generate unit tests
- [ ] Create PR with unit tests
- [ ] Post copyable unit tests in a comment
- [ ] Commit unit tests in branch
bugfix-issue-362
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Comment @coderabbitai help to get the list of available commands and usage tips.
looking good, but we need to hold this until we decide between WS and Polling @coleam00
looking good, but we need to hold this until we decide between WS and Polling @coleam00
Yeah that makes sense! Now that we're going with polling, you're saying this PR will have to change?
@leex279 Would love to see these resolved and merge this pr now that the main part of the migration is done
The knowledge base services would need to be migrated to the new folder structure
Rebased in #650