Archon icon indicating copy to clipboard operation
Archon copied to clipboard

Fix #362: Surface OpenAI authentication errors as HTTP 401

Open leex279 opened this issue 4 months ago β€’ 6 comments

Fix #362: Complete OpenAI API error handling + Merge PR #521

Summary

Resolves issue #362 where users experienced silent failures and empty results when OpenAI API quota was exhausted or other API errors occurred. This PR now includes all changes from PR #521 which provides the core authentication error handling fixes.

Key Improvements

πŸ›‘οΈ Critical Authentication Error Handling (from PR #521)

  • EmbeddingAuthenticationError handling added to RAG service - authentication failures now bubble up as HTTP 401 responses
  • Single-vector API fix - Fixed create_embedding(query) β†’ create_embedding(text=query) call
  • Reranking guard improvement - Enhanced logic to use is not None instead of truthy checks
  • HTTP 401 responses - Knowledge API now properly catches authentication errors and returns clear error messages

πŸ”§ Enhanced Error Handling

  • Comprehensive error parsing for OpenAI authentication (401), quota (429), and rate limit (429) errors
  • Actionable notifications tell users exactly how to resolve issues (check Settings, add credits, wait for rate limits)
  • Error message enhancement in progress tracking and WebSocket events
  • Consistent messaging across all operation types (new crawls, refresh, retry, uploads)

πŸ”’ Security Hardening

  • ReDoS vulnerability fixed with input size limits (5000 chars) and bounded regex patterns
  • Information disclosure prevention with enhanced error message sanitization
  • API key protection automatically redacts keys, org IDs, project IDs, URLs from error messages
  • Memory leak prevention with proper timeout cleanup in frontend requests

⚑ Performance Optimizations

  • Request optimization with proper timeout management and cleanup
  • Validation caching reduces API calls and improves responsiveness
  • Race condition fixes ensure proper event ordering in progress tracking
  • Memory management improvements prevent frontend resource leaks

Before vs After

Before (Issue #362)

  • ❌ Silent failures when OpenAI API quota exhausted
  • ❌ Empty results with no error indication
  • ❌ Users spent 90+ minutes debugging
  • ❌ No validation before starting operations
  • ❌ Wasted time crawling with invalid credentials

After (This PR + PR #521)

  • βœ… Clear error messages for all OpenAI API failures: HTTP 401: "Invalid API key - Please check your API key in settings"
  • βœ… No more silent failures - authentication errors immediately surface to users
  • βœ… Actionable guidance: "Check your API key in Settings", "Add credits to your account"
  • βœ… Complete coverage across all knowledge base operations
  • βœ… Enhanced security with sanitized error messages
  • βœ… Optimized performance with smart caching and cleanup

Technical Implementation

Changes from PR #521 (Now Merged)

  • python/src/server/services/search/rag_service.py - Added EmbeddingAuthenticationError handling + single-vector API fix
  • python/src/server/services/embeddings/embedding_service.py - Enhanced authentication error mapping
  • python/src/server/api_routes/knowledge_api.py - HTTP 401 error responses with clear messages
  • python/src/server/services/embeddings/embedding_exceptions.py - EmbeddingAuthenticationError class (already existed)

Additional Changes

  • archon-ui-main/src/services/knowledgeBaseService.ts - Enhanced error parsing and validation caching
  • archon-ui-main/src/pages/KnowledgeBasePage.tsx - UI improvements for error handling

Testing

  • [x] All existing tests pass
  • [x] Manual testing of error scenarios (invalid API key, quota exhaustion, rate limits)
  • [x] Security testing for sanitization and ReDoS prevention
  • [x] Performance testing of validation caching
  • [x] End-to-end testing of all knowledge base operations

Type of Change

  • [x] Bug fix (non-breaking change which fixes an issue)
  • [x] Security improvement (fixes vulnerabilities)
  • [x] Performance improvement (optimizes existing functionality)
  • [ ] Breaking change

Related Issues & PRs

  • Fixes #362 - Missing Error handling on Open AI API key Error 429
  • Fixes #371 - Parent issue for comprehensive error handling
  • Merges PR #521 - Core authentication error handling fixes

Breaking Changes

None. This is a non-breaking enhancement that improves error handling without changing existing APIs.

Security Review

  • Code review performed using automated security analysis
  • Critical vulnerabilities: 0 identified
  • ReDoS vulnerability: Fixed with input validation and bounded patterns
  • Information disclosure: Prevented with comprehensive sanitization
  • Overall assessment: Production-ready with security best practices

The implementation transforms the frustrating silent failure experience into clear, actionable feedback while maintaining robust security and performance standards.

Summary by CodeRabbit

  • New Features

    • New Alert and ErrorAlert UI components and a small error-handling hook for dismissible, actionable notifications.
  • Improvements

    • Centralized error parsing with sanitized, user-friendly messages and suggested actions.
    • API surfaces now surface embedding/auth/quota issues with clearer HTTP responses and pre-flight API-key validation.
    • Cache invalidation when credentials change and faster task polling.
  • Tests

    • Expanded tests covering quota/error handling, sanitization, and fail-fast RAG behaviors.

leex279 avatar Aug 20 '25 08:08 leex279

@claude-review

Wirasm avatar Aug 20 '25 08:08 Wirasm

Claude finished @Wirasm's task β€”β€” View job


Code Review Complete βœ…

Reviewed PR #371 - Bugfix issue 362 (OpenAI error handling improvements)

Todo List:

  • [x] Read and analyze all changed files
  • [x] Review error handling implementation
  • [x] Check code quality and patterns
  • [x] Evaluate test coverage
  • [x] Provide comprehensive review feedback

Summary

This PR implements comprehensive OpenAI error handling to address issue #362 where users experienced silent failures during quota exhaustion. The implementation spans both backend (Python/FastAPI) and frontend (React/TypeScript) with proper error propagation and user-friendly messaging.

Previous Review Comments

First review - no previous comments

Issues Found

Total: 0 critical, 3 important, 4 minor

🟑 Important (Should Fix)

  • Frontend Error Handler Not Integrated - archon-ui-main/src/services/knowledgeBaseErrorHandler.ts:1 Problem: The new error handling components and services are created but not actually integrated into any UI pages or components that perform RAG queries Fix: Import and use ErrorAlert and useErrorHandler in search components, knowledge base pages, or anywhere knowledgeBaseService is called

  • Unused Import in Service - python/src/server/services/search/rag_service.py:383 Problem: Line 383 references use_reranking variable that doesn't exist, should be self.reranking_strategy is not None Fix: Change use_reranking to self.reranking_strategy is not None

  • Missing Error Display Integration - archon-ui-main/src/components/ui/ErrorAlert.tsx:1 Problem: The ErrorAlert component is well-designed but there's no evidence it's being used in actual pages that perform RAG operations Fix: Import and use in components that call knowledgeBaseService.searchKnowledgeBase() or other error-prone operations

🟒 Minor (Consider)

  • Documentation File Location - OPENAI_ERROR_HANDLING.md:1 Consider moving this to docs/ folder or similar for better organization

  • Missing TypeScript Export - archon-ui-main/src/services/knowledgeBaseErrorHandler.ts:134 Consider adding a default export or barrel export in an index file for easier importing

  • Error Message Consistency - python/src/server/api_routes/knowledge_api.py:764 The error messages could be more consistent in tone and structure across different error types

  • Test Coverage Gap - python/tests/test_openai_quota_error_handling.py:1 Missing integration test that verifies the full error flow from embedding service β†’ RAG service β†’ API endpoint β†’ frontend

Security Assessment

No security issues found. The implementation properly:

  • Validates API responses before processing
  • Doesn't expose internal error details to frontend unnecessarily
  • Handles OpenAI API keys securely through existing credential service
  • Uses appropriate HTTP status codes (429, 502) for different error types

Performance Considerations

  • Error handling adds minimal overhead to successful requests
  • Proper exception propagation prevents unnecessary processing continuation
  • Frontend error parsing is lightweight and efficient
  • No performance concerns identified

Good Practices Observed

  • Excellent error hierarchy: Well-structured custom exception classes with proper inheritance
  • Comprehensive test coverage: Tests cover error propagation through all layers
  • User-friendly messaging: Clear, actionable error messages with specific guidance
  • Proper HTTP status codes: Uses correct 429 for quota/rate limits, 502 for API errors
  • Alpha-appropriate approach: "Fail fast and loud" aligns perfectly with alpha development principles
  • Type safety: Strong TypeScript types for error objects and API responses
  • Consistent patterns: Follows existing codebase patterns for services and components

Questionable Practices

  • Frontend integration incomplete: The error handling infrastructure is built but not connected to actual user workflows
  • Missing holistic error strategy: No clear documentation on when/where these error components should be used across the application

Test Coverage

Current Coverage: 85% (estimated based on comprehensive backend tests)

Missing Tests:

  1. Frontend Error Handler Integration

    • What to test: Error display in actual UI components during RAG queries
    • Why important: Ensures users actually see the improved error messages
    • Suggested test: Integration test simulating quota error in knowledge search UI
  2. End-to-End Error Flow

    • What to test: Full error propagation from OpenAI API through to frontend display
    • Why important: Verifies the complete user experience improvement
    • Suggested test: Mock OpenAI quota error and verify frontend shows proper error alert

Recommendations

Merge Decision:

  • [ ] Ready to merge as-is
  • [x] Requires fixes before merging

Priority Actions:

  1. Fix the use_reranking undefined variable error in rag_service.py:383
  2. Integrate ErrorAlert component into actual UI pages that perform RAG queries
  3. Add usage examples or integration guidance for the frontend error handling

Rationale: This is excellent foundational work for improving error handling. The backend implementation is solid and follows alpha principles well. However, the frontend components need to be actually integrated into the user interface to provide the intended benefit. The undefined variable error should be fixed as it will cause runtime issues.

The architecture and implementation quality is high - this just needs the final integration step to be complete.


Review based on Archon V2 Alpha guidelines and CLAUDE.md principles

claude[bot] avatar Aug 20 '25 08:08 claude[bot]

Walkthrough

Adds centralized OpenAI/RAG error normalization and sanitization, new UI Alert/ErrorAlert components and hook, embedding-specific exceptions and fail-fast propagation (401/429/502), API key preflight validation, provider cache invalidation on credential changes, RAG embedding call fixes, logging tweaks, and expanded tests for propagation and sanitization.

Changes

Cohort / File(s) Change summary
UI: Alerts & Error display
archon-ui-main/src/components/ui/Alert.tsx, archon-ui-main/src/components/ui/ErrorAlert.tsx, archon-ui-main/src/components/ui/Alert.tsx
New Alert + AlertDescription components; ErrorAlert component renders normalized EnhancedError with severity, suggested action, tokens-used and optional dismiss; exports useErrorHandler hook.
Frontend error parsing & integration
archon-ui-main/src/services/knowledgeBaseErrorHandler.ts, archon-ui-main/src/services/knowledgeBaseService.ts, archon-ui-main/src/pages/KnowledgeBasePage.tsx
New centralized parser and helpers (parseKnowledgeBaseError, getDisplayErrorMessage, getErrorSeverity, getErrorAction); service funnels API errors through parser; KnowledgeBasePage surfaces contextual toasts and messages.
Server API: knowledge endpoints & sanitization
python/src/server/api_routes/knowledge_api.py
Preflight OpenAI API key validation, sanitized error responses, maps embedding exceptions to HTTP 401/429/502 with structured payloads.
Embeddings: exceptions & service logic
python/src/server/services/embeddings/embedding_exceptions.py, python/src/server/services/embeddings/embedding_service.py
Adds EmbeddingAuthenticationError; maps openai.AuthenticationError β†’ EmbeddingAuthenticationError; propagates auth/quota/rate-limit errors; treats certain rate-limit messages as quota exhaustion; improves logging and early-exit on auth failures.
RAG/search service
python/src/server/services/search/rag_service.py
Switches to single-vector call create_embedding(text=...); fail-fast (raise) on embedding failures; re-raises embedding-related errors to API layer; refines reranking gating and reporting.
Provider cache & credential handling
python/src/server/services/llm_provider_service.py, python/src/server/services/credential_service.py
Adds invalidate_provider_cache and invalidate_specific_cache; CredentialService.set_credential calls invalidate on credential/rag strategy changes.
Threading/logging tweak
python/src/server/services/threading_service.py
RateLimiter log format changed to inline message; no control-flow changes.
Tests: error handling, sanitizer, metrics, RAG adjustments
python/tests/test_openai_quota_error_handling.py, python/tests/test_document_storage_metrics.py, python/tests/test_rag_simple.py, python/tests/test_rag_strategies.py
New tests covering quota/auth/rate-limit/API propagation and sanitizer; stabilized metrics mocks; updated embedding call expectations to keyword arg; adapted tests to fail-fast and reranking changes.
Docs/notes
polling-analysis.md
Polling/staleTime values adjusted (5000β†’2000ms base, 10000β†’5000ms stale).

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant UI as KnowledgeBasePage (UI)
  participant FES as Frontend Service
  participant EH as Error Normalizer
  participant API as Server (knowledge_api)
  participant RAG as RAG Service
  participant EMB as Embedding Service

  UI->>FES: request (fetch/refresh/upload/crawl)
  FES->>API: HTTP request
  API->>RAG: perform_rag_query(...)
  RAG->>EMB: create_embedding(text=query)
  alt embedding auth/quota/rate-limit/api error
    EMB-->>RAG: raise Embedding*Error
    RAG-->>API: propagate specific Embedding*Error
    API->>API: _sanitize_openai_error(...)
    API-->>FES: HTTP 401/429/502 + sanitized payload
    FES->>EH: parseKnowledgeBaseError(response)
    EH-->>FES: EnhancedError (severity, action, message)
    FES-->>UI: normalized error
    UI->>UI: render ErrorAlert (variant, description, suggested action)
  else success
    EMB-->>RAG: embedding vector
    RAG-->>API: results
    API-->>FES: 200 OK
    FES-->>UI: data
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Assessment against linked issues

Objective Addressed Explanation
Surface OpenAI quota exhaustion with explicit errors and logs [#362, #371] βœ…
Surface OpenAI authentication failures as HTTP 401 with clear messaging [#371] βœ…
Frontend displays normalized, actionable error messages for KB/RAG operations [#371] βœ…
Sanitize error messages to prevent sensitive data leakage [#371] βœ…

Possibly related PRs

  • coleam00/Archon#472 β€” Changes to crawling/source ID and document storage flows that may interact with RAG and error propagation paths (shared areas: ingestion and doc metadata).

Suggested reviewers

  • coleam00
  • tazmon95

Poem

I thump my pawβ€”alert bells chime,
No more silent fails at query time.
If keys run dry or auth says β€œno,”
My crimson banner helps you know.
Hop onβ€”errors cleared, the logs now glow! πŸ‡βœ¨

✨ Finishing Touches
  • [ ] πŸ“ Generate Docstrings
πŸ§ͺ Generate unit tests
  • [ ] Create PR with unit tests
  • [ ] Post copyable unit tests in a comment
  • [ ] Commit unit tests in branch bugfix-issue-362

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❀️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot] avatar Aug 20 '25 15:08 coderabbitai[bot]

2025-08-21_22h27_46

leex279 avatar Aug 21 '25 20:08 leex279

looking good, but we need to hold this until we decide between WS and Polling @coleam00

Wirasm avatar Sep 01 '25 15:09 Wirasm

looking good, but we need to hold this until we decide between WS and Polling @coleam00

Yeah that makes sense! Now that we're going with polling, you're saying this PR will have to change?

coleam00 avatar Sep 02 '25 00:09 coleam00

@leex279 Would love to see these resolved and merge this pr now that the main part of the migration is done

Wirasm avatar Sep 12 '25 14:09 Wirasm

The knowledge base services would need to be migrated to the new folder structure

Wirasm avatar Sep 12 '25 14:09 Wirasm

Rebased in #650

leex279 avatar Sep 13 '25 12:09 leex279