helicone icon indicating copy to clipboard operation
helicone copied to clipboard

feat: [ENG-3548] Durable Object fallback for AI Gateway provider keys

Open connortbot opened this issue 3 months ago • 3 comments

https://linear.app/helicone/issue/ENG-3548/fallback-for-keys-when-kvcache-dies Affects Jawn + Worker

Context

In the past we've experienced times when KVCache would fail. Given the traffic of the AI Gateway these days and the way its trending, if this ever happens again our Postgres will die. This adds another layer of caching for provider keys by falling back to a Durable Object namespace built for this use case.

connortbot avatar Nov 18 '25 01:11 connortbot

The latest updates on your projects. Learn more about Vercel for GitHub.

3 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
helicone Skipped Skipped Nov 18, 2025 1:47am
helicone-bifrost Skipped Skipped Nov 18, 2025 1:47am
helicone-eu Skipped Skipped Nov 18, 2025 1:47am

vercel[bot] avatar Nov 18 '25 01:11 vercel[bot]

Claude finished @connortbot's task —— View job


PR Review: Durable Object fallback for AI Gateway provider keys

Todo List:

  • [x] Read repository guidelines (CLAUDE.md)
  • [x] Analyze new Durable Object implementation
  • [x] Review Jawn backend changes
  • [x] Review Worker integration changes
  • [x] Check error handling and fallback logic
  • [x] Assess security implications
  • [x] Evaluate performance impact
  • [x] Provide comprehensive feedback with confidence score

Score: 6/10 - Minor issues and improvements suggested, merge with consideration

Critical Issues & Security Concerns:

1. Type Safety Issue in ProviderKeyCache worker/src/lib/durable-objects/ProviderKeyCache.ts:156

provider: row.provider as any, // Cast from string to ModelProviderName

Problem: Unsafe type casting without validation could lead to runtime errors if invalid provider names are stored.

Suggestion: Add validation:

private validateProvider(provider: string): ModelProviderName {
  const validProviders = ['openai', 'anthropic', 'azure']; // Add all valid providers
  if (!validProviders.includes(provider)) {
    throw new Error(`Invalid provider: ${provider}`);
  }
  return provider as ModelProviderName;
}

// Then use: provider: this.validateProvider(row.provider)

2. Missing Error Boundaries in ProviderKeysManager worker/src/lib/managers/ProviderKeysManager.ts:118-133

Problem: If the Durable Object binding is undefined or DO operations fail, the system may not gracefully degrade.

Suggestion: Add better null checks and error handling:

if (kvFailed && this.env.PROVIDER_KEY_CACHE) {
  try {
    const doId = this.env.PROVIDER_KEY_CACHE.idFromName(orgId);
    const doStub: any = this.env.PROVIDER_KEY_CACHE.get(doId);
    
    // Add timeout and better error handling
    const doKeys: ProviderKey[] | null = await Promise.race([
      doStub.getProviderKeys(orgId),
      new Promise((_, reject) => setTimeout(() => reject(new Error('DO timeout')), 5000))
    ]);
    // ... rest of logic
  } catch (error) {
    console.error("DO cache failed:", error);
    // Consider adding metrics/alerting here
  }
}

Performance & Optimization Issues:

3. Inefficient Cache Update Strategy worker/src/lib/managers/ProviderKeysManager.ts:157-183

Problem: The KV cache update logic fetches existing keys and appends new ones, which could lead to duplicates and bloated cache.

Suggestion: Replace the cache entirely instead of appending:

// Instead of appending, replace the entire cache
await storeInCache(
  `provider_keys_${orgId}`,
  JSON.stringify(keysFromDb), // Use fresh data from DB
  this.env,
  43200
);

4. Database Schema Optimization worker/src/lib/durable-objects/ProviderKeyCache.ts:26-50

Problem: The schema creates a composite primary key that might not be optimal for the query patterns.

Suggestion: Consider adding a dedicated primary key:

CREATE TABLE IF NOT EXISTS provider_keys (
  id TEXT PRIMARY KEY DEFAULT (hex(randomblob(16))),
  org_id TEXT NOT NULL,
  provider TEXT NOT NULL,
  -- ... other fields
  UNIQUE(org_id, provider, cuid)
);

Code Quality Improvements:

5. Hardcoded TTL Values - Multiple files

Problem: TTL value (43200 seconds / 12 hours) is hardcoded in multiple places.

Suggestion: Create a constant:

export const PROVIDER_KEY_CACHE_TTL = 43200; // 12 hours

6. Inconsistent Error Handling worker/src/routers/api/doRouter.ts:54-60

Problem: Error handling returns different response formats.

Suggestion: Standardize error responses:

catch (e) {
  const error = e instanceof Error ? e.message : "Internal server error";
  console.error("Error updating provider keys in DO:", e);
  return InternalResponse.newError(error, 500);
}

7. Missing Input Validation valhalla/jawn/src/lib/clients/cloudflareKV.ts:186-231

Problem: The updateProviderKeysInDO function doesn't validate provider keys before sending.

Suggestion: Add validation:

export async function updateProviderKeysInDO(
  orgId: string,
  providerKeys: ProviderKey[]
): Promise<{ success: boolean; error?: string }> {
  // Validate inputs
  if (!orgId || typeof orgId !== 'string') {
    return { success: false, error: 'Invalid orgId' };
  }
  
  if (!Array.isArray(providerKeys)) {
    return { success: false, error: 'Invalid providerKeys format' };
  }
  
  // ... rest of implementation
}

Architectural Concerns:

8. Authentication Security worker/src/routers/api/auth.ts:22-31

Problem: While timing-safe comparison is good, the function could be more robust.

Suggestion: Add rate limiting and additional security headers:

export function validateAdminAuth(
  requestWrapper: RequestWrapper,
  env: Env,
  rateLimitKey?: string // Add optional rate limiting
): Response | null {
  // Add rate limiting logic here if rateLimitKey provided
  
  const authHeader = requestWrapper.headers.get("Authorization");
  if (!authHeader?.startsWith("Bearer ")) {
    return InternalResponse.unauthorized();
  }
  // ... rest of validation
}

Minor Improvements:

9. Transaction Usage worker/src/lib/durable-objects/ProviderKeyCache.ts:99

The transaction usage is good, but consider adding error handling within the transaction.

10. Documentation - Multiple files

Consider adding JSDoc comments to public methods for better maintainability.


Suggestions Summary:

Critical Fixes Needed:

  • Fix unsafe type casting in ProviderKeyCache (Issue #1)
  • Improve error handling for Durable Object operations (Issue #2)

Recommended Improvements:

  • Optimize cache update strategy to prevent duplicates (Issue #3)
  • Add consistent input validation across endpoints (Issue #7)
  • Extract hardcoded TTL values to constants (Issue #5)

Performance Optimizations:

  • Consider schema optimization for better query performance (Issue #4)
  • Add timeout handling for Durable Object calls (Issue #2)

Security Enhancements:

  • Add rate limiting to admin endpoints (Issue #8)
  • Validate provider names before storing (Issue #1)

The implementation provides a solid fallback mechanism for provider key caching, which addresses the core requirement. The Durable Object approach is architecturally sound and the three-tier fallback strategy (KV → DO → Postgres) is well-designed for reliability.

claude[bot] avatar Nov 18 '25 01:11 claude[bot]

Greptile Summary

  • Adds Durable Object-based fallback cache for provider keys when KV namespace fails, protecting Postgres from traffic spikes
  • Implements three-tier caching strategy: KV Cache → DO Cache → Postgres with automatic fallback and cache repopulation
  • Creates new authenticated API endpoints for Jawn to update DO cache alongside KV updates

Confidence Score: 3/5

  • Safe to merge with caution - solid architecture but has cache consistency bugs that need fixing
  • Score reflects well-designed fallback architecture and proper authentication, but deducted points for cache duplication bug in ProviderKeysManager.ts:162-164 where DB keys are appended instead of merged, and missing error handling for JSON.parse in ProviderKeyCache.ts:161 that could crash the Durable Object
  • Pay close attention to worker/src/lib/managers/ProviderKeysManager.ts for the cache merging logic that creates duplicates

Important Files Changed

Filename Overview
worker/src/lib/durable-objects/ProviderKeyCache.ts New Durable Object implementation for provider key caching with SQL storage, TTL checking, and transaction support
worker/src/lib/managers/ProviderKeysManager.ts Implements fallback logic: KV → DO → Postgres, with cache writes after DB fetch but potential cache consistency issues
valhalla/jawn/src/lib/clients/cloudflareKV.ts Added DO update and invalidation functions for provider keys with HTTP calls to worker endpoints

Sequence Diagram

sequenceDiagram
    participant User
    participant Worker
    participant KVCache
    participant DOCache
    participant Postgres
    
    User->>Worker: Request with provider key needed
    Worker->>KVCache: Get provider keys for org
    alt KV Success
        KVCache-->>Worker: Return cached keys
        Worker-->>User: Process request with key
    else KV Fails
        KVCache--xWorker: Error/Timeout
        Worker->>DOCache: Fallback: Get provider keys
        alt DO Success
            DOCache-->>Worker: Return cached keys
            Worker-->>User: Process request with key
        else DO Fails/Empty
            DOCache--xWorker: Error/Empty
            Worker->>Postgres: Fetch from database
            Postgres-->>Worker: Return keys from DB
            Worker->>DOCache: Update DO cache
            Worker->>KVCache: Update KV cache (if not failed)
            Worker-->>User: Process request with key
        end
    end

greptile-apps[bot] avatar Nov 18 '25 01:11 greptile-apps[bot]