feat: [ENG-3548] Durable Object fallback for AI Gateway provider keys
https://linear.app/helicone/issue/ENG-3548/fallback-for-keys-when-kvcache-dies Affects Jawn + Worker
Context
In the past we've experienced times when KVCache would fail. Given the traffic of the AI Gateway these days and the way its trending, if this ever happens again our Postgres will die. This adds another layer of caching for provider keys by falling back to a Durable Object namespace built for this use case.
The latest updates on your projects. Learn more about Vercel for GitHub.
3 Skipped Deployments
| Project | Deployment | Preview | Comments | Updated (UTC) |
|---|---|---|---|---|
| helicone | Nov 18, 2025 1:47am | |||
| helicone-bifrost | Nov 18, 2025 1:47am | |||
| helicone-eu | Nov 18, 2025 1:47am |
Claude finished @connortbot's task —— View job
PR Review: Durable Object fallback for AI Gateway provider keys
Todo List:
- [x] Read repository guidelines (CLAUDE.md)
- [x] Analyze new Durable Object implementation
- [x] Review Jawn backend changes
- [x] Review Worker integration changes
- [x] Check error handling and fallback logic
- [x] Assess security implications
- [x] Evaluate performance impact
- [x] Provide comprehensive feedback with confidence score
Score: 6/10 - Minor issues and improvements suggested, merge with consideration
Critical Issues & Security Concerns:
1. Type Safety Issue in ProviderKeyCache worker/src/lib/durable-objects/ProviderKeyCache.ts:156
provider: row.provider as any, // Cast from string to ModelProviderName
Problem: Unsafe type casting without validation could lead to runtime errors if invalid provider names are stored.
Suggestion: Add validation:
private validateProvider(provider: string): ModelProviderName {
const validProviders = ['openai', 'anthropic', 'azure']; // Add all valid providers
if (!validProviders.includes(provider)) {
throw new Error(`Invalid provider: ${provider}`);
}
return provider as ModelProviderName;
}
// Then use: provider: this.validateProvider(row.provider)
2. Missing Error Boundaries in ProviderKeysManager worker/src/lib/managers/ProviderKeysManager.ts:118-133
Problem: If the Durable Object binding is undefined or DO operations fail, the system may not gracefully degrade.
Suggestion: Add better null checks and error handling:
if (kvFailed && this.env.PROVIDER_KEY_CACHE) {
try {
const doId = this.env.PROVIDER_KEY_CACHE.idFromName(orgId);
const doStub: any = this.env.PROVIDER_KEY_CACHE.get(doId);
// Add timeout and better error handling
const doKeys: ProviderKey[] | null = await Promise.race([
doStub.getProviderKeys(orgId),
new Promise((_, reject) => setTimeout(() => reject(new Error('DO timeout')), 5000))
]);
// ... rest of logic
} catch (error) {
console.error("DO cache failed:", error);
// Consider adding metrics/alerting here
}
}
Performance & Optimization Issues:
3. Inefficient Cache Update Strategy worker/src/lib/managers/ProviderKeysManager.ts:157-183
Problem: The KV cache update logic fetches existing keys and appends new ones, which could lead to duplicates and bloated cache.
Suggestion: Replace the cache entirely instead of appending:
// Instead of appending, replace the entire cache
await storeInCache(
`provider_keys_${orgId}`,
JSON.stringify(keysFromDb), // Use fresh data from DB
this.env,
43200
);
4. Database Schema Optimization worker/src/lib/durable-objects/ProviderKeyCache.ts:26-50
Problem: The schema creates a composite primary key that might not be optimal for the query patterns.
Suggestion: Consider adding a dedicated primary key:
CREATE TABLE IF NOT EXISTS provider_keys (
id TEXT PRIMARY KEY DEFAULT (hex(randomblob(16))),
org_id TEXT NOT NULL,
provider TEXT NOT NULL,
-- ... other fields
UNIQUE(org_id, provider, cuid)
);
Code Quality Improvements:
5. Hardcoded TTL Values - Multiple files
Problem: TTL value (43200 seconds / 12 hours) is hardcoded in multiple places.
Suggestion: Create a constant:
export const PROVIDER_KEY_CACHE_TTL = 43200; // 12 hours
6. Inconsistent Error Handling worker/src/routers/api/doRouter.ts:54-60
Problem: Error handling returns different response formats.
Suggestion: Standardize error responses:
catch (e) {
const error = e instanceof Error ? e.message : "Internal server error";
console.error("Error updating provider keys in DO:", e);
return InternalResponse.newError(error, 500);
}
7. Missing Input Validation valhalla/jawn/src/lib/clients/cloudflareKV.ts:186-231
Problem: The updateProviderKeysInDO function doesn't validate provider keys before sending.
Suggestion: Add validation:
export async function updateProviderKeysInDO(
orgId: string,
providerKeys: ProviderKey[]
): Promise<{ success: boolean; error?: string }> {
// Validate inputs
if (!orgId || typeof orgId !== 'string') {
return { success: false, error: 'Invalid orgId' };
}
if (!Array.isArray(providerKeys)) {
return { success: false, error: 'Invalid providerKeys format' };
}
// ... rest of implementation
}
Architectural Concerns:
8. Authentication Security worker/src/routers/api/auth.ts:22-31
Problem: While timing-safe comparison is good, the function could be more robust.
Suggestion: Add rate limiting and additional security headers:
export function validateAdminAuth(
requestWrapper: RequestWrapper,
env: Env,
rateLimitKey?: string // Add optional rate limiting
): Response | null {
// Add rate limiting logic here if rateLimitKey provided
const authHeader = requestWrapper.headers.get("Authorization");
if (!authHeader?.startsWith("Bearer ")) {
return InternalResponse.unauthorized();
}
// ... rest of validation
}
Minor Improvements:
9. Transaction Usage worker/src/lib/durable-objects/ProviderKeyCache.ts:99
The transaction usage is good, but consider adding error handling within the transaction.
10. Documentation - Multiple files
Consider adding JSDoc comments to public methods for better maintainability.
Suggestions Summary:
Critical Fixes Needed:
- Fix unsafe type casting in ProviderKeyCache (Issue #1)
- Improve error handling for Durable Object operations (Issue #2)
Recommended Improvements:
- Optimize cache update strategy to prevent duplicates (Issue #3)
- Add consistent input validation across endpoints (Issue #7)
- Extract hardcoded TTL values to constants (Issue #5)
Performance Optimizations:
- Consider schema optimization for better query performance (Issue #4)
- Add timeout handling for Durable Object calls (Issue #2)
Security Enhancements:
- Add rate limiting to admin endpoints (Issue #8)
- Validate provider names before storing (Issue #1)
The implementation provides a solid fallback mechanism for provider key caching, which addresses the core requirement. The Durable Object approach is architecturally sound and the three-tier fallback strategy (KV → DO → Postgres) is well-designed for reliability.
Greptile Summary
- Adds Durable Object-based fallback cache for provider keys when KV namespace fails, protecting Postgres from traffic spikes
- Implements three-tier caching strategy: KV Cache → DO Cache → Postgres with automatic fallback and cache repopulation
- Creates new authenticated API endpoints for Jawn to update DO cache alongside KV updates
Confidence Score: 3/5
- Safe to merge with caution - solid architecture but has cache consistency bugs that need fixing
- Score reflects well-designed fallback architecture and proper authentication, but deducted points for cache duplication bug in ProviderKeysManager.ts:162-164 where DB keys are appended instead of merged, and missing error handling for JSON.parse in ProviderKeyCache.ts:161 that could crash the Durable Object
- Pay close attention to
worker/src/lib/managers/ProviderKeysManager.tsfor the cache merging logic that creates duplicates
Important Files Changed
| Filename | Overview |
|---|---|
| worker/src/lib/durable-objects/ProviderKeyCache.ts | New Durable Object implementation for provider key caching with SQL storage, TTL checking, and transaction support |
| worker/src/lib/managers/ProviderKeysManager.ts | Implements fallback logic: KV → DO → Postgres, with cache writes after DB fetch but potential cache consistency issues |
| valhalla/jawn/src/lib/clients/cloudflareKV.ts | Added DO update and invalidation functions for provider keys with HTTP calls to worker endpoints |
Sequence Diagram
sequenceDiagram
participant User
participant Worker
participant KVCache
participant DOCache
participant Postgres
User->>Worker: Request with provider key needed
Worker->>KVCache: Get provider keys for org
alt KV Success
KVCache-->>Worker: Return cached keys
Worker-->>User: Process request with key
else KV Fails
KVCache--xWorker: Error/Timeout
Worker->>DOCache: Fallback: Get provider keys
alt DO Success
DOCache-->>Worker: Return cached keys
Worker-->>User: Process request with key
else DO Fails/Empty
DOCache--xWorker: Error/Empty
Worker->>Postgres: Fetch from database
Postgres-->>Worker: Return keys from DB
Worker->>DOCache: Update DO cache
Worker->>KVCache: Update KV cache (if not failed)
Worker-->>User: Process request with key
end
end