Respect `retry-after` header for APIs (Anthropic at least)
Feature Description
Some APIs might not rely on client-side exponential backoff (at least not on the end-user's implementation of it) and instead indicate how long to wait or when you can try again. I'm working with the Anthropic API, and it reports both: how long to wait (via the general `retry-after` header) and when you can retry (via an Anthropic-specific header).
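For illustration, here is a minimal sketch of how those hints could be read from a 429 response. The helper name `readRetryHints` is made up, and the Anthropic-specific header name (`anthropic-ratelimit-requests-reset`) is my understanding of Anthropic's rate-limit docs, so treat it as an assumption; note that `retry-after` itself can be either delta-seconds or an HTTP-date.

```ts
// Sketch: reading rate-limit hints from a 429 response.
// Header names other than `retry-after` are assumptions based on Anthropic's docs.
function readRetryHints(response: Response) {
  // `retry-after` may be delta-seconds ("30") or an HTTP-date (RFC 9110)
  const retryAfter = response.headers.get('retry-after')
  // Assumed Anthropic-specific header: RFC 3339 timestamp for when the request limit resets
  const requestsReset = response.headers.get('anthropic-ratelimit-requests-reset')

  let waitMs: number | undefined
  if (retryAfter !== null) {
    const seconds = Number(retryAfter)
    waitMs = Number.isNaN(seconds)
      ? Date.parse(retryAfter) - Date.now() // HTTP-date form
      : seconds * 1000                      // delta-seconds form
  } else if (requestsReset !== null) {
    waitMs = Date.parse(requestsReset) - Date.now()
  }

  return { retryAfter, requestsReset, waitMs }
}
```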
Proposal:
Respect the `retry-after` header (at least for the Anthropic API) instead of relying solely on the SDK's own exponential backoff algorithm.
We can keep the current approach as a fallback for when the `retry-after` header isn't provided.
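A minimal sketch of the proposed behavior (the function and parameter names are illustrative, not SDK API): prefer the server-provided `retry-after` value when present, fall back to the existing exponential backoff otherwise.

```ts
// Sketch of the proposed delay selection; `attempt` and `baseDelayMs` are illustrative names.
function nextDelayMs(attempt: number, retryAfterHeader: string | null, baseDelayMs = 2000): number {
  if (retryAfterHeader !== null) {
    const seconds = Number(retryAfterHeader)
    if (!Number.isNaN(seconds)) return seconds * 1000 // delta-seconds form
    const date = Date.parse(retryAfterHeader)
    if (!Number.isNaN(date)) return Math.max(0, date - Date.now()) // HTTP-date form
  }
  // Fallback: keep the current exponential backoff behavior
  return baseDelayMs * 2 ** attempt
}
```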
Alternative:
We could give developers an onRetry or onError (onFailure) hook so they can override error handling. I found an example here: #4842.
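To make that idea concrete, the hook could look something like the sketch below. This is purely a hypothetical API shape: neither `onRetry` nor its arguments exist in the SDK today.

```ts
// Hypothetical API shape only -- `onRetry` does not exist in the SDK today.
import { generateText } from 'ai'
import { anthropic } from '@ai-sdk/anthropic'

const result = await generateText({
  model: anthropic('claude-4-sonnet-20250514'),
  prompt: 'Hello',
  // Hypothetical hook, called before each retry: return a delay in ms to
  // override the backoff, or false to stop retrying.
  onRetry: ({ attempt, response }: { attempt: number; response?: Response }) => {
    const retryAfter = response?.headers.get('retry-after')
    return retryAfter ? Number(retryAfter) * 1000 : attempt < 3
  },
})
```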
Use Cases
When you hit the rate limit with the Anthropic API, you usually exhaust your retries and push back the moment an actual retry could succeed (every failed attempt counts against you), even though the API tells you exactly how long to wait.
Additional context
I'll be happy to help implement this if you can confirm it makes sense (maybe I missed something) and whether you prefer the general `retry-after` approach, the Anthropic-specific implementation, or the onError hook.
Much needed @ezhlobo! We've implemented our own retry mechanism too. I would add that it would be great to be able to specify which errors you do and don't want to retry on.
I've opened an issue about this area too: https://github.com/vercel/ai/issues/3619
Here's a model wrapper I'm using to implement this, built on the `wrapLanguageModel` helper provided by `ai`.
Use it like:
streamText({
  model: wrapWithRetryAfter(anthropic('claude-4-sonnet-20250514')),
  ...
})
import { wrapLanguageModel, type LanguageModelV1, type LanguageModelV1Middleware } from 'ai'

/**
 * Options for the withRetryAfter middleware
 */
export interface WithRetryAfterOptions {
  /** Maximum number of retries (default: 3) */
  maxRetries?: number
  /** Base delay in milliseconds (default: 1000) */
  baseDelay?: number
  /** Maximum delay in milliseconds (default: 60000) */
  maxDelay?: number
  /** Whether to use exponential backoff (default: true) */
  exponentialBackoff?: boolean
  /** Custom retry condition function */
  shouldRetry?: (error: any) => boolean
  /** Custom delay calculation function */
  calculateDelay?: (retryCount: number, retryAfter?: string) => number
}
/**
 * Default options for withRetryAfter
 */
const DEFAULT_OPTIONS: Required<WithRetryAfterOptions> = {
  maxRetries: 3,
  baseDelay: 1000,
  maxDelay: 60000,
  exponentialBackoff: true,
  shouldRetry: (error: any) => {
    // Retry on 429 rate limit errors
    return (
      error?.status === 429 ||
      error?.message?.includes('rate limit') ||
      error?.message?.includes('429')
    )
  },
  calculateDelay: (retryCount: number, retryAfter?: string) => {
    // If retry-after header is provided, use it
    if (retryAfter) {
      const retryAfterSeconds = parseInt(retryAfter, 10)
      if (!isNaN(retryAfterSeconds)) {
        return retryAfterSeconds * 1000
      }
    }
    // Otherwise use exponential backoff
    const delay = Math.min(
      DEFAULT_OPTIONS.baseDelay * Math.pow(2, retryCount),
      DEFAULT_OPTIONS.maxDelay
    )
    // Add jitter to prevent thundering herd
    const jitter = Math.random() * 0.1 * delay
    return delay + jitter
  }
}
/**
 * Sleep function for delays
 */
function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms))
}

/**
 * Extracts retry-after header from error response
 * (handles both Headers objects and plain header records)
 */
function extractRetryAfter(error: any): string | undefined {
  const headers = error?.response?.headers
  if (!headers) return undefined
  return typeof headers.get === 'function'
    ? headers.get('retry-after') ?? undefined
    : headers['retry-after']
}
/**
 * Creates a middleware that implements retry logic with retry-after header support
 */
export function withRetryAfter(options: WithRetryAfterOptions = {}): LanguageModelV1Middleware {
  const opts = { ...DEFAULT_OPTIONS, ...options }

  return {
    middlewareVersion: 'v1',

    wrapGenerate: async ({ doGenerate, params, model }) => {
      let lastError: any
      let retryCount = 0

      while (retryCount <= opts.maxRetries) {
        try {
          return await doGenerate()
        } catch (error) {
          lastError = error

          // Check if we should retry this error
          if (!opts.shouldRetry(error)) {
            throw error
          }

          // Check if we've exceeded max retries
          if (retryCount >= opts.maxRetries) {
            throw error
          }

          // Extract retry-after header
          const retryAfter = extractRetryAfter(error)

          // Calculate delay
          const delay = opts.calculateDelay(retryCount, retryAfter)
          console.log(`Rate limit hit, retrying in ${delay}ms (attempt ${retryCount + 1}/${opts.maxRetries + 1})`)

          // Wait before retrying
          await sleep(delay)
          retryCount++
        }
      }

      throw lastError
    },

    wrapStream: async ({ doStream, params, model }) => {
      // Note: this only retries when doStream() itself rejects (i.e. the request
      // fails before a stream is returned); errors emitted mid-stream are not retried.
      let lastError: any
      let retryCount = 0

      while (retryCount <= opts.maxRetries) {
        try {
          return await doStream()
        } catch (error) {
          lastError = error

          // Check if we should retry this error
          if (!opts.shouldRetry(error)) {
            throw error
          }

          // Check if we've exceeded max retries
          if (retryCount >= opts.maxRetries) {
            throw error
          }

          // Extract retry-after header
          const retryAfter = extractRetryAfter(error)

          // Calculate delay
          const delay = opts.calculateDelay(retryCount, retryAfter)
          console.log(`Rate limit hit, retrying in ${delay}ms (attempt ${retryCount + 1}/${opts.maxRetries + 1})`)

          // Wait before retrying
          await sleep(delay)
          retryCount++
        }
      }

      throw lastError
    }
  }
}
/**
 * Wraps a language model with retry-after middleware
 */
export function wrapWithRetryAfter(
  model: LanguageModelV1,
  options?: WithRetryAfterOptions
): LanguageModelV1 {
  return wrapLanguageModel({
    model,
    middleware: withRetryAfter(options)
  })
}
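For reference, here is one way the wrapper above could be configured to control which errors get retried (the point raised earlier about choosing what to retry on). The status codes checked in `shouldRetry` are assumptions about the error shape surfaced by the provider, not guaranteed SDK behavior.

```ts
import { streamText } from 'ai'
import { anthropic } from '@ai-sdk/anthropic'

const model = wrapWithRetryAfter(anthropic('claude-4-sonnet-20250514'), {
  maxRetries: 5,
  // Only retry rate limits and overloads; these status checks are assumptions
  // about the error shape surfaced by the provider.
  shouldRetry: (error) => error?.status === 429 || error?.status === 529,
  // Always honor retry-after when present, otherwise back off linearly.
  calculateDelay: (retryCount, retryAfter) =>
    retryAfter ? Number(retryAfter) * 1000 : (retryCount + 1) * 2000,
})

const result = streamText({ model, prompt: 'Say hello' })
```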
Would appreciate this! We run into rate limits while concurrently processing background work. I'd rather have things stable, and simply respect the header and wait until we can run again.
Note that a recent contribution was already merged that (in part) fixes this, but it doesn't respect the header if it is >60s, and I would prefer to have the option to respect it regardless.
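If you want to honor long waits anyway, the wrapper posted above already allows it, since `calculateDelay` can ignore any cap. A minimal sketch (the 60s figure is just the limit mentioned above, not something I've verified in the SDK source):

```ts
// Honor retry-after even when it exceeds 60s; fall back to capped backoff otherwise.
import { anthropic } from '@ai-sdk/anthropic'

const patientModel = wrapWithRetryAfter(anthropic('claude-4-sonnet-20250514'), {
  calculateDelay: (retryCount, retryAfter) => {
    const seconds = retryAfter ? Number(retryAfter) : NaN
    if (!Number.isNaN(seconds)) return seconds * 1000 // no upper bound applied
    return Math.min(1000 * 2 ** retryCount, 60_000)   // fallback backoff
  },
})
```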