GPT-4 Fallback Chains: Keeping Your AI Feature Alive When the Provider Isn't

→

OpenAI has 99.9% uptime SLA on their API. That's 8.7 hours of downtime per year. When your AI feature goes down, users don't see an SLA number — they see a broken product. Design for the 0.1%.

TalentSync is a resume-job fit optimizer. Users paste a job description, upload their resume, and get an analysis of fit, gaps, and specific suggestions for tailoring their application. The core feature is entirely AI-powered: extract structure, compare competencies, generate suggestions. No AI = no product.

That dependency is a risk. I designed a fallback chain from day one. Here's the complete implementation: GPT-4 primary, GPT-4o-mini fallback, Zod-validated outputs, retry logic with exponential backoff, and the monitoring that lets you know when things are degraded before users do.

Why You Need a Fallback Chain

LLM provider failures fall into four categories:

1.Hard outages: the API returns 5xx errors. These are visible and rare (~4 hours/year for major providers).
2.Rate limiting: you've exceeded your tier's RPM (requests per minute) or TPM (tokens per minute). This happens more often than you'd expect during traffic spikes.
3.Timeout: the API hangs for 30+ seconds. This happens during high-load periods even when the API isn't fully down.
4.Degraded quality: the API returns 200 but the response is malformed, truncated, or nonsensical. This is the hardest failure mode to catch because it looks like success.

A simple retry handles cases 1 and 3. A fallback model handles case 2 (rate limits on the primary model don't affect rate limits on a different model). Zod validation handles case 4 — it turns a silent quality failure into a detectable error you can route around.

The Output Schema: Start Here

Before writing any LLM code, define what success looks like as a TypeScript type and a Zod schema. This forces clarity and gives you a validation layer that catches bad responses before they reach your users.

schemas.tstypescript

import { z } from "zod";

export const AnalysisResultSchema = z.object({
  fitScore: z.number().min(0).max(100),
  fitSummary: z.string().min(20).max(300),
  strengths: z.array(z.string().min(5)).min(1).max(6),
  gaps: z.array(
    z.object({
      skill: z.string(),
      severity: z.enum(["critical", "moderate", "minor"]),
      suggestion: z.string().min(20),
    })
  ).max(8),
  tailoringTips: z.array(z.string().min(20)).min(2).max(5),
  keywordMatches: z.array(z.string()).max(20),
  atsOptimizations: z.array(z.string()).max(5),
});

export type AnalysisResult = z.infer<typeof AnalysisResultSchema>;

// Minimal fallback schema — used when the full schema fails
// This ensures we can always return something useful
export const MinimalAnalysisSchema = z.object({
  fitScore: z.number().min(0).max(100),
  fitSummary: z.string().min(10),
  strengths: z.array(z.string()).min(1),
  gaps: z.array(z.object({
    skill: z.string(),
    severity: z.enum(["critical", "moderate", "minor"]),
    suggestion: z.string(),
  })),
  tailoringTips: z.array(z.string()).min(1),
  keywordMatches: z.array(z.string()),
  atsOptimizations: z.array(z.string()),
});

Structured Outputs vs JSON Mode

OpenAI offers two ways to get structured JSON: JSON mode (set response_format to json_object) and Structured Outputs (set response_format to json_schema with your schema). The difference matters:

—JSON mode: guarantees valid JSON syntax, but not your schema. You can still get {"result": null} instead of the full object.
—Structured Outputs: guarantees adherence to your exact schema. GPT-4o and GPT-4o-mini support this. GPT-4 (original) does not.
—Zod + Structured Outputs: belt and suspenders. Structured Outputs guarantees the schema; Zod validates the values (e.g., fitScore must be 0-100, not -5 or 200).

llm-client.tstypescript

import OpenAI from "openai";
import { zodResponseFormat } from "openai/helpers/zod";
import { z } from "zod";
import { AnalysisResultSchema, MinimalAnalysisSchema } from "./schemas";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

interface CallOptions {
  model: string;
  systemPrompt: string;
  userPrompt: string;
  schema: z.ZodSchema;
  maxTokens?: number;
  timeoutMs?: number;
}

async function callWithStructuredOutput<T>(options: CallOptions): Promise<T> {
  const { model, systemPrompt, userPrompt, schema, maxTokens = 2048, timeoutMs = 30_000 } = options;

  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), timeoutMs);

  try {
    const response = await openai.beta.chat.completions.parse(
      {
        model,
        messages: [
          { role: "system", content: systemPrompt },
          { role: "user", content: userPrompt },
        ],
        response_format: zodResponseFormat(schema, "analysis"),
        max_tokens: maxTokens,
      },
      { signal: controller.signal }
    );

    const parsed = response.choices[0].message.parsed;
    if (!parsed) {
      throw new Error("Model refused to generate a structured response");
    }

    // Validate with Zod (belt and suspenders after Structured Outputs)
    return schema.parse(parsed) as T;
  } finally {
    clearTimeout(timeout);
  }
}

The Fallback Chain Implementation

The chain has three levels: GPT-4o primary (best quality), GPT-4o-mini fallback (cheaper, faster, still good), and a degraded mode that returns a partial result. Each level tries twice with exponential backoff before falling to the next.

analyze.tstypescript

import { AnalysisResultSchema, MinimalAnalysisSchema, type AnalysisResult } from "./schemas";
import { callWithStructuredOutput } from "./llm-client";

const SYSTEM_PROMPT = `You are a resume analysis expert. Given a job description and resume,
provide a detailed fit analysis with specific, actionable recommendations.
Be concrete and specific — vague advice is worthless to job seekers.`;

function buildUserPrompt(jobDescription: string, resumeText: string): string {
  return `JOB DESCRIPTION:
${jobDescription.slice(0, 3000)}

RESUME:
${resumeText.slice(0, 3000)}

Analyze the fit between this resume and job description. Be specific about gaps and provide
concrete suggestions for how to address each gap in the resume.`;
}

async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts: number,
  baseDelayMs: number = 1000
): Promise<T> {
  let lastError: Error | undefined;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err as Error;

      // Don't retry on validation errors — bad output won't improve with retries
      if (err instanceof ZodError) throw err;

      // Don't retry on abort (timeout)
      if ((err as Error).name === "AbortError") throw err;

      if (attempt < maxAttempts) {
        const delay = baseDelayMs * Math.pow(2, attempt - 1);
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
  }

  throw lastError;
}

export async function analyzeResumeFit(
  jobDescription: string,
  resumeText: string
): Promise<AnalysisResult & { model: string; degraded: boolean }> {
  const userPrompt = buildUserPrompt(jobDescription, resumeText);

  // Level 1: GPT-4o with full schema, 2 attempts
  try {
    const result = await withRetry(
      () => callWithStructuredOutput<AnalysisResult>({
        model: "gpt-4o",
        systemPrompt: SYSTEM_PROMPT,
        userPrompt,
        schema: AnalysisResultSchema,
        timeoutMs: 45_000,
      }),
      2,
      1000
    );
    return { ...result, model: "gpt-4o", degraded: false };
  } catch (primaryError) {
    console.error("[analyzeResumeFit] GPT-4o failed:", (primaryError as Error).message);
  }

  // Level 2: GPT-4o-mini fallback, 2 attempts
  try {
    const result = await withRetry(
      () => callWithStructuredOutput<AnalysisResult>({
        model: "gpt-4o-mini",
        systemPrompt: SYSTEM_PROMPT,
        userPrompt,
        schema: AnalysisResultSchema,
        timeoutMs: 30_000,
      }),
      2,
      500
    );
    return { ...result, model: "gpt-4o-mini", degraded: false };
  } catch (fallbackError) {
    console.error("[analyzeResumeFit] GPT-4o-mini failed:", (fallbackError as Error).message);
  }

  // Level 3: GPT-4o-mini with minimal schema — return partial result
  // This almost never fails unless OpenAI is completely down
  try {
    const result = await callWithStructuredOutput({
      model: "gpt-4o-mini",
      systemPrompt: SYSTEM_PROMPT,
      userPrompt,
      schema: MinimalAnalysisSchema,
      maxTokens: 1024,  // Shorter = more likely to succeed
      timeoutMs: 20_000,
    });
    return { ...result as AnalysisResult, model: "gpt-4o-mini", degraded: true };
  } catch {
    throw new Error("All LLM providers failed. This indicates a complete OpenAI outage.");
  }
}

The Failure Modes You Don't Expect

After running this in production for several months, here are the failures I didn't anticipate:

—Rate limits on gpt-4o don't affect gpt-4o-mini rate limits. This is the biggest win of the two-model approach — different models have independent rate limit buckets.
—Structured Outputs occasionally returns null for a required field. The model 'refuses' to fill in a field it considers sensitive or inappropriate. Zod catches this; the fallback handles it.
—Long resumes (10+ pages) hit context window limits and produce truncated JSON. We now truncate inputs to 3,000 chars each before sending. For longer documents, we extract the most relevant sections first.
—The degraded mode (level 3) returns a result users can still act on. It's better than a generic error message. Tell users when they're getting a degraded response — they appreciate honesty.

Monitoring: How You Know Before Users Do

monitoring.tstypescript

// Track model usage and fallback rates
export async function trackAnalysis(result: {
  model: string;
  degraded: boolean;
  durationMs: number;
  userId: string;
}) {
  await analytics.track({
    event: "resume_analysis_complete",
    properties: {
      model: result.model,
      degraded: result.degraded,
      duration_ms: result.durationMs,
      is_fallback: result.model === "gpt-4o-mini",
    },
  });

  // Alert if fallback rate > 5% over rolling 5 minutes
  await incrementCounter("llm_requests_total");
  if (result.model === "gpt-4o-mini") {
    await incrementCounter("llm_fallback_total");
  }
}

// In your monitoring system, alert on:
// llm_fallback_total / llm_requests_total > 0.05 over 5m
// llm_fallback_total / llm_requests_total > 0.20 over 1m (page someone)

99.2%

Primary model success rate

GPT-4o in production

0.8%

Fallback trigger rate

Routes to GPT-4o-mini

<0.01%

Level 3 trigger rate

Degraded mode almost never needed

User-visible errors

Since adding fallback chain

Cost Implications

GPT-4o costs roughly 10x more per token than GPT-4o-mini. With a 0.8% fallback rate, the cost impact is negligible — you're paying for 0.8% of requests at 10x lower cost. The real cost saving is the avoided incident response and user churn from an outage.

One tactical note: we do NOT use the fallback model for cost optimization. The fallback is strictly for reliability. If you start routing requests to the cheaper model for cost reasons, you've mixed two concerns and the reliability semantics break down.

→

Zod is the most underrated part of this architecture. Without it, a malformed LLM response is a runtime error somewhere deep in your rendering code with a confusing stack trace. With it, malformed responses fail fast at the boundary with a clear error you can route around. Always validate LLM outputs at the API boundary.