What Multi-Provider AI Abstraction Actually Costs You
The real price of unifying OpenRouter, Gemini, Ollama, OpenAI, and Anthropic behind a single interface — what it solves, what it adds, and the tradeoff that surfaces when you're routing 200+ models in production.
The first version had one provider: OpenRouter. That lasted until OpenRouter returned a 429 during a product demo. The feature failed silently, the user saw a spinner that never resolved, and the bug report said "AI is broken."
The second version added a manual fallback — if OpenRouter fails, try Ollama. That lasted until both were unavailable simultaneously, at which point the fallback logic lived in three different places in the application code and none of them agreed on what "failed" meant.
What I'm describing here is the version that came after those two.
The Interface
The naive approach is provider-specific code at the call site:
if (provider === "openai") {
  const response = await openai.chat.completions.create({ ... });
} else if (provider === "gemini") {
  const response = await gemini.generateContent({ ... });
}
At five providers, every new model touches business logic. Error handling diverges. There is nothing unified to observe, so degradation shows up as a user complaint rather than a metric. The abstraction goal is simple: the application sends a message and gets a response, and never knows which provider answered.
Every adapter implements one interface:
interface AIProvider {
  name: string;
  isAvailable(): Promise<boolean>;
  complete(messages: Message[], options: CompletionOptions): Promise<string>;
  stream(messages: Message[], options: CompletionOptions): AsyncIterable<string>;
}
isAvailable matters more than it looks. I originally wired health checks to complete: just try it and see if it throws. That stopped working when I discovered a provider could pass a lightweight ping but still return 429s on actual completions, because the ping and completion endpoints were separate and had separate rate limits. Health checks and completions need to be distinct operations.
Everything else — normalising SDK response shapes, wrapping provider-specific error formats into a shared AIError class, handling Gemini's safety filter response format — stays contained inside the adapter. It never leaks outward.
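As a rough illustration of what "wrapping provider-specific error formats" could look like, here is a minimal sketch. Only the AIError name and the isRateLimitError / isTransientError helper names come from this post; the kind field, the toAIError function, and the assumption that a raw SDK error carries a numeric status are all hypothetical.

```typescript
// Hypothetical error taxonomy; the post only names the AIError class itself.
type AIErrorKind = "rate_limit" | "transient" | "fatal";

class AIError extends Error {
  provider: string;
  kind: AIErrorKind;
  status: number | undefined;

  constructor(message: string, provider: string, kind: AIErrorKind, status?: number) {
    super(message);
    this.name = "AIError";
    this.provider = provider;
    this.kind = kind;
    this.status = status;
  }
}

// Assumption: the raw error exposes an HTTP-like numeric `status`. Not every SDK does.
function readStatus(raw: unknown): number | undefined {
  if (typeof raw !== "object" || raw === null) return undefined;
  const status = (raw as { status?: unknown }).status;
  return typeof status === "number" ? status : undefined;
}

// Called inside an adapter's catch block so only AIError ever leaves the adapter.
function toAIError(provider: string, raw: unknown): AIError {
  if (raw instanceof AIError) return raw;
  const status = readStatus(raw);
  const message = raw instanceof Error ? raw.message : String(raw);
  let kind: AIErrorKind = "fatal"; // errors without a status land here in this sketch
  if (status === 429) kind = "rate_limit";
  else if (status !== undefined && status >= 500) kind = "transient";
  return new AIError(message, provider, kind, status);
}

// One possible implementation of the predicates the routing loop relies on.
const isRateLimitError = (e: unknown): boolean => e instanceof AIError && e.kind === "rate_limit";
const isTransientError = (e: unknown): boolean => e instanceof AIError && e.kind === "transient";
```

With something like this in place, the routing layer can classify failures without knowing which SDK produced them.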
The Routing Layer
AIManager is a singleton that owns all routing decisions and holds health state in memory:
class AIManager {
  private static instance: AIManager;
  private providers: Map<string, AIProvider>;
  private healthCache: Map<string, { healthy: boolean; checkedAt: number }>;
  private usageCounters: Map<string, number>;
}
The singleton means health state persists across requests without a database. For a single Node.js process kept alive by PM2, this works reliably, although the state resets whenever the process restarts. Multi-instance deployments need a shared store such as Redis; the interface stays identical and only the backing store changes.
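The in-memory health state can be sketched as a small class with expiring "unhealthy" marks. This is an illustration under assumptions, not the manager's actual code: HealthCache, its method signatures, and the injectable clock (added so the behaviour can be tested without waiting) are all hypothetical.

```typescript
// Sketch of an in-memory health cache. Names and shape are illustrative assumptions.
class HealthCache {
  private unhealthyUntil = new Map<string, number>();
  private now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Take the provider out of rotation for `backoffMs` milliseconds.
  markUnhealthy(name: string, backoffMs: number): void {
    this.unhealthyUntil.set(name, this.now() + backoffMs);
  }

  // Healthy means: never marked, or the backoff window has expired.
  isHealthy(name: string): boolean {
    const until = this.unhealthyUntil.get(name);
    if (until === undefined) return true;
    if (this.now() >= until) {
      this.unhealthyUntil.delete(name);
      return true;
    }
    return false;
  }
}
```

Because the map lives in process memory, every instance of the server would hold its own copy, which is why multi-instance deployments need a shared store.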
Routing tries providers in priority order. A rate limit or transient failure marks the provider unhealthy and moves on:
for (const provider of ordered) {
  if (!this.isHealthy(provider.name)) continue;
  try {
    const result = await provider.complete(messages, options);
    this.recordSuccess(provider.name);
    return result;
  } catch (error) {
    if (isRateLimitError(error) || isTransientError(error)) {
      this.markUnhealthy(provider.name, BACKOFF_MS);
      continue;
    }
    throw error;
  }
}
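The fragment above does not show what happens once every provider has been tried or skipped. One way to make that case explicit, so the caller gets a definite error instead of a spinner that never resolves, is sketched below. Provider, RetryableError, AllProvidersFailedError, completeWithFallback, and the two callbacks are illustrative names, simplified from the real interface.

```typescript
// Simplified stand-in for the adapter interface; a single prompt string keeps the sketch short.
interface Provider {
  name: string;
  complete(prompt: string): Promise<string>;
}

// Hypothetical marker for failures that should trigger fallback (rate limits, transient errors).
class RetryableError extends Error {}

// Thrown when the loop runs out of providers, carrying a record of what was tried.
class AllProvidersFailedError extends Error {
  attempts: string[];
  constructor(attempts: string[]) {
    super(`All providers failed or were skipped: ${attempts.join(", ")}`);
    this.name = "AllProvidersFailedError";
    this.attempts = attempts;
  }
}

async function completeWithFallback(
  ordered: Provider[],
  prompt: string,
  isHealthy: (name: string) => boolean,
  markUnhealthy: (name: string) => void,
): Promise<string> {
  const attempts: string[] = [];
  for (const provider of ordered) {
    if (!isHealthy(provider.name)) {
      attempts.push(`${provider.name}: skipped`);
      continue;
    }
    try {
      return await provider.complete(prompt);
    } catch (error) {
      // Non-retryable errors propagate immediately, as in the loop above.
      if (!(error instanceof RetryableError)) throw error;
      markUnhealthy(provider.name);
      attempts.push(`${provider.name}: ${error.message}`);
    }
  }
  // Explicit exhaustion: the caller can show a real error state.
  throw new AllProvidersFailedError(attempts);
}
```

The attempts list doubles as a cheap debugging aid: the error message alone says which providers were skipped and which failed.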
The routing loop handled most failure cases. It did not handle one I hadn't anticipated.
The Thing the Backoff Didn't Cover
After the system was in production for a few weeks, I started seeing elevated response times that didn't map to any single provider being down. The logs showed transient errors — normal-looking 429s, nothing alarming. I added more granular logging and the picture that came back was strange: OpenRouter was failing, backing off for 30 seconds, re-entering rotation, and failing again almost immediately. The backoff window kept resetting. The provider was cycling in and out of rotation continuously while the other providers absorbed a disproportionate share of requests, and each individual failure looked normal in isolation so nothing in the metrics flagged it.
I had to graph failure timestamps per provider before the pattern became obvious. Once I saw it, the cause was clear: per-failure backoff is reactive. It handles a spike but has no memory of repeated failures. A provider that fails on 80% of requests will spend most of its time in backoff windows but still re-enter rotation regularly, causing constant noise for any request unlucky enough to hit it.
The circuit breaker adds aggregate state. When failures exceed a threshold in a rolling window, the circuit opens and the provider exits rotation entirely until a background health check confirms it has recovered:
private checkCircuit(providerName: string): "open" | "closed" {
  const stats = this.circuitStats.get(providerName);
  if (!stats) return "closed";
  const recentFailures = stats.failures.filter(
    (t) => Date.now() - t < CIRCUIT_WINDOW_MS
  );
  if (recentFailures.length >= CIRCUIT_THRESHOLD) return "open";
  return "closed";
}
The distinction: backoff is per-failure and short (seconds). Circuit breaking is aggregate and longer (minutes). They solve different problems. I only understood that after watching the thrashing for long enough to see the pattern.
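To make the aggregate state concrete, here is a sketch of a breaker that records failures in a rolling window and, once open, stays open until a health probe reports success. It is one possible reading of the behaviour described above, not the production code: CircuitBreaker, recordFailure, recordProbeSuccess, and the injectable clock are assumed names. Unlike checkCircuit, this version does not close on its own when old failures age out of the window.

```typescript
// Thresholds mirror the values quoted in this post; everything else is an assumption.
const CIRCUIT_THRESHOLD = 5;
const CIRCUIT_WINDOW_MS = 60_000;

class CircuitBreaker {
  private failures: number[] = [];
  private open = false;
  private now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  recordFailure(): void {
    const t = this.now();
    // Prune timestamps outside the window so the array cannot grow without bound.
    this.failures = this.failures.filter((f) => t - f < CIRCUIT_WINDOW_MS);
    this.failures.push(t);
    if (this.failures.length >= CIRCUIT_THRESHOLD) this.open = true;
  }

  // Intended to be called by a background health check, not by user traffic.
  recordProbeSuccess(): void {
    this.failures = [];
    this.open = false;
  }

  state(): "open" | "closed" {
    return this.open ? "open" : "closed";
  }
}
```

Keeping the circuit latched until a probe succeeds is what stops the provider from cycling in and out of rotation: user requests never serve as the recovery test.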
Structured Output
One smaller problem worth naming: asking models to return JSON sounds trivial and isn't. Models wrap JSON in markdown fences, add trailing commas, and return partial JSON when token limits are hit. A minimal multi-format parser handles the fence and surrounding-prose cases without needing model-specific logic; trailing commas and truncated output still fail to parse and surface as an explicit error:
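import { ZodSchema } from "zod";

function parseStructuredOutput<T>(raw: string, schema: ZodSchema<T>): T {
  // Strip a leading and trailing markdown fence, tolerating surrounding whitespace.
  const stripped = raw.trim().replace(/^```(?:json)?\s*/, "").replace(/\s*```$/, "");
  // First attempt: the whole payload is valid JSON.
  try { return schema.parse(JSON.parse(stripped)); } catch {}
  // Second attempt: pull the outermost {...} or [...] out of any surrounding prose.
  const match = stripped.match(/[\[{][\s\S]*[\]}]/);
  if (match) {
    try { return schema.parse(JSON.parse(match[0])); } catch {}
  }
  throw new Error(`Could not parse structured output: ${raw.slice(0, 200)}`);
}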
Prompting models harder to return clean JSON works most of the time and fails at exactly the wrong moment. The parser is more reliable.
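JSON.parse is strict, so a payload with a trailing comma is rejected even after the fences are stripped. If that case matters, a repair pass could be tried after the strict parse fails. The sketch below is an assumption-laden variant, not the parser used in production: parseLenient and the Validator type are hypothetical, and a plain validator function stands in for the Zod schema so the example has no dependencies (a Zod schema could be passed as `(v) => schema.parse(v)`).

```typescript
// A validator returns the typed value or throws, mirroring what a schema's parse method does.
type Validator<T> = (value: unknown) => T;

function parseLenient<T>(raw: string, validate: Validator<T>): T {
  const stripped = raw.trim().replace(/^```(?:json)?\s*/, "").replace(/\s*```$/, "");
  const match = stripped.match(/[\[{][\s\S]*[\]}]/);
  const candidates = [stripped];
  if (match) candidates.push(match[0]);

  for (const candidate of candidates) {
    // Strict text first, then the same text with trailing commas removed.
    // Caveat: the regex also rewrites ",}" or ",]" inside string values,
    // which is why the repaired variant is only a fallback.
    const repaired = candidate.replace(/,\s*([}\]])/g, "$1");
    for (const text of [candidate, repaired]) {
      try {
        return validate(JSON.parse(text));
      } catch {
        // Fall through to the next variant.
      }
    }
  }
  // Truncated output still ends up here: there is nothing safe to repair.
  throw new Error(`Could not parse structured output: ${raw.slice(0, 200)}`);
}
```

Truncated JSON is deliberately not repaired. Guessing at the missing tail risks returning data the model never produced, so failing loudly is the safer default.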
What This Actually Costs
The abstraction works. Adding a provider is one file. Swapping the primary provider is transparent to the rest of the system.
What it costs is state you cannot see during development. Health cache, circuit stats, usage counters — they live in the singleton and answer questions like "why is latency elevated today?" and "which provider is getting the most traffic?" But only if you built the observability into the manager from the start. I didn't. I added structured logging and counters after the second production incident, once I understood what questions I needed to ask.
The routing logic is the interesting problem. The debugging infrastructure is what you actually depend on. I'd design the observability surface first — what logs get emitted on every routing decision, what counters feed into a dashboard — and treat the routing as secondary. That reversal is not obvious until you've needed the debugging infrastructure and not had it.
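As a starting point for that observability surface, the sketch below emits one structured log line per routing decision and keeps per-provider counters. The event fields, the outcome names, and the RoutingObserver class are all assumptions made for illustration, not the schema I ended up with.

```typescript
// Hypothetical outcome vocabulary for a single routing decision.
type RoutingOutcome =
  | "success"
  | "skipped_unhealthy"
  | "rate_limited"
  | "transient_error"
  | "fatal_error";

interface RoutingEvent {
  requestId: string;
  provider: string;
  outcome: RoutingOutcome;
  latencyMs: number;
  attempt: number; // 1-based position in the fallback order
}

class RoutingObserver {
  private counters = new Map<string, number>();
  private sink: (line: string) => void;

  // The sink is injectable so logs can go to stdout, a file, or a test buffer.
  constructor(sink: (line: string) => void = (line) => console.log(line)) {
    this.sink = sink;
  }

  record(event: RoutingEvent): void {
    const key = `${event.provider}.${event.outcome}`;
    this.counters.set(key, (this.counters.get(key) ?? 0) + 1);
    // One JSON object per line keeps the log easy to grep and to ship to a dashboard.
    this.sink(JSON.stringify({ ts: new Date().toISOString(), ...event }));
  }

  // Plain-object copy of the counters, suitable for a metrics endpoint.
  snapshot(): Record<string, number> {
    const out: Record<string, number> = {};
    this.counters.forEach((count, key) => {
      out[key] = count;
    });
    return out;
  }
}
```

A counter keyed by provider and outcome would have exposed the thrashing described earlier directly: one provider accumulating rate_limited events at a steady rate while the others absorbed the successes.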
Whether the circuit breaker thresholds I landed on are right, I genuinely don't know. The values I'm using (CIRCUIT_THRESHOLD = 5, CIRCUIT_WINDOW_MS = 60_000) came from watching real failure patterns and adjusting. A different traffic profile would probably need different numbers. That's the part of this that is still provisional.