Why You Need A Gateway
The first time a team calls openai.chat.completions.create directly from application code, everything is fine. The second time — another service, maybe a different provider, maybe a cached version — things start to drift. By the tenth call site, you have:
- No unified view of what's being spent on AI
- No way to switch providers without code changes
- Rate limits that hit individual services unpredictably
- No caching — you pay for the same completion twice
- Observability that lives in the provider dashboard, not yours
- No guardrails at the boundary
An LLM gateway is the same answer that API gateways gave to HTTP sprawl fifteen years ago: a single, shared layer that every outgoing model call passes through. In 2026, two tools dominate for engineering teams that don't want to write their own from scratch: Portkey for routing, fallback, and caching; Langfuse for observability, evaluation, and cost attribution. They compose well, and both run fine self-hosted.
This post is the architecture we recommend and the TypeScript you need to wire it together.
What The Gateway Should Actually Do
A useful gateway has seven jobs. Not every one needs to ship on day one, but all seven should be on the roadmap.
- Unified interface. One SDK or HTTP contract your application calls, regardless of which provider is behind it.
- Provider routing. Route to OpenAI, Anthropic, Google, Mistral, or a self-hosted vLLM endpoint based on model name, feature needs, or cost.
- Fallback and retry. If the primary provider is degraded, fall back automatically.
- Caching. Exact-match and semantic caching to eliminate duplicate calls.
- Rate limiting and budgets. Per-team, per-customer, per-environment.
- Observability. Every call logged with prompts, responses, tokens, latency, cost, and user context.
- Guardrails. Input scrubbing, output filtering, PII redaction, prompt injection detection.
Reference Architecture
Here is the shape we deploy for mid-sized teams:
                   +----------------------+
Client app ----->  |   LLM Gateway API    |  ----->  OpenAI
                   |  (TypeScript / Bun)  |          Anthropic
                   |    + Portkey core    |          Google
                   +----------+-----------+          Self-hosted vLLM
                              |
                              | async traces
                              v
                      +--------------+
                      |   Langfuse   |
                      |  (self-host) |
                      +--------------+
                              |
                              v
                   Cost dashboards, evals,
                   per-customer attribution
The gateway itself is a stateless service — we run it as a Kubernetes Deployment with a horizontal autoscaler. Redis handles cache and rate limits. Langfuse runs as its own stack (web, worker, Postgres, ClickHouse, S3-compatible storage) either self-hosted or via Langfuse Cloud.
The Core Client Interface
Let's start from the application's point of view. Apps should never call provider SDKs directly — they call the gateway through a thin client that enforces the contract.
// packages/llm-client/src/types.ts
export interface LLMRequest {
readonly model: string;
readonly messages: readonly {
readonly role: "system" | "user" | "assistant";
readonly content: string;
}[];
readonly maxTokens?: number;
readonly temperature?: number;
readonly metadata: {
readonly team: string;
readonly customer?: string;
readonly feature: string;
readonly traceId?: string;
};
}
export interface LLMResponse {
readonly text: string;
readonly model: string;
readonly provider: string;
readonly inputTokens: number;
readonly outputTokens: number;
readonly costUsd: number;
readonly latencyMs: number;
readonly cached: boolean;
}
export interface LLMClient {
complete(req: LLMRequest): Promise<LLMResponse>;
}
A few deliberate choices. Metadata is required, not optional — every call must be attributable to a team and feature. Cost and provider are returned explicitly so callers can log them. And the cache hit flag is exposed because it shows up in evals and billing reconciliation.
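For completeness, here is what a minimal implementation of that client can look like: a thin fetch wrapper around the gateway's HTTP endpoint. This is a sketch; the /v1/complete path, the header names, and the file location are illustrative rather than anything Portkey or Langfuse prescribe.
// packages/llm-client/src/http-client.ts (illustrative)
import type { LLMClient, LLMRequest, LLMResponse } from "./types";

export class HttpLLMClient implements LLMClient {
  constructor(
    private readonly baseUrl: string,
    private readonly apiKey: string,
  ) {}

  async complete(req: LLMRequest): Promise<LLMResponse> {
    // The gateway owns the provider credentials; the app only holds a gateway key.
    const res = await fetch(`${this.baseUrl}/v1/complete`, {
      method: "POST",
      headers: {
        "content-type": "application/json",
        authorization: `Bearer ${this.apiKey}`,
      },
      body: JSON.stringify(req),
    });
    if (!res.ok) {
      throw new Error(`gateway error ${res.status}: ${await res.text()}`);
    }
    return (await res.json()) as LLMResponse;
  }
}
The application never sees a provider API key; swapping providers, keys, or routing rules is invisible to callers.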
The Gateway Implementation
The gateway is a Bun service that uses the Portkey SDK as the core routing engine and Langfuse for tracing.
// apps/llm-gateway/src/gateway.ts
import Portkey from "portkey-ai";
import { Langfuse } from "langfuse";
import type { LLMRequest, LLMResponse } from "@example/llm-client";
export interface GatewayConfig {
readonly portkeyApiKey: string;
readonly langfusePublicKey: string;
readonly langfuseSecretKey: string;
readonly langfuseHost: string;
}
export class LLMGateway {
private readonly portkey: Portkey;
private readonly langfuse: Langfuse;
constructor(cfg: GatewayConfig) {
this.portkey = new Portkey({
apiKey: cfg.portkeyApiKey,
config: "pc-main-router",
});
this.langfuse = new Langfuse({
publicKey: cfg.langfusePublicKey,
secretKey: cfg.langfuseSecretKey,
baseUrl: cfg.langfuseHost,
});
}
async complete(req: LLMRequest): Promise<LLMResponse> {
const trace = this.langfuse.trace({
name: req.metadata.feature,
userId: req.metadata.customer,
metadata: {
team: req.metadata.team,
upstream_trace: req.metadata.traceId,
},
});
const generation = trace.generation({
name: "llm.complete",
model: req.model,
input: req.messages,
modelParameters: {
temperature: req.temperature ?? 0.7,
maxTokens: req.maxTokens ?? 1024,
},
});
const started = Date.now();
try {
const result = await this.portkey.chat.completions.create({
model: req.model,
messages: req.messages as { role: string; content: string }[],
max_tokens: req.maxTokens ?? 1024,
temperature: req.temperature ?? 0.7,
metadata: {
team: req.metadata.team,
customer: req.metadata.customer ?? "unknown",
feature: req.metadata.feature,
},
});
const choice = result.choices[0];
const text = choice?.message?.content ?? "";
const usage = result.usage ?? { prompt_tokens: 0, completion_tokens: 0 };
const latency = Date.now() - started;
const cached = (result as { cached?: boolean }).cached ?? false;
generation.end({
output: text,
usage: {
promptTokens: usage.prompt_tokens,
completionTokens: usage.completion_tokens,
},
});
return {
text,
model: result.model,
provider: (result as { provider?: string }).provider ?? "unknown",
inputTokens: usage.prompt_tokens,
outputTokens: usage.completion_tokens,
costUsd: estimateCost(result.model, usage),
latencyMs: latency,
cached,
};
} catch (err) {
generation.end({ output: null, level: "ERROR", statusMessage: String(err) });
throw err;
} finally {
await this.langfuse.flushAsync();
}
}
}
function estimateCost(
model: string,
usage: { prompt_tokens: number; completion_tokens: number },
): number {
const rates: Record<string, { input: number; output: number }> = {
"gpt-4.1": { input: 2.5e-6, output: 10e-6 },
"claude-sonnet-4-6-20260115": { input: 3e-6, output: 15e-6 },
"gemini-2.5-pro": { input: 1.25e-6, output: 5e-6 },
};
const rate = rates[model] ?? { input: 0, output: 0 };
return usage.prompt_tokens * rate.input + usage.completion_tokens * rate.output;
}
The Portkey config referenced by pc-main-router lives in the Portkey console (or self-hosted config file). That's where routing, fallback, and caching rules are declared — not in code.
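For completeness, here is a sketch of the HTTP layer in front of the class. It assumes Bun.serve, the /v1/complete path used by the client sketch earlier, and the /ready and /live probe paths referenced by the Kubernetes manifests later in this post; auth and error mapping are omitted.
// apps/llm-gateway/src/server.ts (illustrative entrypoint)
import { LLMGateway } from "./gateway";
import type { LLMRequest } from "@example/llm-client";

const gateway = new LLMGateway({
  portkeyApiKey: process.env.PORTKEY_API_KEY!,
  langfusePublicKey: process.env.LANGFUSE_PUBLIC_KEY!,
  langfuseSecretKey: process.env.LANGFUSE_SECRET_KEY!,
  langfuseHost: process.env.LANGFUSE_HOST!,
});

Bun.serve({
  port: 8080,
  async fetch(req) {
    const url = new URL(req.url);
    // Probes used by the readiness and liveness checks in the Deployment manifest.
    if (url.pathname === "/live" || url.pathname === "/ready") {
      return new Response("ok");
    }
    if (req.method === "POST" && url.pathname === "/v1/complete") {
      const body = (await req.json()) as LLMRequest;
      const result = await gateway.complete(body);
      return Response.json(result);
    }
    return new Response("not found", { status: 404 });
  },
});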
Routing And Fallback As Config
Portkey uses a JSON config to describe routing strategies. A realistic config with a primary, a fallback, semantic caching, and a canary for a new model:
{
"strategy": { "mode": "fallback" },
"targets": [
{
"strategy": { "mode": "loadbalance" },
"targets": [
{
"virtual_key": "openai-prod",
"override_params": { "model": "gpt-4.1" },
"weight": 0.9
},
{
"virtual_key": "openai-prod",
"override_params": { "model": "gpt-4.1-mini" },
"weight": 0.1
}
]
},
{
"virtual_key": "anthropic-prod",
"override_params": { "model": "claude-sonnet-4-6-20260115" }
}
],
"cache": {
"mode": "semantic",
"max_age": 3600
},
"retry": {
"attempts": 3,
"on_status_codes": [429, 500, 502, 503, 504]
}
}
Reading that top-down: 90% of requests go to GPT-4.1, 10% to the mini canary, and if the OpenAI target fails entirely Portkey falls back to Claude. Semantic cache holds for an hour. Retries handle transient errors.
Changing the routing now does not require a code change, a deploy, or even a restart. That's the main reason you want a gateway to begin with.
Semantic Caching
Exact-match caching is trivial — hash the prompt, use Redis. Semantic caching is more interesting: embed the prompt, look up similar prompts in a vector store, return a cached response if similarity exceeds a threshold. Portkey ships this out of the box; the flag is "cache": { "mode": "semantic" } above.
When to use it:
- Yes: FAQ-style support bots, documentation Q&A, known template answers.
- Careful: agents that pull fresh data — stale responses will hurt you.
- No: anything with personalization or real-time inputs.
Cache hit rates we see in production vary wildly by use case: 30-50% for internal docs Q&A, 5-15% for customer-facing assistants, near zero for code generation. Measure yours before assuming savings.
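If you would rather hold the exact-match layer in your own Redis, instead of or in front of Portkey's cache, the mechanism really is as small as described above: hash the request, store the serialized response. A sketch, with illustrative key names and a one-hour TTL.
// apps/llm-gateway/src/exact-cache.ts (illustrative)
import { createHash } from "node:crypto";
import Redis from "ioredis";
import type { LLMRequest, LLMResponse } from "@example/llm-client";

export class ExactMatchCache {
  constructor(
    private readonly redis: Redis,
    private readonly ttlSeconds = 3600,
  ) {}

  private key(req: LLMRequest): string {
    // Hash only the fields that affect the completion, not the metadata.
    const canonical = JSON.stringify({
      model: req.model,
      messages: req.messages,
      maxTokens: req.maxTokens,
      temperature: req.temperature,
    });
    return `llm-cache:${createHash("sha256").update(canonical).digest("hex")}`;
  }

  async get(req: LLMRequest): Promise<LLMResponse | null> {
    const hit = await this.redis.get(this.key(req));
    return hit ? { ...(JSON.parse(hit) as LLMResponse), cached: true } : null;
  }

  async set(req: LLMRequest, res: LLMResponse): Promise<void> {
    await this.redis.set(this.key(req), JSON.stringify(res), "EX", this.ttlSeconds);
  }
}
Note that the metadata block stays out of the hash, so two customers asking the same question share a hit.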
Per-Customer Budgets And Rate Limits
Budgets keep a single customer from burning through the entire monthly budget. The gateway enforces them from Redis before calling the provider.
// apps/llm-gateway/src/budget.ts
import Redis from "ioredis";
export class BudgetEnforcer {
constructor(private readonly redis: Redis) {}
async tryConsume(customer: string, usd: number): Promise<void> {
const key = `budget:${customer}:${this.currentMonth()}`;
const limitKey = `budget-limit:${customer}`;
const [currentRaw, limitRaw] = await this.redis.mget(key, limitKey);
const current = currentRaw ? Number(currentRaw) : 0;
const limit = limitRaw ? Number(limitRaw) : 100;
if (current + usd > limit) {
throw new BudgetExceededError(customer, current, limit);
}
await this.redis.incrbyfloat(key, usd);
await this.redis.expire(key, 60 * 60 * 24 * 40);
}
private currentMonth(): string {
const now = new Date();
return `${now.getUTCFullYear()}-${String(now.getUTCMonth() + 1).padStart(2, "0")}`;
}
}
export class BudgetExceededError extends Error {
constructor(
public readonly customer: string,
public readonly spent: number,
public readonly limit: number,
) {
super(`budget exceeded for ${customer}: $${spent.toFixed(2)} / $${limit.toFixed(2)}`);
}
}
Budgets are charged after the call with the real cost; a pre-flight check against an estimate avoids hitting the provider at all when the customer is already over the limit. In practice we do both, as in the sketch below.
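Here is one way to wire that up, kept outside the gateway class for clarity. The withBudget helper and the estimateRequestCost heuristic are our names, not part of Portkey; the rates and the four-characters-per-token assumption are placeholders.
// apps/llm-gateway/src/with-budget.ts (illustrative)
import type { LLMRequest, LLMResponse } from "@example/llm-client";
import { BudgetEnforcer, BudgetExceededError } from "./budget";

// Rough pre-call estimate: ~4 characters per token for the prompt, and assume
// the response uses the full maxTokens allowance. Rates here are placeholders.
function estimateRequestCost(req: LLMRequest, inputRate = 3e-6, outputRate = 15e-6): number {
  const promptChars = req.messages.reduce((sum, m) => sum + m.content.length, 0);
  return (promptChars / 4) * inputRate + (req.maxTokens ?? 1024) * outputRate;
}

export async function withBudget(
  budget: BudgetEnforcer,
  req: LLMRequest,
  call: (req: LLMRequest) => Promise<LLMResponse>,
): Promise<LLMResponse> {
  const customer = req.metadata.customer ?? req.metadata.team;
  const estimate = estimateRequestCost(req);

  // Pre-flight: reserve the estimate so a customer already over budget never
  // reaches the provider. tryConsume throws BudgetExceededError if it would.
  await budget.tryConsume(customer, estimate);

  const response = await call(req);

  // Post-call: charge the difference between the real cost and the estimate.
  // If that pushes the customer over, don't fail a request the provider has
  // already served; the next pre-flight check is what rejects.
  const extra = response.costUsd - estimate;
  if (extra > 0) {
    await budget.tryConsume(customer, extra).catch((err) => {
      if (!(err instanceof BudgetExceededError)) throw err;
    });
  }
  return response;
}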
Observability With Langfuse
Langfuse gives you traces, generations, scores, and evaluations. It works well because it treats an LLM call as a first-class object with input, output, model, parameters, and cost — not a generic log line.
Things to wire up on day one:
- Traces grouped by user session and feature.
- Generations for each model call.
- Scores attached to generations — user thumbs up/down, offline eval results, safety classifier outputs.
- Datasets for regression testing — save known-good interactions and replay them on every model upgrade.
The SDK adds three calls per request (trace, generation, and end), all shown in the gateway code above. The value you get back is substantial: real cost per customer, real latency per feature, and the ability to pull up last Tuesday's incident and see the exact prompt that caused it.
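As a concrete example, the thumbs up/down score is a single SDK call from whichever service receives the feedback. This sketch assumes the Langfuse trace ID is propagated back to the caller alongside the response; the score name is ours.
// Sketch: recording user feedback as a Langfuse score.
import { Langfuse } from "langfuse";

const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY!,
  secretKey: process.env.LANGFUSE_SECRET_KEY!,
  baseUrl: process.env.LANGFUSE_HOST,
});

export async function recordFeedback(traceId: string, thumbsUp: boolean): Promise<void> {
  langfuse.score({
    traceId,
    name: "user-feedback",
    value: thumbsUp ? 1 : 0,
    comment: thumbsUp ? "thumbs up" : "thumbs down",
  });
  await langfuse.flushAsync();
}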
Deployment
The gateway is a small service and should run as such.
apiVersion: apps/v1
kind: Deployment
metadata:
name: llm-gateway
namespace: ai-platform
spec:
replicas: 3
selector:
matchLabels: { app: llm-gateway }
template:
metadata:
labels: { app: llm-gateway }
spec:
containers:
- name: gateway
image: ghcr.io/example/llm-gateway:1.4.2
ports:
- containerPort: 8080
env:
- name: PORTKEY_API_KEY
valueFrom: { secretKeyRef: { name: portkey, key: api-key } }
- name: LANGFUSE_PUBLIC_KEY
valueFrom: { secretKeyRef: { name: langfuse, key: public } }
- name: LANGFUSE_SECRET_KEY
valueFrom: { secretKeyRef: { name: langfuse, key: secret } }
- name: LANGFUSE_HOST
value: https://langfuse.ai.example.com
- name: REDIS_URL
valueFrom: { secretKeyRef: { name: redis, key: url } }
resources:
requests: { cpu: 200m, memory: 256Mi }
limits: { memory: 512Mi }
readinessProbe:
httpGet: { path: /ready, port: 8080 }
livenessProbe:
httpGet: { path: /live, port: 8080 }
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: llm-gateway
namespace: ai-platform
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: llm-gateway
minReplicas: 3
maxReplicas: 30
metrics:
- type: Resource
resource:
name: cpu
target: { type: Utilization, averageUtilization: 60 }
Run 3 replicas minimum — the gateway is on the critical path of every AI feature in the product and a single-pod outage is not okay.
Canary Deployments Of New Models
When a new model version lands (say, OpenAI ships gpt-4.2), you want to roll it out the same way you'd roll out a new version of your own service. The Portkey load-balance config we showed above is exactly this: route 10% of traffic to the new model, watch quality and latency in Langfuse, then expand.
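Concretely, the canary is the load-balance block from the earlier config with the new model in the low-weight slot. A sketch, reusing the hypothetical gpt-4.2 name:
{
  "strategy": { "mode": "loadbalance" },
  "targets": [
    {
      "virtual_key": "openai-prod",
      "override_params": { "model": "gpt-4.1" },
      "weight": 0.9
    },
    {
      "virtual_key": "openai-prod",
      "override_params": { "model": "gpt-4.2" },
      "weight": 0.1
    }
  ]
}
Expanding the canary means editing the weights; rolling back means setting the new target's weight to zero.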
Define the success criteria explicitly before the canary. We typically look at:
- User satisfaction score (thumbs up/down) — must not regress.
- Task completion rate (does the user get what they asked for?) — must not regress.
- p95 latency — must not regress significantly.
- Cost per interaction — must not increase by more than 20%.
If any of these break, roll back by flipping a config value. No redeploy.
What Not To Put In The Gateway
Feature creep is the enemy. Things that belong elsewhere:
- Prompt templates. Those live with the feature code. The gateway should not know what your prompts are.
- Agent orchestration. Use Temporal, LangGraph, or similar. The gateway calls one model at a time.
- Business logic. The gateway makes an LLM call and returns. It doesn't decide whether to call one.
Keep the gateway boring and narrow. Narrow things don't break.
Next Steps
An LLM gateway is the first piece of AI platform infrastructure most teams should build after their first production feature. It pays for itself within a quarter through cost savings and lets you swap providers, add guardrails, and debug production issues without code changes. Portkey plus Langfuse gets you most of the way with two days of integration work. If you want help designing a gateway for your stack or migrating existing direct calls, get in touch.