Building LLM APIs in 2025: What Nobody Tells You

By mid-2025, the "just wrap GPT in an API" era was mercifully over. The demo-to-production gap turned out to be enormous, and I spent a good chunk of the year learning that the hard way.

Building APIs that talk to LLMs is fundamentally different from anything I'd built before. Not harder, necessarily - just weird in ways that break your existing mental models.

Latency is a different animal

My brain is wired for the traditional API world where 200ms is slow. LLMs take seconds. Sometimes 10+ seconds. Your API gateway, your load balancer, your client timeout config - all of it assumes responses come back fast. They don't.

Streaming (SSE, WebSockets) isn't a nice-to-have. It's mandatory. I learned this after building a sync endpoint, watching it work great in development, and then seeing users stare at a blank screen for 8 seconds in production. That was a fun afternoon.
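
Roughly the shape that streaming endpoint took - a minimal sketch assuming FastAPI and the OpenAI Python SDK; the route and model name are placeholders, not what I actually shipped:

```python
# Minimal SSE streaming sketch: FastAPI + OpenAI SDK (both assumptions here).
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI()

@app.post("/chat/stream")
async def chat_stream(payload: dict):
    def event_stream():
        # stream=True yields deltas as the model generates them
        stream = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model
            messages=[{"role": "user", "content": payload["prompt"]}],
            stream=True,
        )
        for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:
                # SSE frame format: "data: <text>\n\n"
                yield f"data: {delta}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```

The point isn't the framework. It's that tokens reach the user as they're generated instead of after an 8-second wall of silence.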

For anything that doesn't need an immediate response - document summarization, batch processing - I moved to async workflows with SQS and worker pools. Let the main API stay responsive, let the heavy lifting happen in the background. Classic stuff, but easy to forget when you're caught up in the LLM hype.
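
The hand-off looks something like this - a sketch assuming boto3 and SQS, with a made-up queue URL and a summarize_document() stand-in for the actual LLM call:

```python
# Async hand-off sketch: the API enqueues, a worker pool does the slow part.
import json
import uuid
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/llm-jobs"  # placeholder

def enqueue_summary_job(document_id: str) -> str:
    """Called from the API handler: accept fast, process later."""
    job_id = str(uuid.uuid4())
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"job_id": job_id, "document_id": document_id}),
    )
    return job_id  # client polls a status endpoint for the result

def worker_loop():
    """Runs in a separate worker process, not in the API."""
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            job = json.loads(msg["Body"])
            summarize_document(job["document_id"])  # stand-in for the slow LLM call
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```

The API handler only ever calls enqueue_summary_job(); the worker loop can take as long as the model needs without anyone's request hanging.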

The context window tax

Even with models supporting massive context windows, you don't just dump everything in and pray. I mean, you can, but your bill will be spectacular.

I ended up building RAG pipelines - query a vector database for the relevant chunks, inject them into the prompt, keep the context tight. It's not elegant. It's plumbing. But it works, and it keeps your per-request cost from looking like a phone number.
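
The pipeline itself is short. Here's a sketch - vector_store is a hypothetical client with a .search(embedding, top_k) method, so swap in whatever vector database you actually use:

```python
# Minimal RAG sketch: embed, retrieve, answer with a tight prompt.
from openai import OpenAI

client = OpenAI()

def answer_with_rag(question: str, vector_store, top_k: int = 5) -> str:
    # 1. Embed the question
    emb = client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding

    # 2. Retrieve only the relevant chunks, not the whole corpus
    chunks = vector_store.search(embedding=emb, top_k=top_k)  # hypothetical API
    context = "\n\n".join(chunk.text for chunk in chunks)      # hypothetical .text field

    # 3. Keep the prompt tight: retrieved context + question, nothing else
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```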

Session management is the other piece. LLMs are stateless. Every request is a blank slate. So your API layer has to maintain the conversation history (I used Redis for this), decide how many turns to keep, and trim intelligently. Get this wrong and your chatbot either has amnesia or costs you $50 per conversation.
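
In practice that's just a Redis list per session, trimmed on every write. A sketch - the turn cap and TTL are illustrative numbers, not recommendations:

```python
# Conversation history in Redis: append, trim to the last N turns, expire idle sessions.
import json
import redis

r = redis.Redis(decode_responses=True)
MAX_TURNS = 20          # keep the last N messages (illustrative)
SESSION_TTL = 60 * 60   # expire idle sessions after an hour (illustrative)

def append_turn(session_id: str, role: str, content: str) -> None:
    key = f"chat:{session_id}"
    r.rpush(key, json.dumps({"role": role, "content": content}))
    r.ltrim(key, -MAX_TURNS, -1)   # drop the oldest turns beyond the cap
    r.expire(key, SESSION_TTL)

def load_history(session_id: str) -> list[dict]:
    # This becomes the messages array sent to the model on each request
    return [json.loads(m) for m in r.lrange(f"chat:{session_id}", 0, -1)]
```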

Rate limiting by tokens, not requests

Traditional rate limiting counts requests. LLM APIs need to count tokens. A user sending one massive prompt can cost more than a thousand small ones. I built a token-based rate limiter backed by Redis that tracks consumption per API key over sliding windows. It's not complex, but it's the kind of thing that doesn't exist in any off-the-shelf solution because the LLM API space is still too young.
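
The core of it is a sorted set per API key, with token counts summed over the window. A sketch - the window size, budget, and member encoding are all illustrative choices, and the caller passes in the request's token count from whatever tokenizer matches the model:

```python
# Token-based sliding-window limiter on Redis (numbers are made up).
import time
import uuid
import redis

r = redis.Redis(decode_responses=True)
WINDOW_SECONDS = 60
TOKEN_BUDGET = 100_000  # tokens per key per window

def check_and_record(api_key: str, tokens: int) -> bool:
    """Return True if the request fits in the caller's token budget."""
    key = f"tokens:{api_key}"
    now = time.time()

    # Drop entries that have slid out of the window
    r.zremrangebyscore(key, 0, now - WINDOW_SECONDS)

    # Sum token counts encoded in the remaining members ("<uuid>:<tokens>")
    used = sum(int(m.rsplit(":", 1)[1]) for m in r.zrange(key, 0, -1))
    if used + tokens > TOKEN_BUDGET:
        return False

    r.zadd(key, {f"{uuid.uuid4()}:{tokens}": now})
    r.expire(key, WINDOW_SECONDS)
    return True
```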

I also added prompt caching - hash the prompt, check if you've seen it recently, return the cached response. (True semantic caching would match on embedding similarity rather than exact hashes; the hash version is the simple starting point.) Simple optimization, surprisingly effective for apps where multiple users ask similar questions.
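
In its simplest form it's just this - the key prefix and TTL are arbitrary, and generate stands in for the real model call:

```python
# Exact-match prompt cache: hash the normalized prompt, serve repeats from Redis.
import hashlib
import redis

r = redis.Redis(decode_responses=True)
CACHE_TTL = 15 * 60  # seconds (arbitrary)

def cached_completion(prompt: str, generate) -> str:
    """generate(prompt) is whatever function actually calls the model."""
    key = "llmcache:" + hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return hit
    response = generate(prompt)
    r.setex(key, CACHE_TTL, response)
    return response
```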

The agentic part

The most interesting (and chaotic) work was building orchestration layers for agentic workflows. Users don't just want Q&A anymore. They want the API to do things - query databases, call external services, make decisions, chain reasoning steps together.

This is where MCP and tool calling patterns come in. You're essentially building a control plane for an autonomous agent, and the engineering challenge shifts from "make it work" to "make it work safely." Constraining what an agent can do, logging every action, building kill switches - this is the real work, and it's more system design than ML.
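
The skeleton of that control plane is a constrained tool loop: an allowlist of tools, an audit log for every call, and a hard step cap as a crude kill switch. A sketch assuming the OpenAI tool-calling API - the tool name and the run_readonly_sql helper are purely illustrative:

```python
# Constrained tool-calling loop: allowlist, audit log, hard step cap.
import json
import logging
from openai import OpenAI

client = OpenAI()
log = logging.getLogger("agent")

# Allowlist: anything not in this dict simply cannot be executed.
ALLOWED_TOOLS = {
    "query_orders": lambda args: run_readonly_sql(args["sql"]),  # hypothetical helper
}

def run_agent(messages: list[dict], tools_schema: list[dict], max_steps: int = 5) -> str:
    for step in range(max_steps):  # hard cap = kill switch for runaway loops
        resp = client.chat.completions.create(
            model="gpt-4o-mini", messages=messages, tools=tools_schema  # placeholder model
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content  # model is done

        messages.append(msg)
        for call in msg.tool_calls:
            name = call.function.name
            args = json.loads(call.function.arguments)
            log.info("agent step=%d tool=%s args=%s", step, name, args)  # audit trail
            if name not in ALLOWED_TOOLS:
                result = f"error: tool '{name}' is not permitted"
            else:
                result = str(ALLOWED_TOOLS[name](args))
            messages.append(
                {"role": "tool", "tool_call_id": call.id, "content": result}
            )
    return "stopped: step limit reached"
```

None of this is clever. That's the point - the safety properties come from boring constraints wrapped around the model, not from the model itself.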

Honestly, building LLM APIs made me a better engineer in general. It forced me to think about failure modes I'd never considered, and to be way more intentional about system design upfront. The models are impressive, but the infrastructure around them is where the actual engineering lives.