How we scaled our platform to 50 million agent runs daily
A deep dive into the infrastructure decisions that let us handle massive scale without breaking a sweat — or the bank.
Written by David Park

Last month, we hit a milestone: 50 million agent runs processed in a single day. Zero downtime. Average latency under 100ms. And our infrastructure costs actually went down compared to the previous quarter.
This post is a behind-the-scenes look at how we got here.
When we started, like most startups, we ran everything on a single Kubernetes cluster. It worked fine for our first few customers. But as usage grew, we started hitting walls. Cold starts were killing our latency. Scaling was reactive, not predictive. And our AWS bill was becoming a recurring nightmare in our board meetings.
The first big change was moving to a multi-region architecture. We now run inference workloads across 12 regions globally, automatically routing requests to the nearest healthy cluster. This alone cut our p99 latency by 60%. Users in Singapore were no longer waiting for round-trips to us-east-1.
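As a rough illustration of "nearest healthy cluster" routing, here is a minimal sketch. It is not the production router: the `Region` type, the `pick_region` function, and the latency figures are all hypothetical, chosen only to show the selection rule (skip unhealthy regions, then take the lowest measured latency).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    latency_ms: float  # round-trip time seen by the caller (illustrative value)
    healthy: bool      # result of the latest health check (assumed to exist)


def pick_region(regions):
    """Return the healthy region with the lowest latency.

    Raises RuntimeError when no region is healthy.
    """
    candidates = [r for r in regions if r.healthy]
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=lambda r: r.latency_ms)


# A caller near Singapore: the closest region is down, so the request
# goes to the next-closest healthy one instead of us-east-1.
regions = [
    Region("us-east-1", 230.0, True),
    Region("ap-southeast-1", 12.0, False),
    Region("ap-northeast-1", 70.0, True),
]
chosen = pick_region(regions)
print(chosen.name)
```

The health check is what keeps the "nearest" rule safe: without it, a regional outage would pull in all nearby traffic rather than shed it.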
The second change was rethinking how we handle bursty workloads. AI agents are inherently unpredictable. A single customer might go from 100 requests per minute to 10,000 in seconds. Reactive auto-scaling, which adds capacity only after load has already risen, couldn't keep up: by the time new instances spun up, the burst was over.
Our solution was predictive scaling based on historical patterns combined with aggressive warm pooling. We maintain a reserve of pre-warmed instances that can absorb traffic spikes instantly. The system learns each customer's usage patterns and pre-provisions capacity before they need it. It sounds expensive, but it's actually cheaper than reactive scaling because we waste fewer resources on cold starts.
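To make the idea concrete, here is a simplified sketch of sizing a warm pool from historical usage. It assumes a much cruder forecast than a real system would use: the peak demand seen in the same time slot on previous days, plus a fixed headroom. The function name `warm_pool_target`, the per-instance capacity of 500 requests per minute, and the sample history are all assumptions made for the example.

```python
def warm_pool_target(history, slot, rpm_per_instance, headroom_pct=20, floor=1):
    """Number of instances to keep pre-warmed for a given time slot.

    history: list of days, each a list of requests-per-minute per slot.
    The forecast is the peak seen in this slot on past days, padded by
    headroom_pct percent. Integer arithmetic avoids float rounding.
    """
    peak = max(day[slot] for day in history)
    needed = peak * (100 + headroom_pct)
    capacity = 100 * rpm_per_instance
    instances = -(-needed // capacity)  # ceiling division
    return max(floor, instances)


# Three past days for one hypothetical customer, four time slots each.
# Slot 2 is the recurring burst (roughly 100 rpm jumping to 10,000 rpm).
history = [
    [100, 120, 9000, 150],
    [110, 100, 10000, 140],
    [90, 130, 9500, 160],
]
targets = [warm_pool_target(history, s, rpm_per_instance=500) for s in range(4)]
print(targets)  # [1, 1, 24, 1]
```

The pool only grows ahead of the slot where the burst has historically occurred, which is why pre-provisioning does not have to mean paying for peak capacity around the clock.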
The third change was optimizing our inference layer. We built a custom request batching system that groups similar queries together, maximizing GPU utilization without sacrificing latency. We also implemented speculative execution for multi-step agent workflows — starting likely next steps before the current step completes.
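The batching idea can be sketched as follows. This is an illustrative toy, not the custom system described above: the `Batcher` class, the grouping key (for example, a model name), and the size and wait limits are all hypothetical. It shows the usual trade-off: a batch is released when it is full (for GPU utilization) or when its oldest request has waited too long (to bound latency).

```python
from collections import defaultdict


class Batcher:
    def __init__(self, max_size, max_wait_ms):
        self.max_size = max_size
        self.max_wait_ms = max_wait_ms
        self._pending = defaultdict(list)  # key -> queued payloads
        self._oldest = {}                  # key -> arrival time of oldest payload

    def add(self, key, payload, now_ms):
        """Queue a request. Return (key, payloads) if this filled a batch, else None."""
        self._pending[key].append(payload)
        self._oldest.setdefault(key, now_ms)
        if len(self._pending[key]) >= self.max_size:
            return self._flush(key)
        return None

    def flush_expired(self, now_ms):
        """Return every batch whose oldest request has waited max_wait_ms or more."""
        expired = [k for k, t in self._oldest.items()
                   if now_ms - t >= self.max_wait_ms]
        return [self._flush(k) for k in expired]

    def _flush(self, key):
        del self._oldest[key]
        return key, self._pending.pop(key)


b = Batcher(max_size=2, max_wait_ms=5)
first = b.add("model-a", "q1", now_ms=0)   # queued, batch not full yet
second = b.add("model-b", "q2", now_ms=1)  # different key, separate batch
full = b.add("model-a", "q3", now_ms=2)    # fills the model-a batch
late = b.flush_expired(now_ms=6)           # model-b has waited 5 ms
print(full, late)
```

Timestamps are passed in explicitly so the example is deterministic; a real service would drive `flush_expired` from a timer.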
Our infrastructure now handles 50M+ daily runs with 99.99% uptime. Median latency is 42ms. And our cost per request is a third of what it was when we started.
We'll be open-sourcing parts of this infrastructure in the coming months. If you're interested in early access, drop us a line.
