Scaling and performance for software engineers
You and a few other engineers are building out Helpdesk AI. Companies sign up, upload their docs and help articles, and it hosts an AI support agent that answers their customers’ questions.
Under the hood it does the obvious thing: retrieve the relevant docs, hand them to a language model, and stream back an answer. The agent can also take actions like looking up an order or issuing a refund, every conversation gets logged, and we bill each company based on usage.
It all runs on a single server, and with a handful of design-partner companies, everything feels instant. Then a bigger customer signs on, traffic climbs, and the same chats that used to answer in half a second start taking eight.
Your code didn’t change. The load did.
That gap between “works for ten users” and “holds up for ten million” is what this post is about.
We’ll grow Helpdesk AI step by step and watch what breaks.
So what does scalability actually mean?
Two words get used interchangeably here, and they shouldn’t be.
Performance is how fast a single request is when the system is quiet. Scalability is whether the system holds up as you pile on more work, usually by adding resources. They’re related, but not the same thing, and the fix for one rarely fixes the other.
A system can be fast but not scalable: snappy for ten users, faceplants at ten thousand. It can also be slow but perfectly scalable: every request takes two seconds whether you have ten users or ten million.
The instinct when something feels slow is “add more servers,” but if a single request is slow because you’re doing redundant work inside it, ten servers will just be slow ten times in parallel. You have a performance problem wearing a scalability costume.
So before touching anything, figure out which one you actually have.
How do you know you have a problem?
You can’t fix what you can’t see, so the first move is measurement. Two numbers carry most of the weight.
Latency is how long one request takes. Throughput is how many requests you can handle per second. They trade off in ways that surprise people: a system can have lovely latency when idle and fall apart on throughput the moment real traffic shows up.
The trap is looking at averages. If your average chat response is 200 milliseconds, that sounds great, until you realize your p99 (the latency that the slowest 1 percent of requests meet or exceed) is eight seconds. That means one in every hundred requests takes eight seconds or longer, and at scale, one in a hundred is a lot of furious people. Tail latencies are where users actually feel the pain, so watch the percentiles, not the mean.
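To see how far the mean can drift from the tail, here is a minimal sketch using made-up latency samples. The `percentile` helper and the sample values are assumptions for illustration, and the nearest-rank method shown is only one of several common percentile definitions.

```python
import math
import statistics

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical chat latencies in milliseconds: 98 fast requests, 2 very slow ones.
latencies_ms = [120] * 98 + [8000] * 2

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f}ms p50={p50_ms}ms p99={p99_ms}ms")
```

The mean lands under 300 milliseconds and the median at 120, while the p99 sits at eight seconds. Only the percentile tells you someone is waiting.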
For Helpdesk AI we track two more numbers: time to first token, since we stream answers and nobody likes staring at a blank box, and dollars per conversation, because every model call costs real money and that bill scales right alongside our traffic.
Once you’re measuring, you can set targets. An SLI is the thing you measure (99th-percentile chat latency). An SLO is the goal you hold yourself to internally (95 percent of chats answer in under three seconds). An SLA is the promise you make to a customer, with penalties attached when you miss it. Set these early, because they tell you when a problem is worth fixing and, just as importantly, when it isn’t.
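An SLO like the one above reduces to a small check you can run over a window of measurements. This is a sketch only: the function name `slo_met`, the three-second threshold, and the 95 percent target mirror the example in the text rather than any real monitoring API.

```python
def slo_met(latencies_s, threshold_s=3.0, target=0.95):
    """Return True if at least `target` of requests finished in under `threshold_s` seconds."""
    if not latencies_s:
        return True  # no traffic in the window, nothing was violated
    within = sum(1 for latency in latencies_s if latency < threshold_s)
    return within / len(latencies_s) >= target

# Hypothetical windows of 100 chats each.
good_window = [0.5] * 96 + [8.0] * 4   # 96% under three seconds
bad_window = [0.5] * 90 + [8.0] * 10   # 90% under three seconds

print(slo_met(good_window), slo_met(bad_window))
```

The second window is the one worth paging someone about. The first, even with four slow chats in it, is inside the target and can wait.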
Why you shouldn’t reach for more servers first
When things get slow, distributing the work is usually the last thing you should try, not the first. The cheapest and biggest wins are almost always local, and they come in a rough order of leverage.
Start at the architecture level. Are you doing work you don’t need to do at all? If Helpdesk AI re-embeds a tenant’s entire knowledge base on every single question, that’s not a scaling problem, that’s a design mistake. Embed once, store it, reuse it. If a thousand customers ask the same question, calling the model a thousand times for the same answer is pure waste.
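“Embed once, store it, reuse it” can be sketched as a cache keyed by a hash of the chunk’s content. Everything here is illustrative: `EmbeddingCache` and `fake_embed` are hypothetical names, the in-memory dict stands in for whatever persistent store you would actually use, and the fake embedding function only exists to count calls.

```python
import hashlib

class EmbeddingCache:
    """Embed each distinct piece of text once and reuse the stored vector afterwards."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._vectors = {}  # content hash -> embedding (stand-in for a real store)

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._vectors:
            self._vectors[key] = self._embed_fn(text)  # the expensive call
        return self._vectors[key]

embed_calls = []

def fake_embed(text):
    """Hypothetical stand-in for a real embedding model call."""
    embed_calls.append(text)
    return [float(len(text))]

cache = EmbeddingCache(fake_embed)
for _ in range(1000):
    cache.get("How do I return an item?")

print(len(embed_calls))
```

Keying on the content hash means an unchanged document is never re-embedded, while an edited one gets a new key and is embedded again.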
Next, algorithms and data structures. The classic offender is the N+1 query, where you fetch a list and then loop over it firing one query per item:
# N+1: one database round trip per retrieved chunk
chunks = vector_store.search(query, k=10)
for c in chunks:
    doc = db.query("SELECT * FROM docs WHERE id = ?", c.doc_id)  # 10 trips
Ten network round trips where one would do. Fetch them together instead:
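ids = [c.doc_id for c in chunks]
placeholders = ", ".join("?" for _ in ids)  # one "?" per id; most drivers won't expand a list
docs = db.query(f"SELECT * FROM docs WHERE id IN ({placeholders})", ids)  # 1 trip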
This is also where picking the right vector index (a flat scan versus something like HNSW) decides whether retrieval takes milliseconds or seconds.
Below that sits the database, which we’ll spend real time on shortly, and only at the very bottom is infrastructure: throwing more or bigger machines at the problem. Each level up the list is cheaper and higher-leverage than the one below it. Exhaust the easy wins before you reach for hardware.
Scaling up versus scaling out
Say you’ve done all that and you genuinely need more capacity. You have two directions to go.
Scaling up, or vertical scaling, means a bigger box: more CPU, more memory. It’s wonderfully simple because your code doesn’t change at all. But there’s a ceiling (eventually you’re renting the largest machine that exists), and it’s still one machine, so when it dies, everything dies with it.
Scaling out, or horizontal scaling, means more boxes instead of bigger ones. The capacity is effectively unlimited and you survive any single machine failing, but now you need something sitting in front to spread requests across them. That something is a load balancer.
Going horizontal forces one discipline on you: your app servers have to be stateless. If you stash a customer’s chat session in a server’s local memory, their next message might land on a different server that’s never heard of them. Push that state into a shared store everyone can reach.
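A minimal sketch of what stateless looks like in practice: two app servers that hold no session data of their own and share one store. `SharedSessionStore` is a hypothetical in-memory stand-in for an external database or key-value service, and `AppServer` is an assumed name, not anything from a real framework.

```python
class SharedSessionStore:
    """Stand-in for an external store that every app server can reach."""

    def __init__(self):
        self._sessions = {}

    def append(self, session_id, message):
        self._sessions.setdefault(session_id, []).append(message)

    def history(self, session_id):
        return list(self._sessions.get(session_id, []))

class AppServer:
    """Keeps nothing in local memory between requests, so any server can take any message."""

    def __init__(self, name, store):
        self.name = name
        self._store = store

    def handle(self, session_id, message):
        self._store.append(session_id, message)
        return self._store.history(session_id)

store = SharedSessionStore()
server_a = AppServer("a", store)
server_b = AppServer("b", store)

server_a.handle("chat-1", "Where is my order?")
history = server_b.handle("chat-1", "It was due yesterday.")
print(history)
```

The second message lands on a different server and still sees the whole conversation, because the conversation never lived on a server in the first place.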
The load balancer itself can operate at layer 4, routing on basic network info like IP and port (fast), or layer 7, where it can see the HTTP request and route on richer detail (smarter, slightly slower). It spreads load using simple strategies: round robin (take turns), least-connections (favor the least busy server), or hashing on something like tenant ID so a given tenant consistently lands on the same server.
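The three strategies fit in a few lines each. This is a toy sketch, not how a production load balancer is built: the class names, the server names, and the choice of SHA-256 for hashing are all assumptions for illustration.

```python
import hashlib
import itertools

class RoundRobin:
    """Hand out servers in turn."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)

class LeastConnections:
    """Favor the server with the fewest in-flight requests."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)  # ties go to the first listed
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1  # call when the request finishes

def pick_by_tenant(tenant_id, servers):
    """Hash the tenant ID so a given tenant consistently lands on the same server."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]

servers = ["app-1", "app-2", "app-3"]  # hypothetical server names

rr = RoundRobin(servers)
print([rr.pick() for _ in range(4)])

lc = LeastConnections(servers)
lc.pick(); lc.pick(); lc.pick()   # one request on each server
lc.release("app-2")               # app-2 finishes first
print(lc.pick())

print(pick_by_tenant("acme", servers) == pick_by_tenant("acme", servers))
```

A stable hash such as SHA-256 is used instead of Python’s built-in `hash()`, which is randomized per process for strings and would route the same tenant differently on each balancer.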
Why the database becomes the bottleneck
Once app servers are cheap to add, the pressure moves to the thing they all share. Every server talks to the same database, and it can’t be cloned away as casually. Fixing it is a ladder you climb one rung at a time.
The first rung is query optimization. Add the index that turns a full-table scan into an instant lookup, and kill any N+1s still hiding in your code. It’s the cheapest rung and you should never skip it.
The second is read replicas. Most workloads read far more than they write, and Helpdesk AI is no exception: the analytics dashboard and conversation history are almost all reads. So you keep copies of the database, send reads that can tolerate slightly old data to the replicas, and reserve the primary for writes and for the few reads that can’t.
The catch is replication lag: replicas trail the primary by a fraction of a second, so a row you just wrote might not show up on a replica immediately. That’s fine for a dashboard, and a problem if you read data you depend on having just written.
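One common way to live with replication lag is to route a user’s reads to the primary for a short window after that user writes, and to the replicas otherwise. The sketch below assumes that approach. `ReadRouter`, the one-second window, and the `"primary"`/`"replica"` labels are all hypothetical, and the window should really be chosen from your measured lag.

```python
import time

class ReadRouter:
    """Send a user's reads to the primary for a short window after they write."""

    def __init__(self, pin_seconds=1.0, clock=time.monotonic):
        self._pin_seconds = pin_seconds
        self._clock = clock          # injectable so the behavior can be tested
        self._last_write = {}        # user_id -> time of most recent write

    def record_write(self, user_id):
        self._last_write[user_id] = self._clock()

    def target_for_read(self, user_id):
        last = self._last_write.get(user_id)
        if last is not None and self._clock() - last < self._pin_seconds:
            return "primary"         # replicas may not have the row yet
        return "replica"

router = ReadRouter(pin_seconds=1.0)
router.record_write("user-7")
print(router.target_for_read("user-7"))   # just wrote, so read from the primary
print(router.target_for_read("user-8"))   # no recent write, a replica is fine
```

The dashboard never calls `record_write`, so it always reads from replicas, which is exactly the split the paragraph above describes.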
The third rung, and the highest-leverage one for us, is caching. Because every model call is slow and expensive, caching retrieved chunks and even whole answers to common questions is a massive win on both latency and cost.
Keeping the cache in sync with the database comes in a few flavors. With cache-aside, the app checks the cache, and on a miss it reads the database and populates the cache itself. With write-through, every write goes to the cache and the database together, so the cache doesn’t fall behind. With write-back, you write to the cache and flush to the database later, which is fast but risks losing data if the cache crashes before the flush.
Whichever you pick, two problems stay hard: invalidation (when a tenant updates their docs, every cached answer built on the old docs is now wrong) and staleness (how out-of-date you’re willing to be).
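Cache-aside, with a time-to-live to bound staleness and an explicit hook for invalidation, might look like the sketch below. It is illustrative only: `AnswerCache`, its method names, and the in-memory dict are assumptions, and a real deployment would put this in a shared cache rather than one process’s memory.

```python
import time

class AnswerCache:
    """Cache-aside: check the cache, compute on a miss, store the result with a TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # (tenant_id, question) -> (answer, expires_at)

    def get_or_compute(self, tenant_id, question, compute):
        key = (tenant_id, question)
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[1] > now:
            return hit[0]                           # fresh hit, no model call
        answer = compute()                          # miss or expired: do the slow work
        self._entries[key] = (answer, now + self._ttl)
        return answer

    def invalidate_tenant(self, tenant_id):
        """Drop every cached answer for a tenant, e.g. after they update their docs."""
        for key in [k for k in self._entries if k[0] == tenant_id]:
            del self._entries[key]

model_calls = []

def ask_model():
    """Hypothetical stand-in for retrieval plus a model call."""
    model_calls.append(1)
    return "Returns are free within 30 days."

cache = AnswerCache(ttl_seconds=300)
cache.get_or_compute("acme", "What is the return policy?", ask_model)
cache.get_or_compute("acme", "What is the return policy?", ask_model)
print(len(model_calls))          # the second call was served from cache

cache.invalidate_tenant("acme")  # acme re-uploaded their docs
cache.get_or_compute("acme", "What is the return policy?", ask_model)
print(len(model_calls))
```

The TTL is the staleness knob and `invalidate_tenant` is the invalidation hook. Keying on the exact question string is the simplest possible choice and will miss paraphrases.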
The last rung is sharding. When even the primary can’t keep up with writes, and remember that every chat message is a write, you split the data across multiple databases. Sharding by tenant ID is natural here: each company’s data lives on one shard, and queries for that company hit exactly one box. It works, but it buys you new headaches, like queries that need data from many tenants and the occasional need to rebalance shards.
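Routing by tenant ID can be as small as a stable hash modulo the shard count. The sketch below assumes that naive scheme precisely because it makes the rebalancing headache visible: `shard_for` and the tenant names are hypothetical, and schemes such as consistent hashing or a tenant-to-shard lookup table exist to avoid what it demonstrates.

```python
import hashlib

def shard_for(tenant_id, shard_count):
    """Map a tenant to a shard index using a hash that is stable across processes."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count

# Every query for one tenant goes to exactly one shard.
print(shard_for("acme", 4) == shard_for("acme", 4))

# The catch: grow from 4 shards to 5 and count how many tenants change shards.
tenants = [f"tenant-{i}" for i in range(1000)]
moved = sum(1 for t in tenants if shard_for(t, 4) != shard_for(t, 5))
print(f"{moved} of {len(tenants)} tenants would have to move")
```

With plain modulo hashing you should expect roughly four in five tenants to move when going from four shards to five, since a hash keeps its shard only when its remainders modulo 4 and modulo 5 happen to agree.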
Going distributed isn’t free
Step back and notice what just happened. You began with one tidy box where everything was simple, and you now have multiple app servers, read replicas, a cache, and sharded databases. You’re running a distributed system, whether you set out to or not. And distributed systems quietly revoke a guarantee you never knew you were leaning on: that everyone sees the same data at the same instant.
CAP theorem is the blunt version. When the network between your nodes breaks, and over enough time it will, you get to keep consistency (everyone sees the latest data, but some requests fail) or availability (every request gets an answer, but it might be stale). During that partition you cannot have both. PACELC adds the part people forget: even when there’s no partition, you’re still trading latency against consistency. Waiting for every replica to confirm a write is slower but safe; not waiting is faster but can serve stale reads.
This sounds abstract until you map it onto Helpdesk AI. When the agent issues a refund, money is on the line: the refund must commit exactly once and nothing may act on a stale read, so you want strong consistency. The dashboard that says “1,204 conversations today” can be a few seconds behind and nobody will ever notice, so eventual consistency is fine and cheaper. Usage billing has to be exact. The lesson is that you choose a consistency model per kind of data (strong, eventual, read-your-writes), not once for the whole system.
Why you should stop doing work synchronously
A lot of work simply doesn’t need to happen while the customer waits. When a tenant uploads their entire knowledge base, embedding all of it might take minutes. You’d never make their upload request hang that whole time. Instead you drop a job onto a queue, hand the work to a background worker, and return right away.
The mental shift is from “do it now while you wait” to “promise to do it, and do it soon.” The toolkit here is broad. A message queue holds work items for workers to process. Pub/sub broadcasts one event to many interested subscribers. Event-driven architecture has services react to events rather than calling each other directly, and within that you choose between choreography (each service listens and reacts on its own, with no central conductor) and orchestration (a coordinator explicitly drives each step). Further out you’ll run into event sourcing (storing the stream of events as the source of truth) and CQRS (separating the write model from the read model), plus streaming for continuous high-volume flows.
Beyond keeping requests snappy, this absorbs spikes: a sudden flood of uploads just makes the queue longer, and your workers catch up. It also lets you scale the slow part, the embedding workers, independently of everything else.
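The upload flow can be sketched with Python’s standard `queue` and `threading` modules. The in-process queue is only a stand-in for a real message broker, and `handle_upload`, `embed_worker`, and the fake embedding string are hypothetical names for illustration.

```python
import queue
import threading

jobs = queue.Queue()   # stand-in for a durable message queue
embedded = []          # stand-in for the vector store

def embed_worker():
    """Background worker: pull one job at a time and do the slow part."""
    while True:
        tenant_id, doc = jobs.get()
        embedded.append((tenant_id, f"embedding-of:{doc}"))  # pretend this takes a while
        jobs.task_done()

def handle_upload(tenant_id, docs):
    """Request handler: enqueue the work and return right away."""
    for doc in docs:
        jobs.put((tenant_id, doc))
    return {"status": "accepted", "queued": len(docs)}

threading.Thread(target=embed_worker, daemon=True).start()

response = handle_upload("acme", ["faq.md", "returns.md", "shipping.md"])
print(response)        # returned without waiting for any embedding

jobs.join()            # only for the demo: wait until the worker has caught up
print(len(embedded))
```

A spike of uploads only makes `jobs` longer, and adding capacity means starting more worker threads or processes without touching the request handler.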
Surviving load and failure
More moving parts means more things that can break, so you design for failure instead of hoping it stays away. A handful of patterns do most of the work:
- Timeouts, so one hung model call doesn’t freeze the whole request.
- Retries with exponential backoff and jitter, so a struggling service isn’t hammered by everyone retrying in lockstep.
- Circuit breakers, which stop calling a service that’s clearly down and give it room to recover.
- Bulkheads, which isolate failures so one drowning tenant doesn’t sink the rest.
- Fallbacks, like handing back a polite “let me get a human” when the model is unavailable.
- Idempotency, so a retried refund issues one refund, not two.
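Two of these patterns only work as a pair: retries are safe when the operation being retried is idempotent. The sketch below shows exponential backoff with full jitter alongside an idempotency key. `retry_with_backoff`, `RefundService`, and the key format are hypothetical, and the in-memory dict stands in for a durable record of processed keys.

```python
import random
import time

def retry_with_backoff(call, attempts=4, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Retry on timeout, sleeping a random amount up to an exponentially growing cap."""
    for attempt in range(attempts):
        try:
            return call()
        except TimeoutError:
            if attempt == attempts - 1:
                raise                                  # out of attempts, give up
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))              # jitter spreads retries apart

class RefundService:
    """Issues at most one refund per idempotency key, however often it is called."""

    def __init__(self):
        self.issued = {}  # idempotency key -> refund record

    def refund(self, idempotency_key, order_id, amount_cents):
        if idempotency_key not in self.issued:
            self.issued[idempotency_key] = {"order_id": order_id, "amount_cents": amount_cents}
        return self.issued[idempotency_key]

service = RefundService()
calls = {"count": 0}
delays = []

def flaky_refund():
    """Simulates a refund that succeeds on the server but whose first response is lost."""
    calls["count"] += 1
    record = service.refund("refund:order-42", "order-42", 1999)
    if calls["count"] == 1:
        raise TimeoutError("response lost after the refund was recorded")
    return record

result = retry_with_backoff(flaky_refund, sleep=delays.append)  # record delays instead of sleeping
print(calls["count"], len(service.issued))
```

The call ran twice, but only one refund exists, because the retry reused the same key. Without the key, the retry that rescued the request would also have paid the customer twice.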
You also protect the system on purpose with rate limiting, capping how much any single tenant can throw at you. That keeps one noisy customer from starving everyone else, and it caps your own model bill too. The usual algorithms are fixed window (count requests per minute, reset each minute), sliding window log (timestamp every request for an exact count, at the cost of memory), sliding window counter (a cheaper approximation of that), token bucket (tokens refill at a steady rate and let you absorb short bursts), and leaky bucket (requests drain at a fixed rate, smoothing bursts out). When a caller hits the limit you return an HTTP 429 and tell them when to try again.
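Of the algorithms listed, token bucket is short enough to sketch in full. This is one possible shape under stated assumptions: `TokenBucket` and `check_rate_limit` are hypothetical names, the status code and `Retry-After` header follow the HTTP behavior described above, and in practice you would keep one bucket per tenant in a shared store.

```python
import math
import time

class TokenBucket:
    """Tokens refill at a steady rate up to a cap; each request spends one token."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)   # start full so short bursts are absorbed
        self._updated = clock()

    def try_acquire(self):
        """Return (allowed, seconds until a token will be available)."""
        now = self._clock()
        elapsed = now - self._updated
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._updated = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True, 0.0
        return False, (1 - self._tokens) / self.rate

def check_rate_limit(bucket):
    """Translate the bucket's decision into an HTTP status and headers."""
    allowed, retry_after = bucket.try_acquire()
    if allowed:
        return 200, {}
    return 429, {"Retry-After": str(math.ceil(retry_after))}

bucket = TokenBucket(rate_per_sec=1, capacity=2)
statuses = [check_rate_limit(bucket)[0] for _ in range(3)]
print(statuses)   # a burst of two is allowed, the third request is limited
```

The capacity sets how big a burst you tolerate and the rate sets the sustained ceiling, which is also the ceiling on what that tenant can add to your model bill.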
Eventually the single codebase itself, the monolith, gets hard to scale, both technically and across a growing team. Splitting Helpdesk AI into independent services is one answer: a retrieval service, an inference gateway sitting in front of the model, an actions service for tools, and a billing service.
Each service deploys and scales on its own (the inference gateway needs far more capacity than billing ever will), and a single team can own one end to end. They talk to each other synchronously over REST or gRPC when they need an answer right now, or asynchronously over events when they don’t. But none of this is free, and you should not reach for it on day one. You trade in-process function calls for network calls that fail, one database for many, and easy local debugging for distributed tracing. Microservices solve an organizational and scaling problem. If you don’t have that problem yet, a well-structured monolith will carry you remarkably far.
The real lesson
Look back at the path. You started with one box and a clean mental model, and at no point did you add complexity for its own sake. Each piece showed up because growth made something specific break, and you fixed exactly that.
That’s the whole idea. Scaling isn’t a checklist you apply up front, it’s a sequence of trade-offs you make under real pressure, measured against targets you set deliberately. Measure first, exhaust the cheap local wins, and only distribute further when your current setup runs out of room. Every tool here makes sense the moment the problem it solves shows up.