#hosting #devops #wexron-hosting #behind-the-scenes

Running 200+ production workloads from a Calicut office

Wexron Hosting runs just over 200 production workloads for clients across four continents. The ops team is three people. We sleep through most nights. Here's roughly how that works — and the boring decisions that make it possible.

Harshita Raj
January 22, 2026 · 5 min read

When I tell people we run 200+ production workloads with a three-person ops team in Kerala, the usual reaction is a polite “that sounds stressful.” It used to be. It hasn’t been for a while. The reason isn’t that we got smarter or that the workloads got smaller — it’s that we made a few unglamorous decisions early that keep paying dividends every day.

This is a brain dump of the most important ones. None of them are novel. All of them are boring. Taken together they mean I don’t get paged at 3 AM about somebody’s WordPress cache.

1. Uniformity is worth more than customisation

The single biggest force multiplier: every workload runs on the same stack. Nginx, a PHP-FPM pool or a Node process, a Postgres or MySQL database, a Redis instance, a Cloudflare-fronted domain. If a customer wants something weird — their own Dockerfile, a custom nginx config, a PHP version we don’t ship by default — we quote a premium and we mean it.

This sounds inflexible, and it is. It’s also what makes the math work. When every server looks the same, any engineer can debug any incident. When an alert fires, the runbook is the same. When we roll out a security patch, we patch once. When we need to migrate a customer to a bigger node, it’s an rsync and a DNS flip, not a project.

Customisation is a tax you pay every time something goes wrong. We charge for it so we can afford to not have it by default.
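
For the curious, here’s roughly what “an rsync and a DNS flip” looks like when every node is laid out identically. This is a sketch, not our actual tooling: the paths, hostnames, and record IDs are made up, and the DNS call assumes a Cloudflare-fronted domain via their v4 API.

```python
#!/usr/bin/env python3
"""Sketch of "an rsync and a DNS flip", run on the *new* node.

Everything here is illustrative: paths, hostnames, and record IDs are
made up, and the DNS step assumes a Cloudflare-fronted domain.
"""
import os
import subprocess

import requests

SITE_ROOT = "/srv/sites/example-customer"  # same path on every node: uniformity pays off here
OLD_NODE = "node-07.internal"              # hypothetical names
NEW_NODE_IP = "203.0.113.12"

ZONE_ID = os.environ["CF_ZONE_ID"]         # hypothetical env vars
RECORD_ID = os.environ["CF_RECORD_ID"]
TOKEN = os.environ["CF_API_TOKEN"]

# 1. Pull the site content from the old node. -a preserves permissions and
#    timestamps, -z compresses in transit, --delete keeps the copy exact.
subprocess.run(
    ["rsync", "-az", "--delete", f"{OLD_NODE}:{SITE_ROOT}/", f"{SITE_ROOT}/"],
    check=True,
)

# 2. Flip the A record to this node. With Cloudflare proxying in front,
#    clients never see the origin change.
resp = requests.put(
    f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/dns_records/{RECORD_ID}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"type": "A", "name": "example-customer.com",
          "content": NEW_NODE_IP, "ttl": 60, "proxied": True},
    timeout=10,
)
resp.raise_for_status()
print("DNS flipped; drain the old node once traffic settles.")
```

The database moves separately (dump and restore, or a replica promotion), but the shape is the same: identical nodes mean there is no per-customer puzzle to solve before you start.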

2. The bastion is the only way in

Nothing in our fleet accepts SSH connections from the internet. Every engineer connects through a single bastion host with MFA enforced and IP allow-lists per team member. From there, a short-lived certificate grants access to the target for exactly the duration of the session.

This sounds like security theatre until the first time somebody tries to brute-force a server and gets absolutely nowhere. It also means onboarding a new engineer is a five-minute task (add their key to the bastion, grant a role) and offboarding is ten seconds (revoke the cert). We’ve never had to rotate SSH keys across 200 servers because nobody has SSH keys on 200 servers.
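
The mechanics are plain OpenSSH, nothing exotic. Here’s a sketch of what the bastion-side issuance can look like with an `ssh-keygen`-signed user CA; the CA path, validity window, and principal names are illustrative.

```python
#!/usr/bin/env python3
"""Sketch of bastion-side certificate issuance with an OpenSSH user CA.

CA path, validity window, and principal naming are illustrative. Each
fleet host only needs TrustedUserCAKeys in sshd_config pointed at the
CA's public key.
"""
import subprocess
import sys

CA_KEY = "/etc/ssh/user_ca"  # hypothetical CA private key; never leaves the bastion
VALIDITY = "+4h"             # the whole trick: certs expire on their own

def issue_cert(pubkey_path: str, engineer: str, role: str) -> None:
    """Sign the engineer's public key; ssh-keygen writes <name>-cert.pub beside it."""
    subprocess.run(
        [
            "ssh-keygen",
            "-s", CA_KEY,                 # sign with the user CA
            "-I", f"{engineer}-session",  # key identity, logged by sshd on every host
            "-n", role,                   # principal the target host must allow
            "-V", VALIDITY,               # valid from now, for four hours
            pubkey_path,
        ],
        check=True,
    )

if __name__ == "__main__":
    # usage: issue_cert.py ~/.ssh/id_ed25519.pub harshita webnode-admin
    issue_cert(sys.argv[1], engineer=sys.argv[2], role=sys.argv[3])
```

Because every certificate carries its own expiry, there is nothing per-server to install or rotate afterwards; the cleanup happens by itself.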

3. Monitoring that nobody reads is worse than no monitoring

We used to have Grafana dashboards for everything. Dozens of them. Disk I/O, CPU breakdown by process, memory pressure, query rate. Nobody ever looked at them except during post-mortems, where they were mostly useful for constructing a narrative after the fact.

What we actually use:

  • One alert channel per customer. The customer’s name is in the alert. If it fires, somebody is getting paged — either us or them. No ambient dashboard-watching.
  • One SLO per workload: “is it up and responding in under 500ms?” Everything else is diagnostic noise. If the workload is up and fast, nothing else matters for alerting purposes. (There’s a sketch of this check right after this list.)
  • A weekly PDF to every customer with the four numbers they actually care about: uptime, median response time, p95 response time, data transferred. None of the three of us writes these — a cron job generates them.
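
Here’s a sketch of that single check, assuming plain HTTP probes. The endpoint is illustrative, and `page()` stands in for whatever pager integration sits behind the per-customer channel.

```python
#!/usr/bin/env python3
"""Minimal sketch of the one SLO check: up, and answering in under 500 ms.

The probe URL is illustrative; page() is a placeholder for the
per-customer alert channel described above.
"""
import time

import requests

SLO_MS = 500

def page(customer: str, detail: str) -> None:
    # Placeholder: in reality this posts to the customer's alert channel.
    print(f"PAGE [{customer}] {detail}")

def check(customer: str, url: str) -> None:
    start = time.monotonic()
    try:
        resp = requests.get(url, timeout=5)
    except requests.RequestException as exc:
        page(customer, f"{url} unreachable: {exc}")
        return
    elapsed_ms = (time.monotonic() - start) * 1000
    if resp.status_code >= 500:
        page(customer, f"{url} returned {resp.status_code}")
    elif elapsed_ms > SLO_MS:
        page(customer, f"{url} answered in {elapsed_ms:.0f} ms")
    # Deliberately nothing else: no disk, no CPU, no query-rate checks here.

if __name__ == "__main__":
    check("example-customer", "https://example-customer.com/")
```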

The discipline isn’t adding monitoring; it’s deleting the monitoring that isn’t producing actions. Every dashboard we killed saved us about ten minutes a week of nobody looking at it.
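
The weekly PDF falls out of the same probe logs. A sketch of the arithmetic, assuming one CSV of probe results per customer per week; the column layout is an assumption, and the PDF rendering is left out because the four numbers are the point.

```python
#!/usr/bin/env python3
"""Sketch of the four weekly numbers, computed from a week of probe logs.

Assumes a CSV with ok, latency_ms, and bytes columns per probe
(illustrative); rendering and delivery are omitted.
"""
import csv
import statistics

def weekly_numbers(log_path: str) -> dict:
    oks, latencies, transferred = [], [], 0
    with open(log_path, newline="") as f:
        for row in csv.DictReader(f):
            ok = row["ok"] == "1"
            oks.append(ok)
            if ok:
                latencies.append(float(row["latency_ms"]))
            transferred += int(row["bytes"])
    return {
        "uptime_pct": round(100 * sum(oks) / len(oks), 3),
        "median_ms": statistics.median(latencies),
        # quantiles with n=20 yields 19 cut points; index 18 is the 95th percentile
        "p95_ms": statistics.quantiles(latencies, n=20)[18],
        "gb_transferred": round(transferred / 1e9, 2),
    }

if __name__ == "__main__":
    print(weekly_numbers("probes/example-customer/week.csv"))
```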

4. Backups that you’ve actually restored

This is the one I still get nervous about. Everybody backs up. Almost nobody actually restores. A backup that hasn’t been restored is aspirational storage, not a backup.

Our routine: every Monday, one of us picks a random customer and does a full restore into a staging environment from the most recent daily backup. We time it. We verify the data. We write down what went wrong. This has caught:

  • A customer whose backup was running but the restore was missing a table because of a schema change nobody had noticed
  • A bug in our own cron script where backups for databases larger than 12 GB were being silently truncated
  • Three cases of customers who had changed their own encryption config in ways we hadn’t updated our restore path for

None of these would have been found by monitoring. All of them would have been catastrophic during a real incident.
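
The drill itself is deliberately small. A sketch, assuming Postgres custom-format dumps; the customer list, paths, and sanity query are illustrative.

```python
#!/usr/bin/env python3
"""Sketch of the Monday restore drill, assuming Postgres custom-format dumps.

Customer list, backup paths, and the sanity query are illustrative; the
point is that the restore runs end to end, gets timed, and fails loudly.
"""
import random
import subprocess
import time

CUSTOMERS = ["acme", "globex", "initech"]  # in practice, read from inventory

def drill() -> None:
    customer = random.choice(CUSTOMERS)
    staging_db = f"restore_drill_{customer}"
    dump = f"/backups/{customer}/latest.dump"  # hypothetical backup layout

    subprocess.run(["createdb", staging_db], check=True)
    start = time.monotonic()
    # check=True is the whole drill: a restore that "mostly worked" on Monday
    # would also mostly work during a real incident, which is not good enough.
    subprocess.run(["pg_restore", "--dbname", staging_db, "--no-owner", dump],
                   check=True)
    minutes = (time.monotonic() - start) / 60

    # Sanity check that the schema actually came back with tables in it.
    tables = subprocess.run(
        ["psql", "-d", staging_db, "-tAc",
         "SELECT count(*) FROM information_schema.tables"
         " WHERE table_schema = 'public'"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    print(f"{customer}: restored in {minutes:.1f} min, {tables} public tables")

if __name__ == "__main__":
    drill()
```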

5. Boring databases

We run Postgres (and MySQL where a customer’s app insists). That’s the whole strategy. We don’t run NoSQL because we don’t need NoSQL. We don’t run Cassandra because nobody on the team has operated Cassandra at 3 AM, and you can’t onboard those skills on the night of an incident.

When a customer asks if they can have MongoDB, we say yes, spin up a managed instance at an external provider, and bill them for it transparently. We don’t pretend to operate it. “We operate one thing really well” is a simpler and more honest pitch than “we operate seven things at various levels of competence.”

6. The playbook beats the hero

The most expensive kind of ops engineer is the one who can fix anything but whose fixes live in their head. When Sam had a week of leave last year, I went through every incident from the previous three months and checked whether it was documented in a runbook. About 40% of them weren’t. Those were the incidents where the resolution involved “Sam knew what to do.”

This is a latent risk that only shows up the first time the person who knows what to do isn’t around. My job for the next two weeks was turning Sam’s head into markdown. Not for her benefit — for mine, and for the person we hire next.

A runbook is a forcing function for the ugly question: “what would the least experienced person on your team do if this fired right now?” If the answer is “page the hero,” you don’t have a runbook; you have a single point of failure wearing a hoodie.

The math on a three-person team

The headcount question: how do three people actually run 200 workloads?

  • ~90% of incidents are one of about a dozen known categories with a runbook
  • ~7% are new incidents but fit an existing pattern — one of us figures it out, and it becomes a runbook by end of day
  • ~3% are genuinely novel — multi-hour war-room situations, which thankfully happen maybe once a quarter
  • Everything that isn’t an incident, which is most of our time, goes to automation, customer onboarding, security patches, and writing the next version of the control plane

The uncomfortable truth is that running a managed service is mostly not exciting. Most of what we do is routine. The reason it looks impressive from the outside is that we successfully resist the temptation to make it interesting.

Interesting ops is broken ops. Boring ops sleeps through the night.

Written by
Harshita Raj

Part of the Wexron Infotech team building AutoCom, Wexron Hosting, and the odd side-quest in between. Based in Calicut, Kerala.