#postgres #debugging #war-story

Debugging a ghost: the three-week Postgres mystery

A customer's dashboard was slow. Not always — only on Tuesday afternoons. Not everything — only one specific query. Not a lot — just enough to be visible. It took us three weeks to find the cause, and the cause was absurd. Here's the story.

Sam Thomas
December 18, 2025 · 5 min read

This is a debugging story. You should not take any generalisable lessons from it. I’m only writing it down because the resolution was absurd enough to be worth sharing, and because someone on the team has to keep the oral history or we’ll forget it and repeat the same mistake.

The symptom

A customer — let’s call them Mimosa, which is not their real name — runs a mid-volume Shopify store with about 800 orders a day. They’ve been on Wexron Hosting for two years. Their dashboard, which they use to track orders and shipments, has been rock solid the entire time.

In late November, one of their team opens a ticket: “the dashboard is slow on Tuesday afternoons.”

Not slow always. Not slow everywhere. Slow on Tuesdays, between roughly 2 PM and 5 PM IST, and only on the orders-by-status view. Every other page was fine. Every other day was fine. Every other customer was fine.

My first thought: coincidence. The next Tuesday I watched it live and it was, in fact, slow. That particular view took 12 seconds to load on Tuesday at 3 PM. On Wednesday at 3 PM, the same view took 180 milliseconds.

The red herrings

I’ll summarise the wrong turns because they’re the instructive part.

Red herring #1: cache invalidation. My instinct was that something was busting the Redis cache in a pattern that coincided with Tuesday afternoon. I spent two days tracing cache hit rates. They were flat. The Redis cache was healthy. Dead end.

Red herring #2: backup window. Wexron Hosting runs full backups at 2 AM every day and differential backups every six hours. I suspected the differential at 2 PM was causing IO contention on the customer’s node. I shifted their backup schedule to 4 AM, waited a week, and the next Tuesday afternoon it was still slow. Dead end.

Red herring #3: a cron job. The customer had a weekly sync with their warehouse management system. I assumed it ran on Tuesdays. I checked. It ran on Thursdays. Dead end.

Red herring #4: another tenant on the same node. We run a small number of tenants per physical host. I suspected a noisy neighbour. I moved Mimosa to an isolated node. Next Tuesday: still slow. Dead end.

By the end of week two I was wondering whether I was imagining the pattern.

The clue

Week three, Tuesday, 2:47 PM. I caught the slow query live in pg_stat_activity, ran EXPLAIN on it, and something odd jumped out: the query plan had changed. Postgres was choosing a sequential scan on the orders table where it used to use an index scan. The plan for the same query on Wednesday morning used the index. Same query. Same data. Different plan.
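For anyone who hasn’t done this kind of stakeout: pg_stat_activity shows what’s running right now and for how long. A minimal sketch (the five-second cutoff is arbitrary):

-- currently running statements, longest first
SELECT pid,
       now() - query_start AS running_for,
       left(query, 80) AS query_preview
FROM pg_stat_activity
WHERE state = 'active'
  AND query_start < now() - interval '5 seconds'
ORDER BY running_for DESC;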

This was the first useful observation in three weeks. It meant the statistics Postgres was using to choose a plan had changed, and changed in a way that was reverting by the next morning.
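Comparing the plans is just a matter of running EXPLAIN at both times. Roughly like this, with the index name and the filter value as stand-ins, since I’m not pasting the customer’s schema:

-- Tuesday afternoon this produced: Seq Scan on orders
-- Wednesday morning, the same statement: Index Scan using orders_status_idx
EXPLAIN SELECT * FROM orders WHERE status = 'pending';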

The culprit

Here is what was happening. I am going to explain it slowly because it is genuinely one of the most ridiculous things I’ve debugged.

Mimosa’s orders table has a column called status with values like pending, confirmed, shipped, delivered, cancelled. The distribution is extremely skewed — 94% of rows are delivered, 5% are cancelled, and the rest are split between the live states.
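Seeing that kind of skew for yourself is one GROUP BY. The generic shape, not pasted from their database:

-- distribution of the status column
SELECT status,
       count(*) AS n,
       round(100.0 * count(*) / sum(count(*)) OVER (), 1) AS pct
FROM orders
GROUP BY status
ORDER BY n DESC;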

Postgres keeps statistics on this column for the query planner. The statistics are collected by autovacuum when it thinks enough has changed since the last sample. For a stable table, this happens once every several days.
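Concretely, autovacuum re-analyses a table once enough rows have changed (by default, 50 rows plus 10% of the table), and pg_stat_user_tables tells you when that last happened:

-- when were statistics last resampled, and how much has changed since?
SELECT relname,
       last_autoanalyze,
       n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname = 'orders';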

Every Tuesday at 2 PM, Mimosa ran a bulk refund job. The job updated a large batch of orders from delivered to refunded — a status they’d added recently, which did not yet appear in the statistics Postgres had for that column.
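The job itself was unremarkable. Their actual SQL is theirs to keep, but the shape was roughly this (refunds_to_process is my stand-in name for their staging table):

-- hypothetical shape of the weekly refund job
UPDATE orders
SET status = 'refunded'
WHERE id IN (SELECT order_id FROM refunds_to_process);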

Here’s the thing: autovacuum saw a bunch of UPDATEs happen and decided it should resample the statistics. When it did, the distribution of status values for the next few hours was temporarily weird — refunded was overrepresented relative to its long-term share. Postgres looked at the fresh sample, saw that delivered was no longer the overwhelming majority, and decided the index on status was no longer selective enough to be worth using for queries filtering by other status values.
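You can watch the planner’s beliefs change through pg_stats, Postgres’s view over per-column statistics. Right after the batch job, a query like this would have shown refunded sitting in the most-common-values list with an inflated frequency:

-- what the planner currently believes about orders.status
SELECT most_common_vals,
       most_common_freqs,
       n_distinct
FROM pg_stats
WHERE tablename = 'orders'
  AND attname = 'status';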

So it switched to a sequential scan.

By Wednesday morning, the long tail of new orders had restored the normal distribution, autovacuum ran again, and the plan flipped back to using the index.

The slow dashboard was a statistical artifact of the interaction between autovacuum’s sampling cadence and a weekly batch job that temporarily skewed the status distribution. It was self-healing. It had always been self-healing. Nobody had noticed for weeks because the window of slowness was narrow enough that most customers didn’t run the affected view during those three hours.

The fix

Two lines of SQL:

ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
ANALYZE orders;

The first line raises the statistics target for the status column from the default of 100 to 1000. That gives Postgres room for a much more detailed picture of the column, and, just as importantly, makes ANALYZE sample far more rows when it rebuilds that picture. The second line forces an immediate recollection.

With the larger target, the sampled statistics stay close to the true distribution even right after the batch job, so the planner can see both the long-term 94% share of delivered and the short-term spike in refunded without mistaking one for the other. Since the fix, the Tuesday afternoon slowness has not recurred.
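If you want to confirm a fix like this stuck, pg_attribute records the per-column override, and re-running EXPLAIN on the dashboard query after the next batch window confirms the plan holds:

-- confirm the per-column statistics target is set (100 is the default)
SELECT attstattarget
FROM pg_attribute
WHERE attrelid = 'orders'::regclass
  AND attname = 'status';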

Total time to apply the fix: 30 seconds. Total time from symptom to fix: nineteen days.

The generalisable lesson (there isn’t one, really)

The obvious takeaway would be something like “always check your query plans” or “be suspicious of weekly patterns.” Those are true but unhelpful — I already do both of those things and I still took three weeks.

The actual lesson, for me, was about the texture of intermittent bugs. When something is slow only sometimes, the temptation is to assume an obvious external cause: load, a backup, a cron. Those are the first things to rule out because they’re the most common. But once you’ve ruled them out, you have to get suspicious of the system’s internal feedback loops — the things it does to itself on its own schedule that nobody on your team wrote.

Postgres’s autovacuum is one of those feedback loops. So is Redis’s key eviction policy. So is the kernel’s page cache. So is the garbage collector on whatever language your hot path is written in. When none of your code is the culprit, the culprit is usually something that runs in the background on a clock you didn’t set, reacting to patterns you didn’t create deliberately.

The debugger in me thinks this is fascinating. The on-call in me thinks it is exhausting. They are both correct and they cannot agree on what to do about it.

Written by Sam Thomas

Part of the Wexron Infotech team building AutoCom, Wexron Hosting, and the odd side-quest in between. Based in Calicut, Kerala.