The first warning we got was a member of the studio waking up to their phone vibrating at three in the morning. Not from a customer. From PagerDuty. GMC_DISAPPROVAL_RATE had crossed a threshold we didn't even remember setting, on a tenant we weren't actively shipping to, and 312 of their products had been kicked out of Google Shopping inside a 90-second window.
By the time we'd brewed coffee and pulled the dashboard up, another tenant was halfway through the same shape of disapproval wave. Two tenants had gone down the same path, apparently independently, at around the same time, with no obvious common factor. That's the kind of incident that's worth a teardown, because it's never as random as it looks at 3am.
What was actually breaking
The surface-level symptom was "Image link not crawlable" appearing on hundreds of items at once. The first instinct is always "the host went down". That hypothesis dies pretty quickly: direct requests to the image hosts returned 200s, browser loads were fine, and the affected URLs were spread across both tenants' CDNs.
The actual problem was further upstream. Both tenants used a custom plugin that rewrote image_link values through a short-lived signed-URL service for transformation. The signing token had a 24-hour TTL. Our feed-sync job ran every two hours, which meant the signed URLs in the feed payload were always fresh — except on the GMC side, which had cached the previous fortnight's URLs and was now hitting them in a batch re-validation pass. The signed URLs were 410-ing because they were old. GMC interpreted that as "image not crawlable", marked the products as policy-violating, and queued them for mass-disapproval.
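The failure mode is easy to reproduce in miniature. The sketch below is not the plugin's actual signing scheme; it assumes a plain HMAC-over-URL-plus-expiry design, and `SECRET`, `sign_url`, `status_for`, and the example hostname are all hypothetical. It shows why a URL that looks fine to a two-hourly sync returns 410 to a crawler replaying a fortnight-old copy.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"hypothetical-signing-key"  # assumption: not the real key or scheme
TTL_SECONDS = 24 * 60 * 60            # the 24-hour TTL from the incident
DAY = 24 * 60 * 60


def _signature(base_url: str, expires: int) -> str:
    """HMAC-SHA256 over the unsigned URL and its expiry timestamp."""
    message = f"{base_url}|{expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def sign_url(base_url: str, issued_at: int, ttl: int = TTL_SECONDS) -> str:
    """Return base_url with 'expires' and 'sig' query parameters appended."""
    expires = issued_at + ttl
    query = urlencode({"expires": expires, "sig": _signature(base_url, expires)})
    return f"{base_url}?{query}"


def status_for(url: str, now: int) -> int:
    """HTTP status a signing service of this shape would plausibly return."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    base_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    expires = int(query["expires"][0])
    if not hmac.compare_digest(_signature(base_url, expires), query["sig"][0]):
        return 403  # tampered or wrongly signed
    return 200 if now <= expires else 410  # valid signature, but maybe expired


issued = 1_700_000_000  # arbitrary example timestamp
url = sign_url("https://img.example.com/t/sku-1.jpg", issued)

fresh_status = status_for(url, issued + 2 * 60 * 60)  # fetched within one sync interval
stale_status = status_for(url, issued + 14 * DAY)     # cached copy re-validated a fortnight later
```

Under these assumptions `fresh_status` is 200 and `stale_status` is 410. Nothing is wrong with the image or the host; the only thing that changed is how long the consumer held on to the URL.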
Both tenants hitting the wave at the same time was no coincidence, and the two failures were not independent: both tenants were on the same GMC re-validation cycle.
The fix that should have been in place already
We patched the immediate problem in the obvious way: bumped the signing TTL to 14 days, then to permanent for image transforms that don't leak data. Products auto-recovered as the next GMC fetch cycle ran clean.
The structural fix took longer and was the actual learning. We added two boring pieces of monitoring that we should have shipped two years earlier:
- Image-link fetch monitor. We now run a sampled-fetch check against ~5% of every tenant's image_link URLs every 30 minutes, from outside our own infrastructure. Any 4xx/5xx fires a Slack alert before GMC sees it.
- Disapproval-rate anomaly detection. The page at 3am was set up after the fact: a rolling 6-hour rate of disapprovals per tenant, with a 4-sigma deviation alert. The first tenant to wave-disapprove now pages us in seconds, not in the morning.
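Both checks are small enough to sketch. What follows is a rough illustration rather than our production code: the function names are hypothetical, the HTTP fetch is passed in as a callable (`fetch_status`) so the logic can run without a network, and the anomaly check assumes the history is a list of disapproval counts from earlier 6-hour windows for one tenant.

```python
import random
import statistics
from typing import Callable, Optional, Sequence


def sample_urls(urls: Sequence[str], fraction: float = 0.05,
                rng: Optional[random.Random] = None) -> list[str]:
    """Pick roughly `fraction` of a tenant's image_link URLs, at least one."""
    if not urls:
        return []
    rng = rng or random.Random()
    k = max(1, round(len(urls) * fraction))
    return rng.sample(list(urls), k)


def failing_urls(urls: Sequence[str],
                 fetch_status: Callable[[str], int]) -> list[str]:
    """Return URLs whose fetch came back 4xx/5xx.

    `fetch_status` is an assumed hook: in practice it would issue the
    request from outside our own infrastructure and return the status code.
    """
    return [u for u in urls if fetch_status(u) >= 400]


def is_anomalous(history: Sequence[float], current: float,
                 sigmas: float = 4.0) -> bool:
    """True if `current` sits more than `sigmas` standard deviations
    above the mean of the previous windows."""
    if len(history) < 2:
        return False  # not enough data to estimate a deviation
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current > mean  # flat history: any increase is unusual
    return (current - mean) / stdev > sigmas
```

With a quiet history such as `[2, 3, 1, 2, 4, 3, 2, 3]`, a window containing 312 disapprovals trips the check and a window containing 4 does not. The sampled fetch is the earlier signal of the two: it fails on the expired URL itself, before any channel has acted on it.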
Why this is the right shape of fix
The temptation after an incident is always to go big — replatform the imaging service, write a custom GMC client, build a feed health score. We've seen agencies do all three after similar incidents. None of them help.
The boring fix is the right fix. Most production incidents don't need new architecture; they need two bits of monitoring you skipped because you couldn't see the value until they would have caught the thing. Build the monitoring as you build the system, not after the postmortem.
What we'd do differently
Two things, in order of impact. First, never let a third-party channel be the canary for something you can monitor yourself. GMC told us 312 items were broken; our own infrastructure should have told us that 30 minutes earlier. Second, signed URLs don't belong in feeds at all. The mental model of "this URL is temporarily valid" doesn't survive contact with caching channels. Rewrite the image-link strategy so any URL we put in a feed is permanently valid for at least the lifetime of the SKU.
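One cheap way to enforce the second rule is a lint step that runs before a feed is published. This is a sketch under assumptions: the parameter names in `SUSPECT_PARAMS` are an illustrative guess at what signed or expiring URLs tend to carry (the `X-Amz-*` names are the ones AWS presigned URLs use), the feed is modelled as a list of dicts with `id` and `image_link` keys, and `lint_feed` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlparse

# Illustrative, not exhaustive: query parameters that usually mean
# "this URL stops working at some point".
SUSPECT_PARAMS = {
    "expires", "exp", "sig", "signature", "token",
    "x-amz-expires", "x-amz-signature",
}


def expiring_params(url: str) -> list[str]:
    """Return the query parameter names in `url` that look like signing or expiry."""
    query = parse_qs(urlparse(url).query, keep_blank_values=True)
    return sorted(name for name in query if name.lower() in SUSPECT_PARAMS)


def lint_feed(items: list[dict]) -> dict[str, list[str]]:
    """Map item id -> suspicious parameters, for items that have any."""
    report = {}
    for item in items:
        found = expiring_params(item["image_link"])
        if found:
            report[item["id"]] = found
    return report


feed = [
    {"id": "sku-1", "image_link": "https://cdn.example.com/a.jpg?w=800"},
    {"id": "sku-2", "image_link": "https://cdn.example.com/b.jpg?w=800&expires=1700000000&sig=abc"},
]
report = lint_feed(feed)
```

Here `report` flags only `sku-2`, with `["expires", "sig"]`. A check like this cannot prove a URL is permanent, but it catches the specific mistake that caused this incident before the feed leaves the building.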
Both of those changes have shipped on FeedPulse since. The 312-product problem hasn't recurred, but the monitoring would tell us within half an hour if it did.