On April 25 at 2:27 AM Eastern, Klaviyo's APIs started returning intermittent 503 errors.

Klaviyo's status page acknowledged it at 3:24 AM Eastern on April 26.

That's a roughly 25-hour acknowledgment gap. The fix went out within an hour of the public notice. For the day in between, customers had no signal that anything was wrong.

The errors were intermittent, not constant. That's not the part worth fixating on. The part worth fixating on is that the window opened on a Saturday morning Eastern time, ran through Anzac Day weekend in Australia and New Zealand, covered a full Saturday of European and US ecommerce sending, and was not posted until the early hours of Sunday Eastern.

This is the part of every ESP outage that doesn't get talked about.

What the resolution notice quietly admits

Klaviyo's own closing line tells the real story: "Any requests that returned 503s during the outage window will need to be retried."

Translation: data was lost. Customers need to backfill it.

For one-off API calls, that's manageable. For high-volume event streams feeding lifecycle automation, "retry" is a polite way of saying "good luck reconstructing what should have happened."
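For a pipeline you control, "retry" only works if the failed events were kept somewhere. A minimal sketch of that idea, assuming a hypothetical `post` callable that sends one event and returns the HTTP status code: retry transient failures with exponential backoff and jitter, and park anything that still fails so it can be replayed after the incident. The class name, defaults, and the set of retryable codes are illustrative choices, not Klaviyo's client or recommendation.

```python
import random
import time
from dataclasses import dataclass, field
from typing import Callable

# Status codes treated as transient in this sketch (an assumption, tune per API).
RETRYABLE = {429, 500, 502, 503, 504}


@dataclass
class EventSender:
    # Hypothetical transport: sends one event, returns the HTTP status code.
    post: Callable[[dict], int]
    max_attempts: int = 5
    base_delay: float = 1.0   # seconds
    max_delay: float = 60.0   # cap on any single wait
    sleep: Callable[[float], None] = time.sleep
    dead_letters: list = field(default_factory=list)

    def send(self, event: dict) -> bool:
        """Try to deliver one event; park it in dead_letters if delivery fails."""
        for attempt in range(self.max_attempts):
            status = self.post(event)
            if 200 <= status < 300:
                return True
            if status not in RETRYABLE:
                break  # not transient: stop retrying, keep the event for inspection
            if attempt < self.max_attempts - 1:
                ceiling = min(self.max_delay, self.base_delay * 2 ** attempt)
                self.sleep(random.uniform(0, ceiling))  # backoff with full jitter
        self.dead_letters.append(event)
        return False

    def replay_dead_letters(self) -> int:
        """Re-send parked events; returns how many were delivered this time."""
        pending, self.dead_letters = self.dead_letters, []
        return sum(1 for event in pending if self.send(event))
```

In a real pipeline the dead-letter list would live in durable storage, since an in-memory list does not survive a restart. The point is the shape: a 503 becomes a parked event with a replay path, not a silent drop.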

What the impact actually depends on

Whether your team felt this depends on three things, and most CRM leads don't know the answer to any of them for their own stack.

How your source systems handle 5xx responses. Shopify retries webhooks with exponential backoff for up to 48 hours. Custom event pipelines might retry once and give up. Some don't retry at all. The same outage produces zero data loss on one integration and a 25% event drop on another, in the same account.

Which flows depend on which events. Welcome flows triggered by a subscribed_to_list event from a third-party form behave differently to ones triggered by a Klaviyo-native signup form. Post-purchase flows triggered by a Shopify order_created webhook behave differently to ones triggered by a custom event from a headless storefront. The blast radius of an API outage maps to your event topology, not your campaign calendar.

Whether anyone is checking flow trigger volume. Klaviyo's flow analytics will show a drop in triggered events if you know to look. Most teams check flow performance weekly at the conversion level, not daily at the trigger-volume level. A partial degradation across a weekend, against normal weekend variance, will not look obviously anomalous.
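The trigger-volume check is simple to sketch once daily counts are exported somewhere. The version below assumes you already have a mapping of date to trigger count for one flow (how you export it is up to you) and compares each day against the median of the same weekday in prior weeks, so a quiet Saturday is judged against other Saturdays. The function name, the four-week lookback, and the 10% threshold are assumptions for illustration.

```python
from datetime import date, timedelta
from statistics import median


def trigger_shortfalls(
    daily_counts: dict[date, int],
    check_days: list[date],
    weeks_back: int = 4,
    threshold: float = 0.10,
) -> dict[date, float]:
    """Return {day: fractional shortfall} for days running below baseline.

    Baseline is the median count for the same weekday over the previous
    `weeks_back` weeks. A day is flagged when it falls short of that
    baseline by more than `threshold` (0.10 means 10%).
    """
    flagged = {}
    for day in check_days:
        history = [
            daily_counts[prior]
            for w in range(1, weeks_back + 1)
            if (prior := day - timedelta(weeks=w)) in daily_counts
        ]
        if not history:
            continue  # no same-weekday history to compare against
        baseline = median(history)
        if baseline == 0:
            continue  # flow normally idle on this weekday
        shortfall = 1 - daily_counts.get(day, 0) / baseline
        if shortfall > threshold:
            flagged[day] = round(shortfall, 3)
    return flagged
```

Comparing like weekday with like weekday is the design choice that matters here: it is what stops a partial weekend degradation from hiding inside normal weekend variance.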

Why the dashboard doesn't help

ESPs report on what their systems did. They cannot report on what their systems should have done but didn't.

A flow that never triggered looks identical to a flow with no eligible audience. A send that failed silently looks identical to a send that wasn't scheduled. There's no error state. There's just absence.

Even where the data is technically available, like Klaviyo's flow trigger metrics, it requires the operator to already suspect a problem and go looking for it. The dashboard is reactive, not proactive. If you're not already asking "did my welcome flow fire 14% less than expected last weekend," the dashboard won't volunteer it.

Why the status page doesn't help either

The status page is supposed to be the bridge. The April 25 incident shows what happens when the bridge takes a day to appear.

This is not unique to Klaviyo. StatusGator has logged nearly 500 Klaviyo incidents since 2019. Every major ESP has a comparable record. HubSpot, Braze, Iterable, Adobe Campaign, Salesforce Marketing Cloud, all of them. Acknowledgment lag is a category-wide pattern.

The further upstream the issue is in the platform's stack, the longer it takes to surface publicly. API gateway issues, queue backups, and identity service degradation can all run for hours before they're even visible to the engineers who will triage them, let alone written up for a status page.

And even after acknowledgment, the gap continues. The notice describes what was technically broken. It doesn't translate that into customer-facing impact. It doesn't list which flows would have been affected. It doesn't tell you which audiences to backfill or which sends to replay.

That translation work is left to each customer to do for themselves. Most don't, because the people watching the status page and the people accountable for journey performance are usually different people, on different teams, with different escalation paths.

What to do about it

Three things, in increasing order of effort.

Subscribe a shared Slack channel to the status page of every platform in your CRM stack: Klaviyo, your CDP, your CRM, your identity provider. Most offer RSS or webhook delivery. It costs nothing, and it at least gives you a timestamp to anchor against when a flow underperforms next week.
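Where a status page has no native Slack delivery, a small poller can bridge the gap. This is a rough sketch, assuming the page exposes a standard RSS 2.0 feed and the channel has a Slack incoming webhook, which accepts a JSON body with a `text` field. The function names, the `source` label, and any URLs you pass in are hypothetical, and fetching the feed on a schedule is left out.

```python
import json
import urllib.request
import xml.etree.ElementTree as ET


def parse_incidents(rss_xml: str) -> list[dict]:
    """Extract one dict per <item> from an RSS 2.0 feed body."""
    root = ET.fromstring(rss_xml)
    return [
        {
            # Fall back to the link when the feed omits <guid>.
            "id": item.findtext("guid") or item.findtext("link") or "",
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        }
        for item in root.iter("item")
    ]


def new_incident_messages(rss_xml: str, seen_ids: set, source: str) -> list[dict]:
    """Build Slack payloads for incidents not seen before; updates seen_ids."""
    messages = []
    for incident in parse_incidents(rss_xml):
        if incident["id"] in seen_ids:
            continue
        seen_ids.add(incident["id"])
        text = f"[{source}] {incident['title']} ({incident['published']}) {incident['link']}"
        messages.append({"text": text})
    return messages


def post_to_slack(webhook_url: str, message: dict) -> None:
    """POST one payload to a Slack incoming webhook URL."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request, timeout=10).close()
```

Keeping the vendor's own `pubDate` in the message is deliberate: it is the timestamp you will later compare against the moment your own flows started to dip.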

Audit your highest-revenue flows for trigger-source fragility. For each of the top five revenue-driving journeys, write down which event triggers it, which system emits that event, and what that system does when it gets a 5xx response. Most teams have never documented this. The exercise alone exposes the weakest links.
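The audit fits in a very small data structure. A sketch, with every name here an assumption made for illustration: one record per flow, capturing the trigger event, the emitting system, and what that system does on a 5xx, plus a helper that surfaces the weakest links first. Undocumented behaviour is deliberately ranked as the most fragile, since "we don't know" is the riskiest answer.

```python
from dataclasses import dataclass
from enum import IntEnum


class On5xx(IntEnum):
    """What the source system does on a 5xx, ordered from most to least fragile."""
    UNKNOWN = 0
    NO_RETRY = 1
    RETRY_ONCE = 2
    RETRIES_WITH_BACKOFF = 3


@dataclass(frozen=True)
class FlowAudit:
    flow: str              # e.g. "Welcome"
    trigger_event: str     # e.g. "subscribed_to_list"
    source_system: str     # e.g. "third-party form"
    on_5xx: On5xx
    monthly_revenue: float  # your own attribution figure, any currency


def weakest_links(
    audits: list[FlowAudit],
    safe_at: On5xx = On5xx.RETRIES_WITH_BACKOFF,
) -> list[FlowAudit]:
    """Flows whose trigger source is less robust than `safe_at`.

    Sorted most fragile first, then by highest revenue within each tier.
    """
    risky = [a for a in audits if a.on_5xx < safe_at]
    return sorted(risky, key=lambda a: (a.on_5xx, -a.monthly_revenue))
```

Five rows of this, filled in honestly, is the whole audit. The `UNKNOWN` rows are the ones to chase first.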

Run independent monitoring of journey-level outcomes. Not deliverability testing, which checks whether your email reaches the inbox under normal conditions. Journey monitoring: did the trigger fire, did the send go out, did the right content render for the right audience, did the link work. Inbox-out, not dashboard-out.

The April 25 Klaviyo incident is closed. The status page has moved on. Most CRM teams who were affected will never know they were.

That 25-hour silence is the part that costs the most.

Catch the next ESP outage before anyone else does

Telltide monitors your CRM journeys from the inbox side. When your ESP's status page is silent but your sends are not arriving, Telltide alerts your team within hours, not days.
