Last week we wrote about the supervision gap in agentic CRM: the execution layer has moved from humans to AI agents, but the verification layer has not moved with it. That post made the argument. This one names the specific failure patterns that fall through.
These are the patterns that look fine inside Klaviyo Composer, Salesforce Agentforce, or HubSpot Breeze, and look wrong from the customer's inbox. Each maps to a real failure mode lifecycle teams are encountering as agents move into production CRM workflows.
Most are not new in principle. They are old failure modes that get more frequent and harder to catch when an autonomous system is generating campaigns at machine pace. Per-campaign review can catch some. Aggregate, independent monitoring catches the rest.
1. Sender mismatch on agent-triggered sends
The agent generates a campaign and assigns it to a sender profile. The platform accepts the configuration because the sender is authorised at the account level. The campaign goes out from a sub-brand address, a recently archived sender, or a fallback configured during onboarding that nobody remembers. The team finds out from a customer reply.
This pattern lands hardest on multi-brand operators, B2B teams with account-managed senders, and any business running more than two authorised sender profiles. Single-brand DTC ecommerce teams with one sender configured see it less often, though the impact when it does hit is identical.
The platforms have no concept of a "wrong but permitted" sender. From the platform's perspective, the send is healthy.
What to monitor independently: the From address actually received on the inbox side, compared against the expected From address for the campaign type. Alert on any mismatch, including casing differences in display names and unexpected reply-to changes.
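A minimal sketch of that comparison, assuming a seed inbox captures the raw From and Reply-To headers of each send and the team keeps its own table of expected senders per campaign type. The names EXPECTED_SENDERS and check_sender and the example addresses are hypothetical, not part of any platform's API.

```python
from email.utils import parseaddr

# Hypothetical table of expected senders, maintained by the team (not the agent).
EXPECTED_SENDERS = {
    "order_confirmation": {
        "from": "Acme Store <orders@acme.example>",
        "reply_to": "support@acme.example",
    },
}


def check_sender(campaign_type, received_from, received_reply_to=None):
    """Return a list of mismatch descriptions; an empty list means no mismatch."""
    expected = EXPECTED_SENDERS.get(campaign_type)
    if expected is None:
        return [f"no expected sender configured for {campaign_type!r}"]

    problems = []
    exp_name, exp_addr = parseaddr(expected["from"])
    got_name, got_addr = parseaddr(received_from)

    # Addresses are compared case-insensitively (a simplification).
    if got_addr.lower() != exp_addr.lower():
        problems.append(f"from address {got_addr!r} != {exp_addr!r}")

    # Display names are compared exactly, so casing differences are flagged.
    if got_name != exp_name:
        problems.append(f"display name {got_name!r} != {exp_name!r}")

    # An unexpected Reply-To counts as a mismatch, including one that
    # appears when none was expected.
    exp_reply = (expected.get("reply_to") or "").lower()
    got_reply = parseaddr(received_reply_to or "")[1].lower()
    if got_reply != exp_reply:
        problems.append(f"reply-to {got_reply!r} != {exp_reply!r}")

    return problems
```

Anything other than an empty list is worth an alert. The important design choice is that the expected table lives outside the platform, so a "wrong but permitted" sender has something to be wrong against.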
2. Unexpected sends
The risk is not that agents bypass approval. The risk is that approval is configured at the wrong scope. Mature teams have approval gates, but the gate depends on the configuration. A new flow added through the agent's iteration loop can bypass the human reviewer if the approval policy is scoped at the campaign level rather than the run level.
The platform's audit log shows the agent did its job, but governance has no trail of who decided: there is no record of a human approving the specific version that fired. This is the scope-of-approval failure, and it is the most common form of unexpected send in production today.
What to monitor independently: compare actual sends fired against an authoritative list of approved campaigns and flow versions. Alert when a send goes out without a matching approved version. This is the simplest of the five patterns to detect and the easiest one to get wrong, because the authoritative list has to be maintained independently of the agent. If the agent maintains the list of approved campaigns, the agent is grading its own homework.
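As a rough sketch, the check is a set difference between what fired and what was approved. The Send type, the flow names, and the APPROVED_VERSIONS register below are hypothetical; the assumption that matters is that the register is written by a human reviewer and is not editable by the agent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Send:
    flow_id: str
    version: str


# Hypothetical approval register, maintained independently of the agent.
APPROVED_VERSIONS = frozenset({
    ("welcome_series", "v3"),
    ("cart_abandon", "v7"),
})


def unapproved_sends(observed, approved=APPROVED_VERSIONS):
    """Return sends whose exact (flow, version) pair has no approval on record."""
    return [s for s in observed if (s.flow_id, s.version) not in approved]


observed = [
    Send("welcome_series", "v3"),  # approved
    Send("welcome_series", "v4"),  # new version, never reviewed
    Send("winback", "v1"),         # new flow, never reviewed
]
flagged = unapproved_sends(observed)
```

Matching on the version as well as the flow is what closes the scope-of-approval hole: an approval for v3 says nothing about v4.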
3. Behavioural silence after an agent action
The agent runs a workflow. The send fires. Delivered counts look healthy. Expected downstream engagement drops to zero or near-zero. Opens, clicks, replies, conversions all flat.
What makes the agent case different from a human-built campaign is that the agent typically generates the segment, the content, and the timing in one autonomous decision. Three potential sources of silence converge into a single send, and the team has visibility on none of the three until the engagement signal lands. The cause could be agent targeting, a stale segment definition, a list-quality issue, a deliverability shift, or content the audience does not respond to. The pattern is the same regardless. The platform reports the send as successful because, technically, it was. The customer experience is wrong.
This matters disproportionately because workflow-triggered sends do most of the work. Industry benchmarks have long shown that automated flows account for a small share of total volume but a large share of revenue. A flow that goes silent takes out the highest-leverage sends a programme runs.
What to monitor independently: a baseline of expected engagement per flow type and per audience cohort, with an alert when post-send engagement falls outside the expected range. The signal is not "engagement is low". The signal is "engagement is meaningfully lower than this flow normally produces for this audience". Without the baseline, the alert is noise. With the baseline, behavioural silence becomes the earliest leading indicator of a wider failure.
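One way to turn "meaningfully lower than normal" into a rule is a simple deviation test against the flow's own history. This is a sketch under stated assumptions: history holds past engagement rates (for example click rates between 0 and 1) for one flow type and one audience cohort, and the threshold k and minimum history length are placeholder values a team would tune.

```python
from statistics import mean, stdev


def is_behaviourally_silent(history, current_rate, k=3.0, min_history=5):
    """Compare one send's engagement rate with the baseline for its flow and cohort.

    Returns None when there is not enough history to form a baseline,
    otherwise True if current_rate is more than k standard deviations
    below the historical mean.
    """
    if len(history) < min_history:
        return None
    baseline = mean(history)
    spread = stdev(history)
    return current_rate < baseline - k * spread


history = [0.40, 0.42, 0.38, 0.41, 0.39, 0.40]
silent = is_behaviourally_silent(history, 0.02)   # far below this flow's norm
normal = is_behaviourally_silent(history, 0.39)   # within the usual range
```

The None branch is deliberate: without a baseline the function declines to answer rather than raise a noisy alert. A flow with very erratic history will rarely trip this test, so such flows would need a different rule.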
4. Volume spikes from agent runs
The agent decides to send to a larger cohort than the team would have. The cause could be a segmentation bug, a misinterpreted prompt, or a deliberate expansion that the agent's logic justified internally but no human reviewed. The platform accepts the send because the volume is within the account's send capacity.
The downstream effects are quiet and durable. A single volume spike that pushes the spam complaint rate to 0.3% or above puts the sender at risk of filtering at Gmail, whose bulk sender guidelines name that level as the one to avoid. Reputation metrics are rolling averages, so the damage compounds over the recovery window. Industry deliverability research puts recovery at four to twelve weeks for moderate damage and six to twelve months for severe damage. The agent reports the send as a success. The lifecycle team sees engagement from the part of the cohort that did receive the message and concludes the campaign performed. The reputation hit shows up weeks later, in the next campaign's open rate.
What to monitor independently: a per-flow and per-campaign-type expected volume range. Alert on outliers above or below the range, weighted for time of day and seasonality. Volume below expected matters too: it usually indicates a different failure mode (an excluded segment, a broken trigger), but it surfaces through the same monitor. Around 70% of senders operate without reputation monitoring, which means most teams cannot tell the difference between a volume spike and a reputation drop until both have happened.
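A minimal sketch of such a range check, assuming send counts are bucketed into slots such as (weekday, hour) so that time-of-day patterns are compared like for like. The slot scheme, the tolerance factor, and the function name volume_alert are illustrative assumptions, and the median band is a stand-in for whatever range model a team prefers.

```python
from statistics import median


def volume_alert(history_by_slot, slot, observed, tolerance=2.0, min_history=4):
    """Classify an observed send count against the typical count for its slot.

    history_by_slot maps a slot, e.g. ("mon", 9), to a list of past send counts.
    Returns "spike", "drop", "ok", or "no_baseline".
    """
    past = history_by_slot.get(slot, [])
    if len(past) < min_history:
        return "no_baseline"
    typical = median(past)
    if observed > typical * tolerance:
        return "spike"
    if observed < typical / tolerance:
        return "drop"
    return "ok"


history = {("mon", 9): [1000, 1100, 950, 1050]}
spike = volume_alert(history, ("mon", 9), 5000)
drop = volume_alert(history, ("mon", 9), 300)
```

Both directions come out of the same monitor, which matches the point above: a spike suggests over-targeting, a drop suggests an excluded segment or a broken trigger.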
5. Agent quality drift
The agent's output degrades slowly. Subject lines get blander. Segment selection gets coarser. Content variants get more similar to one another. Each individual send still passes per-campaign review because nothing about any single send is obviously wrong. The trend only shows in aggregate, over weeks.
The phenomenon is documented in the AI research literature. Post-training routinely induces mode collapse in language models, where the model assigns disproportionately high probability mass to a narrow range of responses among many valid alternatives. Chroma's 2025 context rot research tested 18 frontier models and found that all 18 degrade in output quality as input context grows, well before the context window is full. Microsoft Security Research's 2026 paper on LLM failure modes catalogues version drift, mode collapse, and degeneration loops as distinct production failure patterns that surface specifically under agent and tool-use loads.
Platforms are built to measure per-campaign performance, not aggregate output quality over time. Drift is invisible at the per-send level by design.
What to monitor independently: a small set of quality signals tracked across every agent-generated send. Subject line diversity, distinct segment count, content variant breadth, engagement trend at the agent-output level rather than the per-campaign level. The shape of the signal matters more than the absolute number. A flattening curve across any of these is the early signal that the agent is regressing.
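To make one of these signals concrete, here is a crude sketch of subject line diversity tracked week by week. The metric (share of distinct words across a week's subject lines) and the rule (three consecutive weekly declines) are assumptions chosen for simplicity, not an established standard; a real monitor would likely use a richer similarity measure.

```python
def subject_diversity(subject_lines):
    """Share of distinct lowercase words across a batch of subject lines (0 to 1)."""
    words = [w for line in subject_lines for w in line.lower().split()]
    return len(set(words)) / len(words) if words else 0.0


def is_declining(weekly_scores, weeks=3):
    """True when the score has fallen for `weeks` consecutive weeks."""
    recent = weekly_scores[-(weeks + 1):]
    if len(recent) < weeks + 1:
        return False  # not enough weeks measured yet
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))


week_score = subject_diversity(["Big sale today", "Big sale tomorrow"])
drifting = is_declining([0.80, 0.70, 0.60, 0.50])
steady = is_declining([0.80, 0.70, 0.75, 0.50])
```

The check looks only at the shape of the series, not at any absolute value, so it can be applied in the same way to distinct segment count or content variant breadth.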
The honest difficulty is that you need a baseline to detect drift, and you do not have one until you start measuring. That is the case for starting now rather than later. Teams that wait for the baseline to be ready will only have it after the drift has done its damage.
This is the failure mode least likely to be caught by any review process, because no single review session sees enough sends to notice the trend. It is the most dangerous one to leave unmonitored, because by the time it shows up in revenue, the regression has been compounding for weeks.
What ties these five together
All five patterns share a structure. The platform's view of "did the send happen correctly" is not the same as the customer's view of "did the right message reach the right person at the right time". Agents accelerate the rate at which sends happen, which means they accelerate the rate at which the gap between those two views accumulates.
All five are also signals you would want when deciding whether to roll an agent back. Patterns one through four are immediate-response signals: detect, pause, investigate. Pattern five is the slow-burn signal that informs longer-cycle decisions about whether the agent is the right tool at all. The Sinch study we wrote about last week found that 74% of enterprises have rolled back a live AI customer comms agent, rising to 81% at organisations with mature guardrails. The teams doing the rollback well are the teams who can name what changed. Watching for these patterns is how you name it.
The other thing they share is that platform-native monitoring cannot close the gap on its own, because the platform reports on what it thinks it did. Closing the gap requires measurement from outside the platform, on the side where the customer actually receives.
A note on the data behind this post. The patterns above draw from the research cited (Chroma on context rot, Microsoft Security Research on LLM failure modes, industry deliverability data on reputation monitoring adoption), from public ESP status pages and incident reports, and from our own work with lifecycle teams in early access. As production deployments mature and the failure data becomes more publicly available, we will revisit these patterns with field data. For now, the framing is forward-looking by design, because the category is still establishing the baseline against which drift can be measured.
Telltide is built for that layer. Independent CRM journey monitoring that watches the five patterns above, and the rest of what platform-native monitoring misses, from the customer's inbox. If you have agents in production CRM, this is the layer you do not have yet. Start free at telltide.io.