AI Agents & incident management: the operational playbook

It is 2:37 P.M. on a Tuesday. An unexpected spike in requests hits the contact centre of an energy provider. The AI Agent managing the chat channel begins responding with above-threshold latency. Four minutes later, integrations with the CRM start returning intermittent timeouts following an uncommunicated downstream API change. The monitoring system intercepts the anomaly before the volume becomes visible to the client's operations team. This is how the problem is taken in hand, and communicated, before it turns into noise.

This is the moment that separates vendors. Not in the demo. Not in the first month after go-live, when everything is monitored and closely watched. But in the second year, when AI Agents have become critical infrastructure, volumes have grown, and something goes wrong.

The question every stakeholder should ask during procurement is not "Is your system reliable?" but "Show me exactly what happens when it isn't." This article is the answer.

Taxonomy and classification. Knowing what is happening before you act

A team of AI Agents in production can degrade in structurally different ways. Treating them all as "the system isn't working" is the fastest way to extend resolution times and communicate poorly with the client. The first useful step is a precise taxonomy.

Latency degradation. The AI Agents respond, but beyond the defined acceptability thresholds. In a voice channel, moving outside the "magic zone" of 2-3 seconds in response time breaks the natural flow of conversation. In a chat channel, exceeding 10 seconds leads to abandonment. The system is functioning, but not well enough to deliver the experience for which it was chosen.
Response error. The AI Agents produce output that is formally correct in format but wrong in content - they answer a question about a gas tariff with information relating to an electricity tariff, or cite a promotion that expired 48 hours ago. The cause is typically a misalignment between the knowledge base and the Agent configuration, which the update validation pipeline is designed to intercept and which, in residual cases, triggers the incident management process described here.
Failed integration. The AI Agents are unable to read from or write to an external system - CRM, back-office platform, ticketing system - due to a timeout, an uncommunicated API change, or a downstream authentication issue. The visible symptom for the end user is a vague response or an unnecessary escalation to a human operator.
Hallucination on a critical topic. The AI Agents generate a plausible but factually incorrect response on a high-impact subject - a contractual amount, a regulatory deadline, the conditions of a claim, or a cancellation procedure. In regulated sectors such as banking and insurance, this type of incident has implications that extend beyond customer experience and touch on compliance.
Service outage. The AI Agents do not respond. The channel is down. The impact is immediate and visible: conversations are not handled, and customers are redirected to alternative channels or abandon altogether. It is the easiest incident to detect, because the perimeter of the problem is clear - though the complexity of the resolution depends on the upstream cause.
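The five categories above can be sketched as a small classifier. This is a minimal illustration, not indigo.ai's implementation: the `Signal` fields and the precedence order are assumptions, the latency ceilings reuse the 3-second and 10-second figures quoted above, and "wrong content on a critical topic" is used as a simplified stand-in for a hallucination on a critical topic.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class IncidentType(Enum):
    LATENCY_DEGRADATION = "latency_degradation"
    RESPONSE_ERROR = "response_error"
    FAILED_INTEGRATION = "failed_integration"
    CRITICAL_HALLUCINATION = "critical_hallucination"
    SERVICE_OUTAGE = "service_outage"


# Illustrative ceilings in seconds, taken from the figures in the taxonomy.
LATENCY_LIMIT_S = {"voice": 3.0, "chat": 10.0}


@dataclass
class Signal:
    """Hypothetical summary of what monitoring observed for one channel."""
    channel: str                     # "voice" or "chat"
    channel_up: bool = True
    integration_failed: bool = False
    content_wrong: bool = False
    critical_topic: bool = False
    latency_s: float = 0.0


def classify(signal: Signal) -> Optional[IncidentType]:
    """Return the most specific incident type, or None if nothing is wrong."""
    if not signal.channel_up:
        return IncidentType.SERVICE_OUTAGE
    if signal.integration_failed:
        return IncidentType.FAILED_INTEGRATION
    if signal.content_wrong:
        # Simplification: wrong content on a high-impact subject is treated
        # as a critical hallucination, anything else as a response error.
        if signal.critical_topic:
            return IncidentType.CRITICAL_HALLUCINATION
        return IncidentType.RESPONSE_ERROR
    if signal.latency_s > LATENCY_LIMIT_S[signal.channel]:
        return IncidentType.LATENCY_DEGRADATION
    return None
```

The same 4.2-second response is an incident on a voice channel and acceptable on chat, which is why the category depends on the channel and not only on the raw number.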

Severity classification: P1, P2, P3

Severity is not a subjective judgement - it is an operational classification that determines who responds, with what urgency, and how communication is handled. Defining it in advance, in writing, and agreed upon with the client, is the prerequisite for any SLA that makes sense.

P1 - Immediate production impact. The service is interrupted or severely compromised for a significant proportion of end users. This includes complete channel outage, latency degradation exceeding the critical threshold under sustained volumes, or hallucination on a critical topic that has already reached real users. Requires immediate intervention and regular client updates until resolution.

P2 - Detectable degradation, continuity maintained. The service is operational but with degraded performance - elevated latency on a minority of sessions, response errors on a specific intent category, or a secondary system integration not functioning. The end user perceives a reduction in quality, but the channel holds. Requires intervention within the contractualised window, with client communication in the same interval.

P3 - Non-critical anomaly. Unexpected behaviours that do not measurably impact the user experience - a suboptimal response pattern on edge cases, a monitoring metric that deviates from the baseline without breaching alert thresholds, or an isolated error log. It is tracked, analysed, and included in the improvement cycle. It does not require urgent intervention or immediate client communication, but it is reported in the monthly aggregated report and during periodic service reviews.
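Writing the classification down in advance can be as literal as encoding it. The sketch below is one possible reading of the three levels: the function name, its parameters, and the 25% default for "a significant proportion of end users" are hypothetical placeholders, since the actual cut-offs are agreed with each client.

```python
from enum import Enum


class Severity(Enum):
    P1 = 1
    P2 = 2
    P3 = 3


def classify_severity(
    channel_down: bool,
    affected_share: float,
    critical_hallucination_reached_users: bool,
    user_visible: bool,
    p1_share: float = 0.25,  # hypothetical cut-off for "significant proportion"
) -> Severity:
    """Map observed impact to P1/P2/P3 following the definitions above."""
    # P1: outage, or a critical hallucination that already reached real users.
    if channel_down or critical_hallucination_reached_users:
        return Severity.P1
    # P1: degradation visible to a significant proportion of end users.
    if user_visible and affected_share >= p1_share:
        return Severity.P1
    # P2: users perceive reduced quality, but the channel holds.
    if user_visible:
        return Severity.P2
    # P3: anomaly with no measurable impact on the user experience.
    return Severity.P3
```

Because the rule is a pure function of observed impact, two engineers looking at the same incident reach the same severity, which is the point of not leaving it to subjective judgement.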

Detection and response. Intercepting the problem before it reaches the client

The prerequisite for good incident management is not depending on the client to discover that something is broken. This requires an observability layer that monitors continuously - 24/7, not only during staffed hours - across a set of operational metrics defined during onboarding.

Thresholds are not generic. They vary by sector, channel, and workflow design: acceptable latency on an asynchronous chat is not the same as that tolerated on an inbound voice channel, and the escalation rate that signals a problem depends on the conversational architecture in place. Every deployment has its own monitoring configuration, negotiated with the client and documented before go-live.

The dimensions typically monitored on a continuous basis include average and 95th-percentile latency, conversation completion rate, integration error rate, frequency of out-of-scope responses detected by guardrails, and channel availability. Depending on the deployment, these can be complemented by quality metrics evaluated on samples, response consistency with the knowledge base, conversation sentiment, and user rejection rate.
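To show why both average and 95th-percentile latency are tracked, here is a minimal sketch using the nearest-rank percentile definition. The function names and the limits passed in are illustrative assumptions; real limits come from the per-deployment configuration.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def check_latency(samples_s, avg_limit_s, p95_limit_s):
    """Compare average and p95 latency (seconds) against their limits."""
    avg = mean(samples_s)
    p95 = percentile(samples_s, 95)
    return {
        "avg_s": avg,
        "p95_s": p95,
        "avg_breached": avg > avg_limit_s,
        "p95_breached": p95 > p95_limit_s,
    }


# Eighteen fast responses and two slow ones: the average looks healthy,
# while the 95th percentile exposes the slow tail.
report = check_latency([1.0] * 18 + [8.0, 9.0], avg_limit_s=2.0, p95_limit_s=3.0)
```

In this sample the average is 1.75 seconds and stays under its limit, while the p95 is 8.0 seconds and breaches. An alert built on the average alone would miss the users in the slow tail.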

Alerts and escalation to the on-call team

When a threshold is breached, the system generates an alert. The escalation chain is predefined - not improvised in the moment of the emergency - and differentiated by severity.

A P1 alert immediately reaches the on-call team, available 24/7 under enterprise contracts. Not a generic ticketing system, but a team with direct access to the client's production configuration, able to intervene in the minutes following the alert.

A P2 alert reaches the service team with escalation to the client's technical lead if resolution exceeds the intervention window defined in the SLA.

A P3 alert is aggregated into the periodic monitoring report and brought to the client's attention in the regular review cycle - without generating unnecessary noise, but without concealing anything.
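A predefined escalation chain can be expressed as plain data, so nothing is decided in the moment. The sketch below mirrors the three routes just described; the `Route` structure and the recipient labels are hypothetical, and the intervention window is passed in because it is defined in each SLA.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    notify: str                        # who receives the alert first
    immediate: bool                    # False means "aggregate into the periodic report"
    escalate_to: Optional[str] = None  # who is added if the window is exceeded


ESCALATION = {
    "P1": Route(notify="on_call_team", immediate=True),
    "P2": Route(notify="service_team", immediate=True,
                escalate_to="client_technical_lead"),
    "P3": Route(notify="periodic_report", immediate=False),
}


def recipients(severity, minutes_open, intervention_window_min):
    """Return who must be involved for an alert of this severity and age."""
    route = ESCALATION[severity]
    targets = [route.notify]
    # A P2 escalates only once resolution exceeds the SLA intervention window.
    if route.escalate_to and minutes_open > intervention_window_min:
        targets.append(route.escalate_to)
    return targets
```

For example, `recipients("P2", 90, 60)` returns both the service team and the client's technical lead, while the same alert at 30 minutes stays with the service team.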

Client communication in the early stages

Communication during an active incident is a discipline in its own right. Three operational principles distinguish a mature vendor from one that improvises.

The objective is that the first contact occurs before the client notices the problem, or at most simultaneously. Receiving a call from a client reporting a service disruption that was never intercepted by the monitoring systems is the signal that thresholds need to be revised, not an event to be accepted as unforeseeable.

Updates are delivered at defined intervals, not "as soon as there is news". In a P1, updates follow a tight cadence defined in the SLA - even when the only news is that the team is working and has not yet identified the root cause. Silence during an incident is more destabilising than a neutral update.

The language is operational, not evasive. "We are looking into it" is not an update - it is a response that increases anxiety without reducing it. A useful update follows this structure: what is happening, what we are doing, when the next update will arrive.
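The three-part update structure lends itself to a template that cannot be sent without a next-update time. This is a sketch under assumptions: the class name, field names, 30-minute cadence, and sample date are placeholders, as the real cadence is the one defined in the SLA.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentUpdate:
    what_is_happening: str
    what_we_are_doing: str
    sent_at: datetime
    cadence: timedelta  # interval defined in the SLA (placeholder value below)

    @property
    def next_update_at(self) -> datetime:
        # The next update is scheduled by the clock, not by the arrival of news.
        return self.sent_at + self.cadence

    def render(self) -> str:
        return (
            f"What is happening: {self.what_is_happening}\n"
            f"What we are doing: {self.what_we_are_doing}\n"
            f"Next update: {self.next_update_at:%H:%M}"
        )


update = IncidentUpdate(
    what_is_happening="Intermittent timeouts on the CRM integration.",
    what_we_are_doing="Problematic requests are rerouted to human operators "
                      "while the root cause is investigated.",
    sent_at=datetime(2024, 1, 2, 14, 41),  # illustrative date
    cadence=timedelta(minutes=30),         # illustrative cadence
)
print(update.render())
```

Making the next-update time a computed field means an update stating that the root cause is still unknown is still a complete message.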

Resolution, root cause analysis, and post-mortem

Resolution actions follow an intervention hierarchy ordered by impact and reversibility.

The first line is always scope limitation - narrowing the Agent's response perimeter to the areas where it is behaving correctly, rerouting problematic requests to a human operator or to a courtesy message that does not expose the brand to risk. It is a temporary measure, but it keeps the service partially operational while work on the root cause proceeds.

The second line is rollback to the previous configuration. In indigo.ai's Self-improving Agents, suggested improvements act on the Agent's configuration - proposed by the system and approved by a human operator - not on the weights of the underlying model. Operating at the configuration level, rather than through fine-tuning, makes rollback fast, deterministic, and reversible, executable in production without perceptible service interruption. A rollback from fine-tuning, by contrast, typically requires hours or days.

The third line is direct intervention on the configuration - modifying instructions, updating the knowledge base, correcting guardrail policies - applied after the root cause has been identified with sufficient certainty to avoid introducing new issues.
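The first two lines of intervention can be sketched in a few lines of code. Everything here is hypothetical and only illustrates the idea: `route` narrows the Agent's perimeter by intent, and `ConfigHistory` shows why a configuration-level rollback is deterministic, since it simply restores a previously stored version. It does not describe indigo.ai's internal data model.

```python
def route(intent, restricted_intents, human_available):
    """Scope limitation: keep the Agent only on intents where it behaves correctly."""
    if intent not in restricted_intents:
        return "agent"
    return "human_operator" if human_available else "courtesy_message"


class ConfigHistory:
    """Ordered history of Agent configurations (illustrative structure)."""

    def __init__(self, initial):
        self._versions = [dict(initial)]

    @property
    def active(self):
        # Return a copy so callers cannot mutate stored versions.
        return dict(self._versions[-1])

    def apply(self, changes, approved_by):
        """Store a new version; changes require a named human approver."""
        if not approved_by:
            raise PermissionError("a human approver is required")
        self._versions.append({**self._versions[-1], **changes})

    def rollback(self):
        """Restore the previous version exactly as it was stored."""
        if len(self._versions) < 2:
            raise RuntimeError("no previous configuration to restore")
        self._versions.pop()
        return self.active


history = ConfigHistory({"tariff_kb": "v12"})
history.apply({"tariff_kb": "v13"}, approved_by="ops@example.com")
restored = history.rollback()  # back to {"tariff_kb": "v12"}
```

A production system would presumably keep the discarded version for audit instead of dropping it; the sketch omits that to stay short.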

Root cause analysis. Structured methodology

Root cause analysis is not an optional activity to carry out "if there is time" after a P1. It is a mandatory process with documented output, feeding both prevention and the post-mortem shared with the client.

The methodology follows a four-step structure.

Timeline reconstruction. The starting point is always a precise timeline - not an approximate narrative. When was the first anomalous signal detected? When did the alert trigger? When did the actual degradation begin, even if not yet visible? Distributed tracing systems allow this sequence to be reconstructed with fine-grained accuracy, identifying the component where the issue originated.
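As a rough illustration of timeline reconstruction, the sketch below orders events collected from different components and measures the gap between the start of degradation and the alert. The event tuples, marker names, and timestamps are invented for the example; in practice they would come from the tracing system.

```python
from datetime import datetime, timedelta


def build_timeline(events):
    """Sort (timestamp, component, marker) tuples chronologically."""
    return sorted(events, key=lambda event: event[0])


def origin_component(timeline):
    """Component that emitted the earliest anomalous signal."""
    return timeline[0][1]


def detection_delay(timeline, start_marker, alert_marker) -> timedelta:
    """Time between the start of degradation and the alert firing."""
    start = next(ts for ts, _, marker in timeline if marker == start_marker)
    alert = next(ts for ts, _, marker in timeline if marker == alert_marker)
    return alert - start


# Events arrive out of order from different sources (illustrative data).
events = [
    (datetime(2024, 1, 2, 14, 41), "crm_integration", "first_timeout"),
    (datetime(2024, 1, 2, 14, 37), "chat_agent", "degradation_start"),
    (datetime(2024, 1, 2, 14, 39), "monitoring", "alert_fired"),
]
timeline = build_timeline(events)
delay = detection_delay(timeline, "degradation_start", "alert_fired")
```

The detection delay is itself a useful output of the analysis: if it is long, the monitoring thresholds are a candidate root cause alongside the technical fault.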

Root cause identification. The five-whys technique is applied systematically. The objective is not to find something to fix - it is to find the mechanism that allowed the problem to occur, so that the intervention targets that mechanism rather than the symptom.

Impact analysis. How many sessions were affected? What proportion of users received an incorrect or degraded response? Were there downstream impacts, such as incorrectly opened tickets or unnecessary escalations? If the incident involved personal data - for example, an Agent that returned one customer's data to a different user - indigo.ai, as data processor, notifies the client without undue delay and provides all the technical evidence needed to assess the risk. The decision on notifications to the authorities and to data subjects rests with the client as data controller, in line with the allocation of responsibilities set out in the Data Processing Agreement.
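The impact questions above reduce to a few counts over the sessions in the incident window. A minimal sketch, assuming each session record already carries flags for degradation and downstream effects (the `Session` fields are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class Session:
    session_id: str
    degraded: bool
    escalated_unnecessarily: bool = False
    ticket_opened_in_error: bool = False


def impact_summary(sessions):
    """Quantify how many sessions were affected and with what downstream effects."""
    total = len(sessions)
    affected = [s for s in sessions if s.degraded]
    return {
        "total_sessions": total,
        "affected_sessions": len(affected),
        "affected_share": len(affected) / total if total else 0.0,
        "unnecessary_escalations": sum(s.escalated_unnecessarily for s in affected),
        "tickets_opened_in_error": sum(s.ticket_opened_in_error for s in affected),
    }
```

The resulting figures are what the post-mortem reports as quantified impact, in place of a qualitative description such as "some users were affected".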

Non-recurrence plan. For each root cause identified, a specific action with a named owner and a deadline. Not "we will improve monitoring", but a targeted intervention with a defined metric and completion date.
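One way to enforce "a named owner and a deadline" is to make an action item invalid without them. The structure below is an illustrative assumption, not a prescribed format:

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CorrectiveAction:
    root_cause: str
    action: str
    owner: str    # a named person, not a team alias
    metric: str   # how completion is verified
    due: date
    done: bool = False

    def problems(self):
        """List what is missing for this action to count as specific."""
        issues = []
        if not self.owner.strip():
            issues.append("missing owner")
        if not self.metric.strip():
            issues.append("missing metric")
        return issues

    def is_overdue(self, today: date) -> bool:
        return not self.done and today > self.due
```

An entry such as "we will improve monitoring" fails this check twice, having neither an owner nor a metric, which is the distinction the plan is meant to enforce.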

Post-mortem shared with the client

The post-mortem is the document through which a vendor demonstrates that it cares about the relationship beyond the individual emergency. Its value lies not in the description of what went wrong - which the client already knows - but in the transparency of the analysis process and in the concrete assurances that the same problem will not recur.

The standard format includes an executive summary readable by a CEO without a technical background, a reconstructed timeline, the identified root cause, the quantified impact, corrective actions with owner and deadline, and a structured follow-up plan.

The delivery timeline is part of the SLA contractualised during onboarding, differentiated by incident severity. Not "when it's ready".

A structured follow-up review, typically in the weeks following resolution, is not a formality. It is where the vendor verifies that corrective actions have actually been implemented, and where the client gets access to post-incident monitoring data to validate that the anomalous behaviour has not recurred.

Prevention. Every incident feeds the system

The cycle closes with prevention. Every documented incident contributes to three concrete outputs.

The first is an update to the monitoring system - new thresholds, new parameters, and new alert scenarios that the incident revealed as uncovered. Monitoring is not configured once and for all at go-live: it is refined over time, fed by real operational experience.

The second is an update to pre-deployment tests. Every identified root cause translates into an additional test in the update validation pipeline. If an incident was caused by a knowledge base change that was not tested against a subset of critical intents, that subset becomes a permanent part of the regression suite.
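To make the regression idea concrete, here is a minimal sketch of a critical-intent check run before a knowledge base change is released. The substring match is a deliberate simplification, and `answer_fn`, the sample case, and the stub agents are hypothetical; a real pipeline would use a richer evaluation of each answer.

```python
def run_regression(answer_fn, critical_cases):
    """Return the questions whose answer lacks the phrase it must contain."""
    failures = []
    for question, must_contain in critical_cases:
        answer = answer_fn(question)
        if must_contain.lower() not in answer.lower():
            failures.append(question)
    return failures


# Permanent subset of critical intents, extended after each incident.
CRITICAL_CASES = [
    ("What is the current gas tariff?", "gas"),
]


def misaligned_agent(question):
    # Stub reproducing the response error from the taxonomy:
    # a gas question answered with electricity information.
    return "Your electricity tariff is listed in your contract."


def aligned_agent(question):
    return "Your gas tariff is listed in your contract."


blocked = run_regression(misaligned_agent, CRITICAL_CASES)
passed = run_regression(aligned_agent, CRITICAL_CASES)
```

A non-empty failure list blocks the release, so the incident that motivated the test case cannot be reintroduced silently by a later update.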

The third is input into the Self-improving Agents cycle. The conversations affected by the incident are analysed by the Observer Agent to extract patterns, evaluate the quality of the degraded responses, and propose targeted configuration updates - with human approval required before any change is applied in production.

An AI system in production at enterprise volumes is never immune to degradation, edge cases, or integration surprises. The question is not whether it will happen - it is who you want beside you when it does. A vendor who learns about the problem from you and responds by improvising, or one who has already notified you, is already working on the root cause, and will deliver a structured post-mortem within the SLA-defined timeline?

Structured incident management is not just good engineering. It is the most concrete form of trust an enterprise vendor can build.

FAQ

How are monitoring thresholds defined, and who has visibility into them?

Thresholds are defined during onboarding together with the client's technical team, based on expected volumes, sector, and channel criticality. They are not standard values applied uniformly. The client has real-time access to the monitoring dashboard, and any updates to the thresholds are agreed upon and tracked.

What happens if an incident involves the personal data of end users?

This is an area we handle with its own priority, separate from technical incident management. As a data processor, indigo.ai notifies the client without undue delay and supports the risk assessment with all the necessary technical evidence. The obligations to notify the authorities and, where required, to communicate the breach to affected users remain with the client as data controller, as set out in the Data Processing Agreement.

Is a post-mortem delivered for P3 incidents as well?

For P3s, the format is simplified: a monthly report of aggregated anomalies with the corrective actions taken, rather than a standalone document for every event. The principle is that every anomaly is tracked and analysed, but the depth of reporting remains proportional to impact.
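The aggregation behind such a monthly report can be sketched as a count of anomalies per month and category. The input shape and category names are assumptions made for the example:

```python
from collections import Counter
from datetime import date


def monthly_p3_report(anomalies):
    """Group (day, category) anomaly records into per-month category counts."""
    report = {}
    for day, category in anomalies:
        month = f"{day:%Y-%m}"
        report.setdefault(month, Counter())[category] += 1
    return report


sample = monthly_p3_report([
    (date(2024, 1, 3), "isolated_error_log"),
    (date(2024, 1, 20), "isolated_error_log"),
    (date(2024, 2, 1), "edge_case_response"),
])
```

Every anomaly still appears in the output, but as one line in an aggregate instead of a standalone document, which keeps the depth of reporting proportional to impact.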
