Nimrod Broshy

Nimrod Broshy is Director of Product at ASAPP, where he leads the Supervisor Suite — a set of products that gives enterprises visibility and control over AI-driven customer interactions. His work focuses on helping organizations optimize AI agents, quantify business impact, and identify high-value automation opportunities, while turning real-time customer conversations into actionable insights.
Moving beyond containment: How to truly measure the performance of your AI agent
As AI agent adoption accelerates, CX leaders are facing a fundamental question: how do you know if your AI agent is doing a good job?
For years, organizations have relied on a simple metric to answer that question: containment rate. If the AI agent handled an interaction without escalating it to a human, it was considered a success. If not, it was viewed as a failure.
But in an era of sophisticated AI agents that can reason, act, and collaborate with humans, containment is an increasingly blunt instrument. It measures whether an interaction stayed within the automated system, but it says very little about whether the customer’s problem was actually solved—or whether the experience was any good.
It also fails to account for a critical performance and safety issue: Did the AI agent hallucinate or make other mistakes? And if so, did those errors affect the outcome for the customer or the enterprise?
To truly assess the impact of AI agents, CX leaders need a more comprehensive evaluation framework that focuses on the value delivered – and factors in the accuracy of the agent’s responses and actions.
The limits of containment
Containment has always been an imperfect proxy for success. It assumes that if an interaction didn’t require a human agent, the outcome must have been positive. It also assumes that if an interaction involved a human agent at all, the automation failed.
Neither of those assumptions is necessarily true.
Imagine a customer asking an AI agent whether they can bring a 100-kilogram dog on a flight. If the AI agent confidently says yes, and the customer hangs up satisfied, the interaction would count as contained. From a traditional reporting standpoint, that looks like a win.
But if the AI agent delivered incorrect information, that would create a much bigger problem for both the customer and the airline. The traveler might arrive at the airport with their dog only to discover it cannot board the flight. In a case like this, containment masks a serious failure.
On the other hand, if the AI agent transferred the customer to a human, the interaction would count as a containment failure. Yet if the agent had already gathered the necessary details from the customer, looked up the relevant policies, and passed that context to the human agent, it would have moved the customer toward resolution quickly and with minimal effort.
And if the AI agent simply consulted a human for a judgment call without ever transferring the customer, the experience would be better still. Even then, the interaction would not fully meet the containment standard.

This example highlights why CX leaders need to rethink what success actually means for automated interactions. The goal of an AI agent is not simply to contain conversations. It’s to deliver value by resolving the customer’s issue accurately while minimizing friction for the customer.
And that requires a different way of measuring performance.
Measuring performance through goal completion
A more meaningful metric is goal completion: whether the AI agent achieved the outcomes the customer came for.
Instead of asking whether an interaction was contained, CX leaders should ask whether the AI agent successfully resolved one or more of the customer’s goals.
Customers often contact support with multiple needs. They might want to check the status of an order, update their contact information, and ask a question about a policy—all in the same conversation.
If an AI agent successfully completes even one of those goals before transferring the customer to a human agent, it has already delivered real value. The customer spends less time waiting, and the human agent handles less of the interaction. Overall, the AI agent has improved both the customer experience and contact center productivity.
This way of thinking reframes automation as a contributor to resolution, rather than an all-or-nothing replacement for human support.
Goal completion recognizes the reality of modern CX operations: the most effective systems are hybrid environments where AI and humans collaborate to solve customer problems. The AI agent handles what it can confidently resolve and escalates the rest when human judgment is required.
Evaluating success through this lens allows CX leaders to better understand how automation contributes to outcomes, instead of simply measuring how often it replaces a human.
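To make the contrast concrete, here is a minimal sketch of how the two metrics diverge on the same interactions. The `Interaction` record and its fields are illustrative assumptions, not any platform's actual schema:

```python
from dataclasses import dataclass

# Hypothetical interaction record; field names are illustrative only.
@dataclass
class Interaction:
    goals: list[str]            # goals the customer raised
    goals_completed: list[str]  # goals the AI agent resolved
    escalated_to_human: bool    # whether a human took over

def containment_rate(interactions):
    """Traditional metric: share of interactions with no human handoff."""
    contained = sum(1 for i in interactions if not i.escalated_to_human)
    return contained / len(interactions)

def goal_completion_rate(interactions):
    """Share of all customer goals the AI resolved, regardless of
    whether the conversation was later escalated."""
    total = sum(len(i.goals) for i in interactions)
    done = sum(len(i.goals_completed) for i in interactions)
    return done / total

interactions = [
    # Fully automated, both goals resolved.
    Interaction(["order_status", "update_email"],
                ["order_status", "update_email"], False),
    # Escalated, but one of two goals already resolved by the AI.
    Interaction(["order_status", "policy_question"],
                ["order_status"], True),
    # Escalated with nothing resolved.
    Interaction(["policy_question"], [], True),
]

print(round(containment_rate(interactions), 2))      # 0.33
print(round(goal_completion_rate(interactions), 2))  # 0.6
```

Containment scores this sample harshly (one contained interaction in three), while goal completion credits the AI for the three of five goals it actually resolved.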
Guardrails that ensure safety and accuracy
Delivering value is only part of the equation, though. AI agents must also operate within strict behavioral standards.
If an AI agent’s responses contain hallucinations or other incorrect information, or violate company policy, the interaction can’t be considered successful.
Ensuring this level of reliability requires more than just careful prompting and training. Leading AI systems use automated guardrails that continuously evaluate the agent’s responses in real time.
These automated guardrails help ensure that the AI agent is not just active, but reliable and trustworthy, two qualities that are essential for customer-facing automation.
The guardrails can also enable automated quality reporting and real-time alerts for potential issues that reach a predetermined impact threshold. That visibility is a key component of measuring the AI's performance and value. Flagged issues also provide a feedback loop that guides fine-tuning, so the AI agent's performance continuously improves.
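The alerting behavior described above can be sketched as a sliding-window monitor: each response is checked, and an alert fires once the flag rate crosses the impact threshold. The window size, threshold, and the idea of a boolean per-response flag are all assumptions for illustration; real guardrails typically use model-based evaluators rather than simple flags:

```python
from collections import deque

class GuardrailMonitor:
    """Tracks flagged AI responses over a sliding window and signals an
    alert when the flag rate crosses a predetermined impact threshold.
    Window and threshold values here are illustrative assumptions."""

    def __init__(self, window: int = 100, threshold: float = 0.05):
        self.recent = deque(maxlen=window)  # 1 = flagged, 0 = clean
        self.threshold = threshold

    def record(self, flagged: bool) -> bool:
        """Record one response check; return True if an alert should fire."""
        self.recent.append(1 if flagged else 0)
        flag_rate = sum(self.recent) / len(self.recent)
        return flag_rate > self.threshold

monitor = GuardrailMonitor(window=50, threshold=0.10)
print(monitor.record(False))  # False: no flags yet, no alert
print(monitor.record(True))   # True: 1 of 2 recent responses flagged
```

The key design point is that no single hallucination triggers an alarm; the monitor reacts to the *rate* of issues, which maps to the impact-threshold idea above.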
Why observability is non-negotiable
Even with advanced guardrails, one principle remains true – you cannot measure (or fix) what you cannot see. Deep observability is critical for understanding how an AI agent actually performs in the real world.
CX leaders need tools that provide full transparency into the AI’s decision-making process. That means more than just reviewing a transcript. Teams should be able to see the actions the agent took, the reasoning behind those actions, and the information it relied on to generate its responses.
This level of visibility makes it much easier to diagnose issues and improve performance. If an AI agent fails to resolve a request, teams can identify whether the problem was caused by missing knowledge, an incorrect API call, or a hallucination.
Even as automation becomes more sophisticated, human oversight remains essential. Observability tools must make it easy for CX teams to conduct manual reviews, investigate anomalies, and ensure the system continues to meet quality standards.
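One lightweight way to picture this kind of observability is a structured decision trace: for every step, the system records the action taken, the reasoning behind it, and the sources it relied on, so reviewers see more than the transcript. The record shape and field names below are hypothetical, chosen only to mirror the three elements named above:

```python
import json
import time

def trace_step(log: list, action: str, reasoning: str, sources: list[str]) -> None:
    """Append one attributed step of the agent's decision process.
    Field names are an illustrative assumption, not a real schema."""
    log.append({
        "ts": time.time(),
        "action": action,        # e.g. a tool or API the agent invoked
        "reasoning": reasoning,  # why the agent chose this action
        "sources": sources,      # knowledge articles / data it relied on
    })

log: list = []
trace_step(log, "lookup_pet_policy",
           "Customer asked about flying with a 100 kg dog",
           ["kb://pet-travel-policy"])
trace_step(log, "escalate_to_human",
           "Policy requires human judgment on oversized pets", [])

print(json.dumps(log, indent=2))
```

With traces like this, a failed resolution can be pinned to a specific cause: a missing knowledge source, a wrong action, or a response with no source behind it at all.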
Scoring AI agents like human agents
Contact centers have long relied on structured quality assurance programs to evaluate human agents and guide performance improvement. Conversations are reviewed, scored, and used to identify coaching opportunities.
AI agents should be evaluated with the same rigor. A comprehensive scoring framework should assess not only accuracy, but also experience quality.
For example, a conversation might technically resolve the customer’s request but still create unnecessary friction. Perhaps the AI agent asked the customer to repeat information multiple times, or it failed to recognize a conversational cue indicating frustration.
These details matter. Customer experience is shaped not just by whether a problem is solved, but by how smoothly the resolution occurs.

A holistic scoring model should incorporate factors such as:
- Whether the customer’s goals were achieved
- Whether the agent provided accurate and grounded information
- Whether the conversation flowed naturally
- Whether the customer encountered friction or delays
- Whether the AI escalated appropriately when needed
By assigning a unified quality score to every automated interaction, CX leaders gain a clear picture of how their AI agents are performing across the board.
Low-scoring conversations can then be flagged for human review, allowing teams to identify patterns and make targeted improvements. Over time, this creates a continuous feedback loop that strengthens both the AI system and the customer experience.
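A unified score like this can be sketched as a weighted average of the factors listed above, with low scorers routed to human review. The weights and the review threshold are illustrative assumptions, not a standard rubric; in practice each factor would itself come from an automated or human evaluation:

```python
# Factor names mirror the list above; weights are illustrative assumptions.
WEIGHTS = {
    "goals_achieved": 0.35,
    "accuracy": 0.30,
    "conversation_flow": 0.15,
    "low_friction": 0.10,
    "appropriate_escalation": 0.10,
}
REVIEW_THRESHOLD = 0.70  # conversations below this go to human review

def quality_score(factors: dict[str, float]) -> float:
    """Weighted average of per-factor scores, each in [0, 1]."""
    return sum(WEIGHTS[name] * factors[name] for name in WEIGHTS)

def needs_review(factors: dict[str, float]) -> bool:
    return quality_score(factors) < REVIEW_THRESHOLD

# A conversation that resolved the goal but gave partially inaccurate info.
convo = {"goals_achieved": 1.0, "accuracy": 0.5, "conversation_flow": 0.8,
         "low_friction": 0.9, "appropriate_escalation": 1.0}

print(round(quality_score(convo), 2))  # 0.81
print(needs_review(convo))             # False
```

Keeping accuracy and goal achievement as the heaviest weights reflects the article's argument: a smooth conversation built on wrong information should still score poorly.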
Rethinking what success looks like
As AI agents become more capable, the metrics used to evaluate them must evolve as well.
Containment alone cannot capture the complexity of modern automated interactions. It tells you whether the AI handled the conversation – but not whether it handled it well.
A more meaningful evaluation framework focuses on value delivered and experience quality. By measuring goal completion, enforcing strong guardrails, enabling deep observability, and applying comprehensive quality scoring, CX leaders can gain a far more accurate understanding of their AI agents’ performance.
In doing so, they shift the conversation away from simple deflection metrics and toward a more important question:
Is the AI actually making the customer experience better?
When that becomes the standard for success, AI agents stop being tools for reducing workload and start becoming engines for delivering real value—to both customers and the business.