
Devidas Desai

Devidas Desai leads the Product and Design teams at ASAPP. Devidas is a seasoned product leader with over two decades of experience, and has consistently driven innovation at the intersection of enterprise communications and conversational AI. He is known for building products that blend technical depth with user-centric design, resulting in meaningful customer impact and business growth.

Prior to joining ASAPP, he served as SVP of Product Management at PolyAI, where he spearheaded the development of voice-first customer service solutions and launched Agent Studio—the world's first generative AI-powered, voice-first omnichannel CX platform. Earlier in his career, Devidas held product leadership roles at RingCentral, where he led the UCaaS product portfolio, and at Symphony.com, where he oversaw applications used by nearly half a million professionals across the world's largest financial institutions.

Generative AI for CX
Enterprise AI Systems

ASAPP’s Perspective on the AI Arms Race

by Devidas Desai
Article · Aug 25 · 3 minutes

Thriving with purpose, not just speed

The AI arms race is in full swing. Big Tech is spending billions, startups are surfacing weekly, and everyone’s showcasing the latest flashy demo or largest parameter count. But speed alone doesn’t define progress. At ASAPP, we believe the only metric that matters is outcomes.

We’re focused on one of the most complex and consequential problems in enterprise AI: automating high-stakes conversations between enterprises and their customers. That means solving for reliability, deployment at scale, and real business value - not novelty.

What sets ASAPP apart

1. Prioritizing measurable outcomes over hype

Enterprise buyers don’t care about whose model has more parameters. They care about what moves the needle for their business. 

Did containment improve? Did AHT drop? Did CSAT rise?

That’s what we track. That’s what we deliver.

ASAPP technology is evaluated by customer KPIs, not vanity metrics.

2. Built for enterprise from day one

We aren’t retrofitting consumer-grade tech for the contact center. We’ve spent years engineering the infrastructure that powers mission-critical, real-time, and secure enterprise use cases:

  • Sub-second latency
  • High availability
  • SOC 2 compliance
  • Deep integrations with CCaaS and workflow systems

ASAPP is already deployed at scale across banking/financial services, telecom, insurance, travel & hospitality, and more.

3. Humans in the loop, by design - not as a backup plan

We don’t treat humans as fallbacks. We design systems, such as our Human-in-the-Loop Agent (HILA™) workflow, where AI and agents work in tandem.

Our GenerativeAgent offloads the repetitive, the structured, and the transactional so human agents can focus on interactions that require judgment, empathy, and escalation. 

This isn’t about replacing agents. It’s about increasing their capacity and confidence.

4. Investing in a true agentic ecosystem

We don’t believe in building one giant, do-everything bot. That approach breaks under real-world complexity. The future of automation is agentic—a coordinated network of intelligent agents, each designed for a specific role, working together in real time.

At the center is GenerativeAgent - our core AI agent built to handle the hardest customer conversations with accuracy, fluency, and context retention. But it doesn’t operate alone. 

GenerativeAgent is surrounded by an expanding ecosystem of specialized agents that enhance its performance, resilience, and adaptability.

  • Discovery Agents that handle upfront issue scoping and accelerate resolution paths
  • QA Agents that audit conversations and recommend improvements
  • Simulator Agents that run adversarial tests to surface edge cases and failure modes
  • BI Agents that convert dashboards into conversational analytics for business users and stakeholders

These agents don’t just support GenerativeAgent; they continuously improve it. Every interaction, review, and simulation feeds intelligence back into the system. 

This isn’t automation as usual. It’s a learning architecture that shifts how enterprises design, deploy, and refine AI systems.

The result? Higher uptime. Faster iteration. Lower risk. And a foundation built not just for cost savings, but for compounding intelligence over time.

5. Research that ships

Our research isn’t theoretical; it’s operational.

We turn advances in retrieval, grounding, and evaluation into production improvements - faster deflection, reduced hallucinations, and more natural interactions. 

Breakthroughs are only valuable when they reach customers’ hands.

6. Trust isn’t a layer. It’s our foundation.

The AI race is riddled with risk: hallucinations, bias, security threats, data breaches, and loss or exposure of PII. We don’t minimize those risks by tacking on safety measures as an afterthought. Instead, we engineer safety into the system as a foundational requirement.

Model grounding, observability, human control, and continuous monitoring are built into our stack. 

Safety isn’t an initiative. It’s a requirement.

Beyond the arms race

In short, ASAPP isn’t chasing headlines. We’re solving the problems that matter the most to our customers. In the AI arms race, our goal isn’t to be the loudest, but to be the most trusted, the most rigorous, and the most effective.

That’s how you thrive. Not by moving fast for the sake of it, but by moving with purpose - all while delivering results.

Generative AI for CX
Measuring Success

How to measure your generative AI agent performance (and why you can’t afford to get this wrong)

by Devidas Desai
Article · Jul 17 · 5 minutes

Most enterprises still evaluate generative AI like it’s a toy, measuring novelty instead of reliability. The AI sounded good. It used the right tone. It didn’t hallucinate (much).

But that’s not a real measurement of your generative AI agent. That’s demo theater.

In Priya’s article, she made the case for discarding “human-like” as the benchmark and replacing it with outcome-focused performance. I agree—and I’ll go one step further.

This article lays out what you need to measure, why it matters, and how to do it right, focusing on the parts that protect your brand and your bottom line.

Two measurement categories that actually matter

Let’s not overcomplicate this. There are two buckets worth tracking:

  1. Empirical metrics: These tell you what the system is doing—accuracy, resolution rates, escalation, latency, and error rates. If these numbers don’t exist in your reporting layer, you’re flying blind.
  2. Experiential metrics: These tell you how it feels to the user—clarity, effort, trust, satisfaction, and return usage. They don’t replace hard data; they validate it.

To effectively measure your generative AI agent, you need both types of metrics. Measuring only one is how failed pilots go undiagnosed until you hit scale and start losing customers.

Key metrics (and what they actually protect)

If you’re only measuring sentiment or “positive interactions,” you’re not measuring anything meaningful. Metrics should exist to detect risk, quantify impact, and drive action. Below are the metrics for measuring the performance of your generative AI agent: signals that tell you whether the agent is delivering business value or quietly failing at scale.

Empirical metrics

First Contact Resolution

  • Why it matters: It’s the most honest proxy for whether your generative AI agent works.
  • Track: % of sessions fully resolved by AI without handoff or reopen.
  • Target: Exceed human baseline within 90 days.
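To make that definition concrete, here is a minimal Python sketch of the FCR calculation. The `Session` record shape is hypothetical, not an ASAPP schema; adapt the field names to whatever your logging layer actually captures.

```python
from dataclasses import dataclass

@dataclass
class Session:
    # Hypothetical session record; field names are illustrative only.
    resolved_by_ai: bool  # the AI closed the issue itself
    handed_off: bool      # escalated to a human agent
    reopened: bool        # the customer came back on the same issue

def first_contact_resolution(sessions: list[Session]) -> float:
    """Percent of sessions fully resolved by AI without handoff or reopen."""
    if not sessions:
        return 0.0
    resolved = sum(
        1 for s in sessions
        if s.resolved_by_ai and not s.handed_off and not s.reopened
    )
    return 100.0 * resolved / len(sessions)

sessions = [
    Session(True, False, False),   # clean AI resolution
    Session(True, False, True),    # reopened: doesn't count
    Session(False, True, False),   # handed off: doesn't count
    Session(True, False, False),
]
print(f"FCR: {first_contact_resolution(sessions):.0f}%")  # FCR: 50%
```

Note that a reopen disqualifies the session even though the AI “resolved” it; counting those as wins is how FCR gets inflated.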

Error Rate

  • Why it matters: Uncaught errors compound quietly and publicly.
  • Track: Incorrect intents, misrouted flows, wrong data returned.
  • Fix: Tighten prompts, adjust training data, and raise confidence thresholds.

Containment

  • Why it matters: Abandonment and escalation are signals of failure.
  • Track: % of users staying in the channel. Break down by intent and flow.
  • Guard against: Over-containment at the cost of CX.
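A sketch of the per-intent breakdown in code. The `(intent, stayed_in_channel)` pair is an assumed shape for illustration, not a real schema:

```python
from collections import defaultdict

def containment_by_intent(sessions):
    """Percent of users who stayed in the AI channel, broken down by intent.
    Each session is an (intent, stayed_in_channel) pair -- an illustrative
    shape, not a real schema."""
    totals = defaultdict(int)
    contained = defaultdict(int)
    for intent, stayed in sessions:
        totals[intent] += 1
        contained[intent] += int(stayed)
    return {i: 100.0 * contained[i] / totals[i] for i in totals}

data = [
    ("billing", True), ("billing", True), ("billing", False),
    ("cancel_service", False), ("cancel_service", True),
]
for intent, pct in containment_by_intent(data).items():
    print(f"{intent}: {pct:.0f}% contained")
```

Breaking the number down this way is what guards against over-containment: a healthy overall rate can hide one intent where users are trapped rather than helped.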

Escalation Frequency

  • Why it matters: Escalation is expensive. Frequent escalation = low trust or poor design.
  • Track: Trigger reasons—low confidence, policy boundaries, repeat queries.

Latency

  • Why it matters: Delay is the enemy of confidence, especially in voice interactions.
  • Track: Time to first response and time to resolution.
  • Expectations: <1.5s for simple tasks.
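Latency targets are easiest to police as percentiles rather than averages. A small nearest-rank percentile sketch (the sample values are invented):

```python
def latency_percentile(samples_ms, pct):
    """Nearest-rank percentile of response-time samples, in milliseconds."""
    ranked = sorted(samples_ms)
    k = max(0, min(len(ranked), round(pct / 100 * len(ranked))) - 1)
    return ranked[k]

# Invented first-response times; the <1.5s expectation applies to simple tasks.
samples = [400, 620, 800, 900, 1100, 1300, 1450, 1700, 2100, 3000]
p50 = latency_percentile(samples, 50)
p95 = latency_percentile(samples, 95)
print(f"p50={p50}ms p95={p95}ms")  # p50=1100ms p95=3000ms
```

A median under 1.5s can still hide a long tail, so watch p95/p99 as well as p50, especially on voice.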

Confidence Calibration

  • Why it matters: A model that doesn’t know when it’s wrong is dangerous.
  • Track: Alignment between model confidence and actual outcome accuracy.
  • Use: To govern automation vs. escalation logic.
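One way to check calibration is to bucket predictions by stated confidence and compare each bucket’s average confidence to its observed accuracy. A rough sketch, with an assumed `(confidence, was_correct)` data shape:

```python
def calibration_table(predictions, n_bins=5):
    """Bucket (confidence, was_correct) pairs by stated confidence and
    compare average confidence to observed accuracy in each bucket.
    A positive gap means the model is overconfident in that range."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in predictions:
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, correct))
    rows = []
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        rows.append((avg_conf, accuracy, avg_conf - accuracy))
    return rows

preds = [(0.15, False), (0.5, False), (0.55, True),
         (0.9, True), (0.92, False), (0.95, True)]
for avg_conf, acc, gap in calibration_table(preds):
    print(f"conf≈{avg_conf:.2f} acc={acc:.2f} gap={gap:+.2f}")
```

Buckets with large positive gaps are exactly where automation should yield to escalation, which is how this metric governs the automation vs. escalation logic.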

Learning Velocity

  • Why it matters: The cost of AI failure is in how long it stays broken.
  • Track: Time from gap detection → fix → deployment.
  • Target: Days, not weeks or months.
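Learning velocity is simple date arithmetic once the three timestamps are logged (a trivial sketch; the dates are invented):

```python
from datetime import date

def learning_velocity(detected: date, fixed: date, deployed: date):
    """Days from gap detection to fix, and from detection to deployment."""
    return (fixed - detected).days, (deployed - detected).days

fix_days, ship_days = learning_velocity(
    date(2025, 3, 3), date(2025, 3, 5), date(2025, 3, 7)
)
print(fix_days, ship_days)  # 2 4
```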

Experiential metrics

CSAT / NPS

  • Track: AI-handled vs. human-handled outcomes. Break down by workflow.
  • Avoid: Using this in isolation. Always pair with resolution + error rates.

Effort

  • Track: Survey or behavioral proxies (rephrasing, repeated queries).
  • Use: To identify friction points, not as a vanity score.

Trust Signals

  • Track: Drop-offs after vague messages (“Checking now…”).
  • Fix: Clearer next-step prompts and timeout handling.

Sentiment Drift

  • Track: Sentiment trends across interactions/conversations. Watch for frustration triggers.
  • Act: Adjust flows where tone or repetition causes friction.

Retention & Adoption

  • Track: Opt-in vs. opt-out rates, and usage trends that indicate customers are willing to engage with the AI agent again.
  • Interpretation: Low repeat usage = a trust gap. Don’t ignore it.

Protect the brand, drive operational value, and scale with confidence

Governance - you don’t scale what you don’t control

Metrics aren’t a dashboard exercise. They should be seen as operational insurance. Here’s what enterprise governance actually looks like in practice:

Baselines and Targets

  • Establish human-agent benchmarks before go-live.
  • Set 30/60/90-day performance targets by metric.
  • Don’t launch new intents without clear success criteria.

Data Collection & Instrumentation

  • Log every decision point: intents, actions, engagement, latency.
  • Map user paths. Track where they drop, re-enter, or escalate.
  • Ensure privacy compliance. No excuses.

Analytics Infrastructure

  • Real-time dashboards. Alerting tied to thresholds.
  • Weekly ops reports. Monthly trend reviews for execs.
  • Tie reports to value creation, not marketing wins.
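Threshold alerting doesn’t need heavy tooling to start; the core logic can be as simple as this sketch (the threshold values are placeholders to tune per workflow and channel):

```python
# Placeholder thresholds -- tune per workflow, channel, and region.
THRESHOLDS = {
    "error_rate_pct":  (5.0, "above"),    # alert if value rises above limit
    "containment_pct": (70.0, "below"),   # alert if value falls below limit
    "p95_latency_ms":  (1500.0, "above"),
}

def check_thresholds(metrics):
    """Return an alert message for each metric outside its threshold."""
    alerts = []
    for name, (limit, direction) in THRESHOLDS.items():
        value = metrics[name]
        if (direction == "above" and value > limit) or \
           (direction == "below" and value < limit):
            alerts.append(f"{name}={value} breached {direction} {limit}")
    return alerts

print(check_thresholds(
    {"error_rate_pct": 7.2, "containment_pct": 82.0, "p95_latency_ms": 900.0}
))  # ['error_rate_pct=7.2 breached above 5.0']
```

In production these messages would feed a pager or dashboard; the point is that every threshold in the table is explicit and reviewable.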

Feedback Loops

  • Cross-functional reviews: product, compliance, CX, ops.
  • Every high-error workflow gets an owner and a fix timeline.
  • Log every model or flow change with before/after metrics.

Risk Monitoring

  • Maintain an incident log. Track failure types and recurrence.
  • Build automated test suites. Run edge-case regression tests pre-release.
  • Use confidence metrics to throttle automation intelligently.

Versioning

  • Track model versions, knowledge and behavior versioning, prompt changes, config updates.
  • Have rollback plans. If something breaks, reverting should take minutes, not days.

Reporting

  • Executive summaries should surface impact, not volume.
  • Include performance, risks, and a point of view.
  • No number without a decision attached.

Everything listed above is table stakes for your AI strategy; anything you don’t measure might as well not exist. These metrics are how you catch failure early, prove success under scrutiny, and course-correct before customers or compliance teams notice.

Day 0 to Day 90—a deployment measurement roadmap

Generative AI agent rollouts fail most often because teams launch without a measurement plan. They ship, hope, and retroactively scramble to explain what happened. That doesn’t work in production environments. The roadmap below is what a responsible deployment looks like. It’s measured, accountable, and built to catch issues before they scale.

Week 0

  • Go live with initial scope and intents. Logging on. Surveys embedded.
  • Validate baselines against expectations.

Week 1–2

  • Monitor for latency, error spikes, and common escalations.
  • Fix easy bugs. Prioritize high-friction intents.

Week 2–4

  • Add intents. Tighten thresholds. Improve containment logic.
  • Begin sentiment + CSAT analysis.

Month 1–3

  • Compare against human baselines. Resolve or escalate any workflows below target.
  • Iterate on underperforming areas with measurable updates.
  • Lock success criteria before expanding scope.

If you can’t answer what changed between Day 1 and Day 90 with metrics, you’re not running a system, you’re running a guess. This timeline isn’t about speed but about control. You don’t get to expand scope until you can prove performance.

Making the numbers actionable

Metrics that sit on a dashboard don’t change outcomes, so unless the numbers lead to decisions, priorities, or escalations, they’re just background noise. This section details how to turn raw data into operational leverage so you can fix what’s broken, scale what works, and hold teams accountable.

Dashboards

  • Resolution, containment, CSAT, latency, escalation.
  • Drillable by intent, channel, region.

Heatmaps

  • Identify problem workflows. Prioritize by volume and cost.

Sentiment Trends

  • Visualize friction. Pair with NLU accuracy and rephrasing rates.

Progress Radar

  • Map six performance pillars against target thresholds.
  • Use this to align product and ops on where to invest.

If no one owns the metric, no one owns the problem. Build infrastructure that connects insight to action, including weekly reviews, threshold alerts, and executive visibility. Measurement isn’t the hard part. Acting on it is. And this is where most teams fail.

A final note for leaders

If your team can’t show how a generative AI agent performs, how it fails, and how it recovers (measured in real numbers, not impressions), you’re not in control of your generative AI agent; you’re exposed.

This isn’t about whether the generative AI agent sounds natural. It’s about whether it delivers, under pressure, with traceable decisions and minimal risk. That’s what earns trust from customers and from the business.

So ask the only question that really matters. When something goes wrong, how fast do we know it, and what happens next? If there’s no clear answer, you have work to do.