Is This Actually Getting Better? A Three-Layer Approach to Measuring AI Products

Main Conference, Ballroom 2, Saturday at 2:35pm - 3:05pm

Many teams building AI products focus on parsing new context and editing prompts without knowing whether things are improving. They then learn that even a perfect model won't drive success if the product isn't easy to use, or if people abandon it after one attempt. Getting the model right is necessary but not sufficient.

This talk is about the measurement gap between a model that performs well in offline evaluation and an AI product that actually benefits users. Using a real-world case study, I'll walk through how we built three layers of measurement that gave us genuine confidence in product decisions and shaped team priorities.

These three layers - offline model evaluation, online usage metrics, and qualitative user feedback - don't always agree. I'll share why human judgement is essential to make key decisions when navigating conflicting signals, and how to do this with confidence.

If you're building AI products and can't answer "is this actually getting better?" - this talk is for you.

Teams building AI products often cycle through many ideas and iterations (parsing new context, editing prompts) without knowing whether things are improving overall, or whether fixing one issue is creating another.

Even a perfect model doesn’t guarantee success. The product still needs to meet core user needs: it must be easy to use and reliable. Without that, users lose trust and abandon it after one attempt.

Using a real case study from Atlassian, I’ll walk through how we built three layers of measurement that gave us genuine confidence in product decisions and shaped team priorities:

Offline evaluation: How we defined an AI evaluation metric to tell whether outputs were of high enough quality, and whether prompt changes were genuinely improving results or just shifting them
Online usage metrics: How we tracked drop-off through the product experience, whether users completed their goal successfully, and whether they returned
Qualitative signals: How user interviews validated — and sometimes contradicted — what the quantitative data told us

The honest part: these metrics don’t always agree and human judgement is needed. I’ll share how we navigated conflicting signals to make a deliberate, defensible release decision, and what that taught us about building AI products that ship with confidence.

If you’re building AI products and can’t answer “is this actually getting better?”, this talk is for you.

Wesley Volz

Wes is the founder of EmeraldRock Solutions, a data and AI consultancy helping organisations cut through the hype and build things that actually work. He has spent over 12 years working across technology, retail, banking and media — delivering everything from urgent C-suite analyses to enterprise-scale ML implementations.

At Atlassian, Wes led multiple data science teams across Product-led Growth and Atlassian’s DevOps suite: Bitbucket, Compass and DevAI. His work spanned strategic analyses shaping product direction, Product Market Fit, and A/B testing. Most recently he led the team responsible for measuring RovoDev, Atlassian’s AI coding agent, developing the three-layer measurement framework he’ll share today.

Before that, at Quantium, he held a number of roles, one of which involved rolling out an ML-based forecasting solution to Walmart, the world’s largest retailer. Prior to that he helped train algorithms to personalise email offers for the Woolworths Group, and built custom audience segmentations for advertisers.

Wes grew up on a sheep and cattle farm in Western Queensland and has lived and worked across multiple countries. He brings the same practical approach to data problems that rural life teaches you to bring to everything else. He’s driven by a simple motto: leave things in a better state than you found them.
