Let’s play out a scene that’s painfully familiar in EdTech. The Product Manager runs a classic A/B test on a new feature.

  • Version A is a simple, five-question multiple-choice quiz.
  • Version B is a more complex, scaffolded activity. It asks learners to write a short response, then self-assess against a rubric, and finally, review an ‘expert’ answer. It’s clunkier, has more steps, and is a unique component.

The results come in. Version A is the clear winner. It has a 40% higher completion rate and users spend 60% less time on the page. It’s a slam dunk. The team celebrates, high-fives over Slack, rolls out Version A, and moves on to the next ticket in the 'Feel-Good Feature' epic.

The problem? We didn't answer the only question that matters: Which version made the student learn more?

We’re all so addicted to the clean, quantitative validation of traditional software that we’ve forgotten we’re not building a CRM. We’re building for the human brain. And in EdTech, ‘Does it work?’ (like a well-oiled machine) is the wrong question.

The right question is: ‘Does it teach?’

The great validation lie: usability vs. efficacy

We’re a data-driven culture, but we’re often worshipping the wrong metrics. Let’s start by separating out two important buckets of data:

  • Usability: Is the feature easy to use? Is it fast? Is it 'intuitive'? Do users like it?
  • Efficacy: Does the feature cause a change in knowledge or behaviour? Does the learner leave smarter, more skilled, or more capable than when they arrived?

Here’s the hard truth: These two are often in direct conflict.

Another example, to illustrate: In our house, we’re big fans of edutainment. We can spend hours on YouTube watching farmfluencers building goat shelters and milking stands, while believing that what we are doing is productive because we are ‘learning’ how to apply these methods on our own farm. It’s fun, it’s easy, and it’s time well spent. 

But when we stand in the hardware shop a week later, ready to buy the materials and get going, we suddenly have no clue where to start. Why? Because we didn’t actually learn anything from those hours of content.

The most effective learning is hard. Cognitive science calls it ‘desirable difficulty’ (Bjork, 1994). 

It’s the struggle of retrieving a memory, the frustration of applying a new concept, the mental friction that actually makes learning stick. But our validation methods are designed to eliminate friction.

The usability trap: 3 common metrics we over-rely on

We tend to grab the easiest data, which is almost always the wrong data. This leads us straight into the usability trap, where we obsess over metrics that feel productive but tell us nothing about learning. Here are some things we tend to over-emphasise in our industry.

1. Completion rates (the 'did they finish?' fallacy)

This is a sugar-rush metric. It feels great to report '85% of users completed the new module!' But what does that mean? Did they 'complete' it by just clicking 'Next' five times? The easier, less-effective option will always win on completion.

If we care too much about completion rates, we're optimising for clicks, not cognition.

2. 'Smile sheets' (NPS/user feedback)

Here's an unpopular opinion of mine: learners are awful judges of what helps them learn. We want to believe the user is always right, but in learning, they often aren't.

Studies (like Soderstrom & Bjork, 2015) show that learners consistently prefer passive, easier methods (like watching videos) because they feel fluent and easy. They mistake that feeling of fluency for mastery. But 'desirable difficulty' (like forcing them to retrieve information) is what actually works, even though it feels harder and less satisfying.

When you ask 'Did you like this feature?' you are, in some ways, running a poll on which feature was the least effective.

3. Time on page (the 'engagement' myth)

This is the most meaningless metric of all.

  • A student spends 10 minutes on a page. Are they deeply engaged in a complex activity (a learning win)? Or are they hopelessly lost in a terrible UI (a usability fail)?
  • A student finishes in 30 seconds. Are they a genius who absorbed it all instantly (a learning win... maybe)? Or did they click 'Next' without reading a word (a learning fail)?

The metric is useless. Stop reporting it.

How to start validating for learning (the scrappy way)

Okay, so the standard SaaS playbook is out. What now? We can't all run multi-year, randomised controlled trials (RCTs) for every feature.

You don't have to. You just have to be more creative. You need practical, scrappy EdTech validation. Here are three methods you can start using this week.

1. The pre/post-test playbook

This is the most basic, powerful tool we have for measuring learning efficacy.

  • What it is: A simple method to measure the 'gain' in knowledge from using your feature.
  • How to do it:
    1. Before a user cohort sees your new feature, give them a short pre-test (5-10 questions) on the core concepts.
    2. Let them use the feature.
    3. Afterwards, give them a post-test.
  • Pro tip: The post-test cannot be identical to the pre-test. It must test the same concepts with different questions. This avoids test familiarity and measures true understanding, not just short-term memory.
  • What to measure: The gain score (Post-test % – Pre-test %). This is your new KPI. It moves the conversation from 'Did they finish?' to 'Did their score improve?'
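The gain-score arithmetic above is trivial, but it helps to see it as a tiny script you could run on an exported cohort. This is a minimal sketch with made-up illustrative data; the field names (`pre`, `post`) are assumptions, not a real export schema.

```python
# Hypothetical sketch: pre/post-test gain scores for a cohort.
# Scores and field names are illustrative assumptions.

def gain_score(pre_pct: float, post_pct: float) -> float:
    """Raw gain: post-test % minus pre-test %."""
    return post_pct - pre_pct

cohort = [
    {"user": "u1", "pre": 40.0, "post": 70.0},
    {"user": "u2", "pre": 55.0, "post": 60.0},
    {"user": "u3", "pre": 30.0, "post": 65.0},
]

gains = [gain_score(u["pre"], u["post"]) for u in cohort]
avg_gain = sum(gains) / len(gains)

# The average gain, not the completion rate, is the KPI.
print(f"Average gain score: {avg_gain:.1f} percentage points")
```

A refinement worth considering once this works: normalised gain ((post − pre) / (100 − pre)), which stops learners who started near the ceiling from dragging the average down.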

2. The 'teach-back' interview

This is my favourite way to blow up a useless user-interview script.

  • What it is: Using active retrieval as a qualitative assessment tool. It's built on the principle that if you can't teach something, you don't know it (Roediger & Karpicke, 2006).
  • How to do it: Get a user to test your new feature. At the end, do not ask 'What did you think?' Ask:
    'Great. Now, imagine you have to explain [The Core Concept] to a new colleague. How would you describe it?'
  • What to listen for: The 'ums,' the pauses, the fumbling for words, the use of analogies. That is your data. When they stumble, you've found a gap in your feature. When they explain it perfectly, ask, 'What part of the feature helped you understand that?' You'll pinpoint the exact moment of learning.

3. Analyse error-patterns, not just clicks

Your quiz and assessment logs are a goldmine, but you're probably looking at the wrong thing.

  • What it is: Treating wrong answers as data, not just failures.
  • How to do it: Export your quiz results. Ignore the 'percent correct' (that's a vanity metric). Instead, look at the distractors (the specific wrong answers).
  • The key question: Are learners all choosing the same wrong answer? If 60% of learners get Question 3 wrong, and 90% of them choose 'Answer B,' you haven't just found a 'gap'. You've found a misconception that your feature (or the upstream content) is actively creating or reinforcing.
  • The pay-off: This is how an LXD defends their 'inconsistent' design: 'Yes, the completion rate is 10% lower. But the data shows that the 90% who do complete it are 50% less likely to make this critical error on the final assessment. We're trading 'easy' for 'effective'.'
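The distractor analysis above takes only a few lines on a raw export. Here is a minimal sketch; the tuple layout `(question_id, chosen_answer, correct_answer)` is an assumption about your log format, and the responses are made-up illustrative data.

```python
# Hypothetical sketch: finding shared-distractor patterns in quiz logs.
# Row format (question_id, chosen, correct) is an assumed export shape.
from collections import Counter, defaultdict

responses = [
    ("Q3", "B", "C"), ("Q3", "B", "C"), ("Q3", "B", "C"),
    ("Q3", "A", "C"), ("Q3", "C", "C"),
    ("Q1", "D", "D"), ("Q1", "A", "D"),
]

wrong_by_question = defaultdict(Counter)
totals = Counter()
for qid, chosen, correct in responses:
    totals[qid] += 1
    if chosen != correct:
        wrong_by_question[qid][chosen] += 1

for qid, wrong in sorted(wrong_by_question.items()):
    distractor, count = wrong.most_common(1)[0]
    wrong_total = sum(wrong.values())
    # One dominant distractor suggests a shared misconception, not noise.
    print(f"{qid}: {wrong_total}/{totals[qid]} wrong; "
          f"{count/wrong_total:.0%} of wrong answers chose '{distractor}'")
```

When one distractor soaks up most of the wrong answers (as 'B' does for Q3 here), you are looking at a misconception to fix upstream, not a learner failure.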

How to convince your stakeholders

I know what you're thinking. 'This is all great. But my CPO just wants to see the engagement chart go up and to the right.'

This is the unruly part of the job. Your role isn't just to build features; it's to educate your own organisation.

1. Speak their language: efficacy = retention

Translate 'learning' into 'business.'

  • Usability wins the first click.
  • Efficacy wins the renewal.

If your product actually works (i.e., learners get smarter, get promotions, pass their exams), they will stay, and they will become evangelists. Efficacy is your single best long-term retention and growth strategy. It's not 'fluffy' pedagogy; it's your core business asset.

2. Create an 'efficacy dashboard'

Don't fight data with feelings. Fight bad data with better data. Build a simple dashboard with your new metrics.

  • Metric 1: Average pre/post-test gain score (by feature)
  • Metric 2: Key concept 'teach-back' success rate (qualitative)
  • Metric 3: Reduction in 'critical error patterns'

This looks as serious and data-driven as any usability dashboard, but it's 100x more valuable.

3. Run a 'validation pilot'

Don't try to boil the ocean. Pick one upcoming feature. Tell your boss: 'Let me run our usual usability test, but I'm also going to run a scrappy efficacy test in parallel. Let's compare the findings.' This is a low-risk, high-reward proposal they can't refuse.



Stop asking 'does it work?'

Validating for usability is easy. It gives you clean charts, happy stakeholders, and the illusion of progress.

Validating for learning is messy. It's qualitative, it's slower, and it often gives you uncomfortable answers. It forces you to admit that your beautiful, 'intuitive,' Dribbble-worthy feature taught absolutely nothing.

But this is the job. You’re not just a feature-pusher. You are, whether you like it or not, an educator.

So the next time you're reviewing a feature, stop asking 'Does it work?'

Start asking 'Does it teach?'

Frequently asked questions (FAQ)

Q: What's the main difference between usability and learning efficacy? A: Usability is about ease of use: Is the feature intuitive, fast, and frictionless? Efficacy is about the outcome: Does the feature cause a measurable change in the learner's knowledge or skill? The most effective learning often has more friction, not less.

Q: Can't I just use A/B testing for EdTech? A: You can, but you must measure the right thing. An A/B test that measures completion rate (a usability metric) will almost always favour the easiest, least effective option. A good EdTech A/B test would measure the post-test score between Group A and Group B.
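To make the comparison concrete, here is a minimal sketch of scoring such a test, using only the standard library. The scores are made-up illustrative data, and the rough Welch-style interval is a directional signal, not a substitute for a proper RCT.

```python
# Hypothetical sketch: comparing post-test scores (not completion rates)
# between A/B groups. Scores are illustrative assumptions.
from math import sqrt
from statistics import mean, stdev

group_a = [62, 70, 58, 65, 74, 61]  # post-test % for the quiz version
group_b = [78, 82, 71, 85, 76, 80]  # post-test % for the scaffolded version

diff = mean(group_b) - mean(group_a)
# Rough Welch standard error, for a directional read on the gap.
se = sqrt(stdev(group_a)**2 / len(group_a) + stdev(group_b)**2 / len(group_b))
print(f"Mean difference: {diff:.1f} points (roughly ±{1.96 * se:.1f})")
```

If the harder variant wins on post-test score while losing on completion rate, that is exactly the 'easy vs effective' trade the article describes.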

Q: How long should a learning validation test take? A: It can be very scrappy. You can run a 5-question pre/post-test with a cohort of 20 users. You can add 'teach-back' questions to 5-10 of your normal user interviews. The goal isn't to get a statistically perfect sample size for an academic paper; it's to get a directional signal that's stronger than a simple 'like' button.

Sources

  • Bjork, R. A. (1994). Memory and metamemory considerations in the training of human beings. In J. Metcalfe & A. Shimamura (Eds.), Metacognition: Knowing about knowing (pp. 185–205). Cambridge, MA: MIT Press.
  • Roediger, H. L., & Karpicke, J. D. (2006). Test-enhanced learning: Taking memory tests improves long-term retention. Psychological Science, 17(3), 249–255.
  • Soderstrom, N. C., & Bjork, R. A. (2015). Learning versus performance: An inconvenient truth. Psychological Science in the Public Interest, 16(1), 1–11.