User Testing

Testing with real users, analysed with rigour and weighted by ICP fit. This is the step that separates opinion from evidence — and the reason the sprint exists.


Two Modes, One Purpose

Every Zebra Sprint ends in user testing. The mode depends on the engagement:

|  | In-Person Testing | Async Testing (Zebra User Testing Tool) |
|---|---|---|
| When | Complex products, ICP discovery needed | Clear flows, usability validation |
| Sessions | 30-45 min, facilitator-led | 5-15 min, self-guided |
| Depth | Deep — emotional reactions, follow-up questions, ICP weighting | Broad — natural environment, no facilitator bias |
| Scheduling | Coordinated sessions during sprint | Participants test on their own time |
| Best for | Sprint questions that need probing ("Can they articulate the difference?") | Sprint questions with observable pass/fail ("Can they complete the task?") |

Both modes follow the same principle: 5 users find 85% of usability issues (Nielsen's rule). You don't need dozens. You need 5 focused sessions.
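
The 85% figure comes from Nielsen and Landauer's problem-discovery model: the proportion of usability problems found by n testers is roughly 1 - (1 - L)^n, where L ≈ 0.31 is the average probability that a single tester uncovers a given problem. At n = 5 that works out to about 1 - 0.69^5 ≈ 0.84.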

Both modes produce the same output: pass/fail answers to sprint questions, backed by evidence.


Async Testing with Zebra User Testing Tool

zebradesign.io/user-testing

Create a test in 2 minutes. Share a link. Get video recordings with transcripts. No scheduling, no facilitator, no coordination overhead.

Async testing is the default for most Zebra Sprints. It removes the biggest friction in user testing: getting 5 people in the same room (or call) within a sprint timeline.

How It Works

1. Create your test — Name, product URL, 1-2 sentences of context. Done.

2. Write 3-5 tasks — Each task starts with an action verb, has a clear endpoint, takes 1-3 minutes:

  • "Find the pricing page and tell us which tier fits your needs"
  • "Sign up for a free account and complete your profile"
  • "Try to export your data as a CSV file"

Keep it under 7 tasks. Fatigue kills honest feedback.

3. Share the link — Each test gets a unique URL. Send it to target users. No signup required for testers — they click, consent, record, done.

4. Review results — Video recordings, AI-generated transcripts, and timestamps for each task. Watch what they do, hear what they think.
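
Under the hood, a test is just a small structured object: a name, a line or two of context, the product URL, and an ordered list of tasks. A minimal sketch of that shape, with field names that are illustrative assumptions rather than the tool's actual API:

```typescript
// Illustrative shape of an async test definition.
// Field names are assumptions for this sketch, not the tool's actual API.
interface TestTask {
  prompt: string;          // starts with an action verb and has a clear endpoint
  expectedMinutes: number; // each task should take 1-3 minutes
}

interface AsyncTest {
  name: string;       // short, descriptive test name
  productUrl: string; // the live product or prototype under test
  context: string;    // 1-2 sentences of context shown before the tasks
  tasks: TestTask[];  // 3-5 tasks, never more than 7
}

const pricingTest: AsyncTest = {
  name: "Pricing comprehension",
  productUrl: "https://example.com",
  context: "You are evaluating this product for a team of five.",
  tasks: [
    { prompt: "Find the pricing page and tell us which tier fits your needs", expectedMinutes: 2 },
    { prompt: "Sign up for a free account and complete your profile", expectedMinutes: 3 },
    { prompt: "Try to export your data as a CSV file", expectedMinutes: 2 },
  ],
};
```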

Why Async Works

  • Natural environment — Testers use their own devices, their own setup. More realistic than a screen share with a stranger watching.
  • No scheduling friction — Timezone differences, calendar conflicts, last-minute cancellations — all eliminated.
  • More candid feedback — Without a facilitator present, people are less polite and more honest.
  • Faster turnaround — Send 5 links in the morning, have results by end of day.

The Async Gap

After the build days, the client sends the test link when ready. Users complete tests over days or weeks. The iteration day is scheduled once results arrive.

Neither party is blocked. The builder isn't waiting. The client isn't rushed.


In-Person Testing

In-person testing is reserved for engagements where the sprint questions need depth that self-guided testing can't provide. ICP discovery, emotional reactions, follow-up probing — these require a facilitator in the room (or on the call).

When to Choose In-Person

  • The ICP is undefined or debated — you need to weight feedback by user fit
  • Sprint questions require "why" answers, not just "can they"
  • The product is complex enough that context-setting needs a human
  • You're testing trust, perception, or differentiation — not just usability

Session Structure (30-45 min)

Intro (5 min):

  • "We're testing the product, not you — no wrong answers"
  • "Think out loud — say what you're noticing, feeling, confused by"
  • "Be honest — criticism helps us, politeness doesn't"
  • Ask permission to record

Task Flow (20-30 min): Walk through the prototype with specific tasks. At each screen:

  • Ask: "What do you think this is?" / "What would you do here?"
  • Observe: Where they hesitate, what they click first, emotional reactions
  • Don't lead: Let them struggle. Silence is okay.

Key Questions (10 min):

  • "What did you like? Why?"
  • "What did you not like? Why?"
  • "If you had a magic wand, what would you change?"
  • "How disappointed would you be if this product doesn't get built?" (most important — skip to this if short on time)

Debrief (5 min):

  • Any final thoughts
  • What stood out most

Capture Data Immediately

After each session, record:

  1. Top 3 insights (what surprised you)
  2. Biggest friction point
  3. Strongest positive reaction
  4. Quote of the session
  5. Score: Would they use it? (1-10)
  6. ICP fit assessment (ideal / close / partial / poor)
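
Captured the same way every time, those six items become one comparable record per session. A sketch of what that record could look like, with the structure assumed for illustration:

```typescript
// One debrief record per session, written up immediately after the call.
// The structure is illustrative, not a prescribed format.
type IcpFit = "ideal" | "close" | "partial" | "poor";

interface SessionDebrief {
  tester: string;
  topInsights: [string, string, string]; // the three things that surprised you
  biggestFriction: string;
  strongestPositive: string;
  quoteOfTheSession: string;
  wouldUseScore: number; // 1-10: would they use it?
  icpFit: IcpFit;
}
```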

Planning the Test (Both Modes)

1. Map Sprint Questions to Screens

Every test starts here. Each sprint question maps to specific screens and observable behaviours:

| Sprint Question | Screen to Test | What to Observe |
|---|---|---|
| Can users differentiate from competitors? | Landing, Processing | Can they articulate the difference? |
| Can users trust the output? | Report, Evidence | Do they click through to sources? Would they share? |
| Can users navigate the hierarchy? | Report Overview → Detail | Do they get lost? Can they return? |
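
If it helps to keep the plan machine-readable alongside the prototype, the same mapping can live as one record per sprint question. A rough sketch reusing the rows above; the structure is an assumption, not a prescribed format:

```typescript
// One record per sprint question: which screens exercise it and what counts as evidence.
// Illustrative structure only.
interface QuestionMapping {
  question: string;  // the sprint question being tested
  screens: string[]; // prototype screens that exercise it
  observe: string;   // the observable behaviour that counts as evidence
}

const testPlan: QuestionMapping[] = [
  {
    question: "Can users differentiate from competitors?",
    screens: ["Landing", "Processing"],
    observe: "Can they articulate the difference in their own words?",
  },
  {
    question: "Can users trust the output?",
    screens: ["Report", "Evidence"],
    observe: "Do they click through to sources? Would they share the report?",
  },
];
```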

2. Recruit Testers (Client's Responsibility)

Who to recruit:

  • Target users who match the ICP as closely as possible
  • Mix of sophistication levels — some power users, some newcomers
  • People who will give honest feedback (not friends or investors trying to be polite)

ICP weighting: Not all feedback is equal. A target customer's confusion matters more than a non-ICP user's praise. Track ICP fit for each tester and weight findings accordingly.

3. Analyse Results

Sprint Questions Scorecard:

| Question | Pass Criteria | User 1 | User 2 | User 3 | User 4 | User 5 |
|---|---|---|---|---|---|---|
| Q1 | Can articulate difference |  |  |  |  |  |
| Q2 | Would share + clicked evidence |  |  |  |  |  |
| Q3 | Moved through hierarchy |  |  |  |  |  |

Pass/fail per sprint question — not overall satisfaction. Each question gets a clear answer.

Pattern extraction:

  • What's unanimous? (5/5 users = strong signal, act on it)
  • What's split? (Explore the variable — often ICP fit)
  • What contradicts? (Investigate — often different user types want different things)

ICP-weighted analysis: If 3/5 users pass Q1, but both failures are non-ICP users, that's a different finding than 3 random passes. Weight by ICP fit.
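
One way to make that weighting mechanical is to assign each ICP fit level a weight, score each tester's pass/fail against it, and compare the weighted pass rate with the raw one. A sketch of that calculation; the specific weights are assumptions to tune per engagement:

```typescript
// ICP-weighted pass rate for one sprint question.
// The weights are illustrative assumptions, not a fixed rule.
type IcpFit = "ideal" | "close" | "partial" | "poor";

const ICP_WEIGHT: Record<IcpFit, number> = {
  ideal: 1.0,
  close: 0.75,
  partial: 0.5,
  poor: 0.25,
};

interface TesterResult {
  tester: string;
  icpFit: IcpFit;
  passed: boolean;
}

function passRates(results: TesterResult[]): { raw: number; weighted: number } {
  const raw = results.filter(r => r.passed).length / results.length;
  const totalWeight = results.reduce((sum, r) => sum + ICP_WEIGHT[r.icpFit], 0);
  const passedWeight = results
    .filter(r => r.passed)
    .reduce((sum, r) => sum + ICP_WEIGHT[r.icpFit], 0);
  return { raw, weighted: passedWeight / totalWeight };
}

// The scenario above: 3/5 pass, but both failures are non-ICP testers.
const q1Results: TesterResult[] = [
  { tester: "User 1", icpFit: "ideal", passed: true },
  { tester: "User 2", icpFit: "ideal", passed: true },
  { tester: "User 3", icpFit: "close", passed: true },
  { tester: "User 4", icpFit: "poor", passed: false },
  { tester: "User 5", icpFit: "poor", passed: false },
];

console.log(passRates(q1Results)); // { raw: 0.6, weighted: ~0.85 }: a stronger signal than a flat 3/5
```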


Common Patterns

"Evidence works when found" — Trust mechanisms work, but users can't discover them. Fix: make sources visible on every data point, not hidden behind interactions.

"Context is the variable" — Users WITH prior context see the value. Users WITHOUT context don't. This is a sales problem, not a product problem.

"Simplify the entry point" — Complex setup flows confuse users. Default to presets, offer customisation as optional.


Locked Decisions

| Decision | Why |
|---|---|
| 5 users per round (Nielsen's rule) | 5 users find 85% of issues. More adds diminishing returns. |
| Async as default, in-person when depth needed | Async removes scheduling friction. In-person reserved for ICP discovery and depth. |
| ICP-weighted analysis | Non-ICP feedback can mislead. Weight by how well the tester matches the target customer. |
| Pass/fail per sprint question | Forces clear answers. "It went okay" is not a finding. |
| Async gap between build and testing | Neither party blocked. Client recruits on their timeline. |

Research Tech Example (In-Person)

The first engagement used in-person testing because the ICP was undefined and sprint questions required depth probing.

Sprint 2 (19 January 2026):

  • 5 sessions, 30-45 min each
  • Q1 (Differentiation): 3/5 pass; Q2 (Trust): 2/5 at first, 5/5 once the evidence was found; Q3 (Navigation): 5/5 pass
  • Key finding: "Chat reduces trust" — users with sophisticated use cases (VCs) wanted direct source access, not chat-mediated answers. 5/9 users across both rounds preferred direct source access.
  • ICP insight: Maciej (VC, data-focused) was the ideal customer. His feedback: "The report is not the product — I want structured format, then I need to go to source material."
  • Session duration halved between rounds (60 min → 25-30 min) — script improved, prototype improved

Sprint 3 (January 2026):

  • 4 additional sessions with 3 ICP-fit users
  • Report categories finding: 3 ICP users requested domain-based organisation (Team, Market, Product, Financials) instead of signal-type organisation (Risks, Conflicts, Opportunities)
  • Strongest quote: "I've done the due diligence, I've highlighted the things I wanted to know about the team, technical aspects, market — and I don't see anything related to it."
  • This led to a complete report restructure in the final iteration day

Why in-person was right here: The "chat reduces trust" finding came from observing emotional reactions and asking follow-up questions — something async testing couldn't have captured. When the ICP is undefined, in-person testing is worth the scheduling overhead.

