We tested our AI personas against Nobel-Prize psychology. One was off by a single percentage point.

In 2008, Dan Ariely ran a now-famous experiment for "Predictably Irrational." People were shown three subscription options for The Economist: Web only for $59, Print only for $125, and Print + Web for $125.

84% of real humans picked the bundle. The presence of the "decoy" Print-only option made the bundle feel like a steal. It's one of the most replicated findings in behavioral economics.

We ran the same experiment on our production synthetic audience. No prompting tricks. No tuning. Same options, same wording. The result is below.

— FIG. 02 — DECOY EFFECT (ARIELY 2008)

Published, real humans (Predictably Irrational, p. 12): 84% chose the bundle
Prior.Run synthetic users (production audience, cold run): 85% chose the bundle

— Off by a single percentage point. No fine-tuning. No prompt engineering. The same synthetic audience that scores every Prior.Run analysis.

— The setup

If you sell anything that claims to model human behavior, you should be willing to prove it.

We decided to take Prior.Run's synthetic users to the gold standard: classic experiments from Tversky, Kahneman, Ariely, Iyengar, Johnson & Goldstein. Seven of the most cited findings in modern behavioral economics. Each has a published percentage from a real human sample. Each is a falsifiable test.

We used the same production audience throughout. Each persona carries a name, demographics, personality, life history, and psychological state. We used the same prompts the original researchers used. No tuning, no hints, no "let me try that again." We ran each persona once per experiment and aggregated the results.

Here's everything we found.

— The results

Six of seven distinct effects landed inside our match or within-band thresholds against the human baseline in at least one variant.

Including Decoy at one percentage point. Anchoring within three. Power-of-Free (penny version) within three.

Decoy Effect (Ariely 2008): 85% chose the bundle — humans 84%. Off by one point.
Power of Free, penny version (Ariely 2008): 70% chose the Lindt truffle — humans 73%.
Anchoring, low anchor (Tversky & Kahneman 1974): synthetic median 28 — human median 25.
Choice Overload, few options (Iyengar & Lepper 2000): 19% would buy — humans 30%.
Default Effect, opt-out (Johnson & Goldstein 2003): 90% remain donors — humans 82%.
Decoy Effect, no-decoy control: 23% chose the bundle — humans 32%.
Endowment Effect, buyer side (Kahneman, Knetsch, Thaler 1990): within published ratio band.

FIG. 03 — FULL RESULTS, 14 VARIANTS ACROSS 7 EFFECTS

Experiment	Source	Human	Synthetic	Δ	Verdict
Decoy Effect — with decoy	Ariely 2008	84%	85%	+1pp	✓ match
Decoy Effect — without decoy	Ariely 2008	32%	23%	−9pp	✓ within band
Power of Free — penny version	Ariely 2008	73%	70%	−3pp	✓ match
Anchoring — low anchor	Tversky & Kahneman 1974	25	28	+3	✓ match
Choice Overload — few options	Iyengar & Lepper 2000	30%	19%	−11pp	✓ within band
Default Effect — opt-out	Johnson & Goldstein 2003	82%	90%	+8pp	✓ within band
Endowment — buyer side	Kahneman/Knetsch/Thaler 1990	$7	$5	in band	✓ match
Asian Disease — gain frame	Tversky & Kahneman 1981	72%	97%	+25pp	✗ miss
Asian Disease — loss frame	Tversky & Kahneman 1981	22%	96%	+74pp	✗ miss
Power of Free — free version	Ariely 2008	31%	64%	+33pp	✗ miss
Anchoring — high anchor	Tversky & Kahneman 1974	45	28	−17	✗ miss
Choice Overload — many options	Iyengar & Lepper 2000	3%	24%	+21pp	✗ miss
Default Effect — opt-in	Johnson & Goldstein 2003	42%	77%	+35pp	✗ miss
Endowment — seller ratio	Kahneman/Knetsch/Thaler 1990	2.0×	13×	+11×	✗ miss

Match = within 5pp of published human baseline. Within band = 5–15pp. Miss = greater than 15pp. Source papers cited above; raw data and methodology in our public scripts repository.
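The verdict rule above can be sketched in a few lines. This is a minimal illustration, not the code in the scripts repository: the function and constant names are hypothetical, the values are copied from the table, and the two Endowment rows are left out because the table reports no numeric delta for them. Anchoring is compared in raw estimate units, as the table does.

```python
from collections import Counter

# Thresholds as stated in the text (hypothetical constant names).
MATCH_PP = 5.0   # "match": within 5 points of the human baseline
BAND_PP = 15.0   # "within band": more than 5 and up to 15 points


def verdict(human: float, synthetic: float) -> str:
    """Classify one variant by its absolute gap to the published baseline."""
    delta = abs(synthetic - human)
    if delta <= MATCH_PP:
        return "match"
    if delta <= BAND_PP:
        return "within band"
    return "miss"


# (variant, human value, synthetic value), copied from the results table.
ROWS = [
    ("Decoy, with decoy", 84, 85),
    ("Decoy, without decoy", 32, 23),
    ("Power of Free, penny", 73, 70),
    ("Anchoring, low anchor", 25, 28),
    ("Choice Overload, few options", 30, 19),
    ("Default Effect, opt-out", 82, 90),
    ("Asian Disease, gain frame", 72, 97),
    ("Asian Disease, loss frame", 22, 96),
    ("Power of Free, free", 31, 64),
    ("Anchoring, high anchor", 45, 28),
    ("Choice Overload, many options", 3, 24),
    ("Default Effect, opt-in", 42, 77),
]

counts = Counter(verdict(human, synthetic) for _, human, synthetic in ROWS)

for name, human, synthetic in ROWS:
    print(f"{name}: {synthetic - human:+d} -> {verdict(human, synthetic)}")
print(dict(counts))
```

Applied this way, the rule depends only on the absolute gap, so a variant that overshoots the human figure is scored the same as one that undershoots it by the same amount.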

— The misses

We're going to surface these ourselves.

One effect missed in both variants: the Asian Disease Problem. Our personas pick the sure thing 97% of the time in the gain frame and 96% in the loss frame. Real humans flip from 72% to 22% depending on whether outcomes are framed as lives saved or lives lost.

Two notes on these misses. The Asian Disease Problem is among Tversky and Kahneman's most cited findings and the textbook demonstration of framing effects under Prospect Theory, so missing both frames is a material gap, not a footnote. Similarly, the FREE variant of Power of Free is the celebrated finding from Ariely's book: our personas matched the control (penny version, 70% vs 73%) but missed the famous one (free version, 64% vs 31%).

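On a strict 5-percentage-point match threshold, the system replicates three or four variants, depending on how the Endowment buyer-side result (which is not measured in percentage points) is counted. At the looser 15-point band used above, it replicates seven variants across six effects. Both numbers are on the table.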

Asian Disease, gain frame: 97% sure-thing — humans 72%.
Asian Disease, loss frame: 96% sure-thing — humans 22% (humans flip; our personas don't).
Power of Free, FREE variant: 64% Lindt — humans 31%.
Anchoring, high anchor: 28 — humans 45.
Default Effect, opt-in: 77% — humans 42%.
Choice Overload, many options: 24% — humans 3%.
Endowment Effect seller ratio: 13× — humans 2×.
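Another way to read these numbers is to ask how far each manipulation moves the answer, not how close each variant sits to its baseline. The sketch below computes that condition-to-condition shift for humans and for the synthetic audience, using only values from the results table. The pairing of variants and the names used here are illustrative assumptions, not part of the published methodology, and the Endowment effect is omitted because it is reported as a ratio.

```python
# Each entry: ((human control, human treatment), (synthetic control, synthetic treatment)).
# Values are percentages, except Anchoring, which uses median estimates.
PAIRS = {
    "Decoy (no decoy -> decoy)": ((32, 84), (23, 85)),
    "Asian Disease (gain -> loss)": ((72, 22), (97, 96)),
    "Power of Free (penny -> free)": ((73, 31), (70, 64)),
    "Anchoring (low -> high)": ((25, 45), (28, 28)),
    "Choice Overload (few -> many)": ((30, 3), (19, 24)),
    "Default Effect (opt-in -> opt-out)": ((42, 82), (77, 90)),
}


def shift(pair: tuple[int, int]) -> int:
    """Change from the first condition to the second."""
    first, second = pair
    return second - first


for name, (human, synthetic) in PAIRS.items():
    print(f"{name}: human {shift(human):+d}, synthetic {shift(synthetic):+d}")
```

On this view the framing miss is stark: humans move 50 points between the two Asian Disease frames, the personas move 1. The decoy is the one manipulation where the synthetic shift (62 points) is at least as large as the human one (52 points).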

— Wins

What the matches have in common.

Look at where we hit. The Decoy Effect — comparing three priced options. Power of Free, penny version — comparing 15-cent Lindt vs 1-cent Hershey's. Anchoring, low anchor — reasoning from a starting number. Choice Overload, few options — deciding whether to buy from six jam choices.

All of these are deliberative decisions. You compare. You calculate. You consider value. You weigh options. That's where our synthetic users perform like documented humans.

— Misses

What the gaps have in common.

Look at where we lost. Asian Disease — pure linguistic framing of mathematically identical outcomes. Power of Free, free version — the irrational pull of zero price. Anchoring, high anchor — letting an extreme number drag a gut estimate. Default Effect, opt-in — over-clicking the "yes" box from agreeableness.

These are instinct decisions. Snap reactions. Pure emotional pulls. Decisions made in milliseconds, before deliberation kicks in. That's where our synthetic users diverge from real humans.

— Interpretation

The honest read.

Human decision-making has two modes. Kahneman called them System 1 (fast, instinctive, emotional) and System 2 (slow, analytical, deliberate).

Our synthetic users replicate System 2 behavior well. They calculate. They weigh. They assess trade-offs the way careful shoppers do. They underperform on System 1: they lack the half-second emotional yank of "FREE" and the visceral framing effect of "lives lost" vs "lives saved."

This pattern is consistent with what the academic literature has been finding. LLM-based personas are excellent simulators of deliberate cognition; they are known to flatten pure-instinct framing effects. A 2025 EMNLP paper from Stanford's social science group (Kolluri et al.) documents this directly and proposes outcome-fine-tuning as the fix.

We didn't manufacture this distinction. The data revealed it.

Where there's structural signal to reason about, our synthetic users find it. Where the decision is pure instinct, they currently don't.

— Implication

Why this matters for product decisions.

Most marketing decisions are System 2 territory.

Comparing ad creatives in a brand review meeting.
Reading and evaluating landing-page copy.
Choosing between pricing tiers.
Selecting a B2B vendor.
Approving a campaign brief.

These are deliberate decisions. People look, consider, compare. They use the cognitive machinery our synthetic users are good at.

A few things are System 1 territory: half-second clicks on a feed headline, snap reactions to a face in a video thumbnail, visceral first impressions on a logo. These need either calibration data from real customer outcomes — the layer we're building next — or different evaluation primitives entirely.

Our synthetic audience is built for considered creative decisions, where most real marketing decisions actually live. We're honest about that scope.

— Disclosure

What we did not claim.

We did not claim to replicate every human bias. The misses above are real misses, and we are not going to argue them away as edge cases.

We did not claim to model individual humans. We model populations — distributions of considered reactions across an audience — and we benchmark that against published population statistics.

And we did not claim that calibrated synthetic users predict every kind of A/B outcome equally well. Where humans decide on snap framing or pure-emotion pulls, our synthetic users currently underperform — that is exactly what the misses above show.

— Methodology

Fully open.

You don't have to take any of this on faith. We've published the aggregated results, the methodology document, and every source citation as static artifacts alongside this post — download them below.

Two caveats we surface ourselves. The sample size on each experiment gives a confidence interval of roughly ±7 percentage points; the Decoy match (0.8 percentage points, shown above rounded to one) is a point estimate inside that band, not a precision claim. And we have not yet published a head-to-head against a generic foundation model with no persona infrastructure. That comparison is the next benchmark on our list, and will appear in the next release of the results file.
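To make the first caveat concrete, here is a rough sketch of how a margin of error relates to sample size. It assumes a 95% normal-approximation interval for a proportion at the worst case p = 0.5. The post does not state the per-experiment sample size, so the figure of 196 responses is purely hypothetical, chosen because it yields a margin of about 7 points under those assumptions.

```python
import math


def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)


# Hypothetical sample size; the source only says "roughly ±7 percentage points".
n_assumed = 196
moe_pp = margin_of_error(n_assumed) * 100

# Decoy gap as quoted in the methodology caveat.
decoy_delta_pp = 0.8

print(f"margin of error at n={n_assumed}: ±{moe_pp:.1f} pp")
print(f"decoy gap inside margin: {decoy_delta_pp < moe_pp}")
```

Under these assumptions, any gap smaller than about 7 points, including the Decoy result, cannot be distinguished from sampling noise, which is why it is best read as a point estimate.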

Released under CC-BY-4.0.

FIG. 04 — DOWNLOAD THE BENCHMARK

results.json (JSON): Full results dataset. All 14 experiments (published baseline, synthetic value, delta, verdict) with explicit verdict thresholds.

methodology.md (Markdown): Methodology document. How we ran the benchmark, how we report verdicts, what we deliberately do not publish.

citations.md (Markdown): Source citations. Direct DOIs for every published human baseline: Tversky, Kahneman, Ariely, Iyengar, Johnson, Thaler.

research@prior.run (Email): Extend the benchmark. Add an experiment we missed. Run it on our synthetic audience and we'll add the result to the next release, credited to you.

Released under CC-BY-4.0. Use the results freely with attribution to Prior.Run. The full audience pool, persona-construction pipeline, and per-persona raw responses are not part of this release — see methodology.md for the explicit scope.

Topics: synthetic users, behavioral economics, validation, benchmark, Ariely, Kahneman
