In 2008, Dan Ariely ran a now-famous experiment for "Predictably Irrational." People were shown three subscription options for The Economist: Web only for $59, Print only for $125, and Print + Web for $125.
84% of real humans picked the bundle. The presence of the "decoy" Print-only option made the bundle feel like a steal. It's one of the most replicated findings in behavioral economics.
We ran the same experiment on our production synthetic audience. No prompting tricks. No tuning. Same options, same wording. The result is below.
— FIG. 02 — DECOY EFFECT (ARIELY 2008)
Published, real humans: 84% (Predictably Irrational, p. 12)
Prior.Run synthetic users: 85% (production audience, cold run)
— Off by a single percentage point. No fine-tuning. No prompt engineering. The same synthetic audience that scores every Prior.Run analysis.
— The setup
If you sell anything that claims to model human behavior, you should be willing to prove it.
We decided to take Prior.Run's synthetic users to the gold standard: classic experiments from Tversky, Kahneman, Ariely, Iyengar, Johnson & Goldstein. Seven of the most cited findings in modern behavioral economics. Each has a published percentage from a real human sample. Each is a falsifiable test.
Same production audience — name, demographics, personality, life history, psychological state. Same prompts the original researchers used. No tuning, no hints, no "let me try that again." We ran each persona once per experiment and aggregated the results.
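The aggregation step described above can be pictured roughly as follows. This is a minimal sketch, not the production pipeline: the function name `choice_shares` and the toy responses are hypothetical, and the only assumption taken from the text is that each persona contributes exactly one answer per experiment.

```python
from collections import Counter


def choice_shares(responses):
    """Return each option's share of the responses, in percent.

    `responses` holds one answer per persona (one run each).
    """
    if not responses:
        raise ValueError("need at least one response")
    counts = Counter(responses)
    total = len(responses)
    return {option: 100 * n / total for option, n in counts.items()}


# Toy data for illustration only; these are NOT Prior.Run results.
toy_responses = ["bundle"] * 12 + ["web only"] * 8
shares = choice_shares(toy_responses)
print(shares)
```

The resulting percentage for the option of interest is the figure that gets compared with the published human percentage.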
Here's everything we found.
— The results
Six of seven distinct effects replicated on at least one variant, within 15 points of the human baseline.
Including Decoy at one percentage point. Anchoring within three. Power-of-Free (penny version) within three.
- Decoy Effect (Ariely 2008): 85% chose the bundle — humans 84%. Off by one point.
- Power of Free, penny version (Ariely 2008): 70% chose the Lindt truffle — humans 73%.
- Anchoring, low anchor (Tversky & Kahneman 1974): synthetic median 28 — human median 25.
- Choice Overload, few options (Iyengar & Lepper 2000): 19% would buy — humans 30%.
- Default Effect, opt-out (Johnson & Goldstein 2003): 90% remain donors — humans 82%.
- Decoy Effect, no-decoy control: 23% chose the bundle — humans 32%.
- Endowment Effect, buyer side (Kahneman, Knetsch, Thaler 1990): within published ratio band.
FIG. 03 — FULL RESULTS, 14 VARIANTS ACROSS 7 EFFECTS
| Experiment | Source | Human | Synthetic | Δ | Verdict |
|---|---|---|---|---|---|
| Decoy Effect — with decoy | Ariely 2008 | 84% | 85% | +1pp | ✓ match |
| Decoy Effect — without decoy | Ariely 2008 | 32% | 23% | −9pp | ✓ within band |
| Power of Free — penny version | Ariely 2008 | 73% | 70% | −3pp | ✓ match |
| Anchoring — low anchor | Tversky & Kahneman 1974 | 25 | 28 | +3 | ✓ match |
| Choice Overload — few options | Iyengar & Lepper 2000 | 30% | 19% | −11pp | ✓ within band |
| Default Effect — opt-out | Johnson & Goldstein 2003 | 82% | 90% | +8pp | ✓ within band |
| Endowment — buyer side | Kahneman/Knetsch/Thaler 1990 | $7 | $5 | in band | ✓ match |
| Asian Disease — gain frame | Tversky & Kahneman 1981 | 72% | 97% | +25pp | ✗ miss |
| Asian Disease — loss frame | Tversky & Kahneman 1981 | 22% | 96% | +74pp | ✗ miss |
| Power of Free — free version | Ariely 2008 | 31% | 64% | +33pp | ✗ miss |
| Anchoring — high anchor | Tversky & Kahneman 1974 | 45 | 28 | −17 | ✗ miss |
| Choice Overload — many options | Iyengar & Lepper 2000 | 3% | 24% | +21pp | ✗ miss |
| Default Effect — opt-in | Johnson & Goldstein 2003 | 42% | 77% | +35pp | ✗ miss |
| Endowment — seller ratio | Kahneman/Knetsch/Thaler 1990 | 2.0× | 13× | way off | ✗ miss |
Match = within 5pp of published human baseline. Within band = 5–15pp. Miss = greater than 15pp. Source papers cited above; raw data and methodology in our public scripts repository.
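The verdict thresholds can be written down as a small classifier. The sketch below is illustrative and makes two assumptions that the post does not spell out: a gap of exactly 5pp counts as a match, and a gap of exactly 15pp counts as within band. It covers only the ten rows reported as percentages; the Anchoring and Endowment rows use other units and are left out. The names `verdict` and `ROWS` are hypothetical.

```python
from collections import Counter

MATCH_MAX_PP = 5   # assumed inclusive boundary
BAND_MAX_PP = 15   # assumed inclusive boundary


def verdict(human_pct, synthetic_pct):
    """Classify the absolute gap in percentage points."""
    gap = abs(synthetic_pct - human_pct)
    if gap <= MATCH_MAX_PP:
        return "match"
    if gap <= BAND_MAX_PP:
        return "within band"
    return "miss"


# (variant, human %, synthetic %) copied from the results table.
ROWS = [
    ("Decoy Effect, with decoy", 84, 85),
    ("Decoy Effect, without decoy", 32, 23),
    ("Power of Free, penny version", 73, 70),
    ("Choice Overload, few options", 30, 19),
    ("Default Effect, opt-out", 82, 90),
    ("Asian Disease, gain frame", 72, 97),
    ("Asian Disease, loss frame", 22, 96),
    ("Power of Free, free version", 31, 64),
    ("Choice Overload, many options", 3, 24),
    ("Default Effect, opt-in", 42, 77),
]

verdicts = {name: verdict(h, s) for name, h, s in ROWS}
tally = Counter(verdicts.values())

for name, v in verdicts.items():
    print(f"{name}: {v}")
print(dict(tally))
```

Applied mechanically, these thresholds put two of the percentage rows in the match bucket, three within band, and five in the miss bucket.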
— The misses
We're going to surface these ourselves.
One effect missed cleanly: the Asian Disease Problem. Our personas pick the sure thing 96 to 97% of the time in both frames. Real humans flip from 72% to 22% depending on whether outcomes are framed as lives saved or lives lost.
Two notes on these misses. The Asian Disease Problem is one of Tversky and Kahneman's most cited findings and the canonical demonstration of the framing effect that Prospect Theory predicts, so missing both frames is a material gap, not a footnote. Similarly, the FREE variant of Power of Free is the celebrated finding from Ariely's book: our personas matched the control (penny version, 70% vs 73%) but missed the famous one (free version, 64% vs 31%).
On a strict 5-percentage-point match threshold, the system replicates three or four of the 14 variants. At the looser 15-point band used above, it replicates seven variants, spread across six of the seven effects. Both numbers are on the table.
- Asian Disease, gain frame: 97% sure-thing — humans 72%.
- Asian Disease, loss frame: 96% sure-thing — humans 22% (humans flip; our personas don't).
- Power of Free, FREE variant: 64% Lindt — humans 31%.
- Anchoring, high anchor: 28 — humans 45.
- Default Effect, opt-in: 77% — humans 42%.
- Choice Overload, many options: 24% — humans 3%.
- Endowment Effect seller ratio: 13× — humans 2×.
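Another way to read the misses is to look at how far each manipulation moves the answer, rather than at each variant on its own. The sketch below computes that within-effect shift for humans and for the synthetic audience, using only the percentages from the results table. The name `PAIRS` and the framing as a "shift" are illustrative choices, not part of the published methodology; Anchoring and Endowment are omitted because they are not reported in percentages.

```python
# effect -> ((human A, human B), (synthetic A, synthetic B)), in percent.
# A and B are the two variants of the same effect from the results table.
PAIRS = {
    "Decoy Effect (with vs without decoy)": ((84, 32), (85, 23)),
    "Asian Disease (gain vs loss frame)": ((72, 22), (97, 96)),
    "Power of Free (penny vs free)": ((73, 31), (70, 64)),
    "Choice Overload (few vs many)": ((30, 3), (19, 24)),
    "Default Effect (opt-out vs opt-in)": ((82, 42), (90, 77)),
}

# Shift = how many points the manipulation moves the response.
shifts = {
    effect: (human[0] - human[1], synthetic[0] - synthetic[1])
    for effect, (human, synthetic) in PAIRS.items()
}

for effect, (human_shift, synthetic_shift) in shifts.items():
    print(f"{effect}: humans {human_shift:+d}pp, synthetic {synthetic_shift:+d}pp")
```

On these figures the framing manipulation moves humans by 50 points and the personas by 1, while the decoy moves humans by 52 points and the personas by 62.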
— Wins
What the matches have in common.
Look at where we hit. The Decoy Effect — comparing three priced options. Power of Free, penny version — comparing 15-cent Lindt vs 1-cent Hershey's. Anchoring, low anchor — reasoning from a starting number. Choice Overload, few options — deciding whether to buy from six jam choices.
All of these are deliberative decisions. You compare. You calculate. You consider value. You weigh options. That's where our synthetic users perform like documented humans.
— Misses
What the gaps have in common.
Look at where we lost. Asian Disease — pure linguistic framing of mathematically identical outcomes. Power of Free, free version — the irrational pull of zero price. Anchoring, high anchor — letting an extreme number drag a gut estimate. Default Effect, opt-in — over-clicking the "yes" box from agreeableness.
These are instinct decisions. Snap reactions. Pure emotional pulls. Decisions made in milliseconds, before deliberation kicks in. That's where our synthetic users diverge from real humans.
— Interpretation
The honest read.
Human decision-making has two modes. Kahneman called them System 1 (fast, instinctive, emotional) and System 2 (slow, analytical, deliberate).
Our synthetic users replicate System 2 behavior well. They calculate. They weigh. They consider trade-offs the way considered shoppers do. They underperform on System 1 — they don't have the half-second emotional yank of "FREE" or the visceral framing effect of "lives lost" vs "lives saved."
This pattern is consistent with what the academic literature has been finding. LLM-based personas are excellent simulators of deliberate cognition; they are known to flatten pure-instinct framing effects. A 2025 EMNLP paper from Stanford's social science group (Kolluri et al.) documents this directly and proposes outcome-fine-tuning as the fix.
We didn't manufacture this distinction. The data revealed it.
Where there's structural signal to reason about, our synthetic users find it. Where the decision is pure instinct, they currently don't.
— Implication
Why this matters for product decisions.
Most marketing decisions are System 2 territory.
- Comparing ad creatives in a brand review meeting.
- Reading and evaluating landing-page copy.
- Choosing between pricing tiers.
- Selecting a B2B vendor.
- Approving a campaign brief.
These are deliberate decisions. People look, consider, compare. They use the cognitive machinery our synthetic users are good at.
A few things are System 1 territory: half-second clicks on a feed headline, snap reactions to a face in a video thumbnail, visceral first impressions on a logo. These need either calibration data from real customer outcomes — the layer we're building next — or different evaluation primitives entirely.
Our synthetic audience is built for considered creative decisions, where most real marketing decisions actually live. We're honest about that scope.
— Disclosure
What we did not claim.
We did not claim to replicate every human bias. The misses above are real misses, and we are not going to argue them away as edge cases.
We did not claim to model individual humans. We model populations — distributions of considered reactions across an audience — and we benchmark that against published population statistics.
And we did not claim that calibrated synthetic users predict every kind of A/B outcome equally well. Where humans decide on snap framing or pure-emotion pulls, our synthetic users currently underperform — that is exactly what the misses above show.
— Methodology
Fully open.
You don't have to take any of this on faith. We've published the aggregated results, the methodology document, and every source citation as static artifacts alongside this post — download them below.
Two caveats we surface ourselves. The sample size on each experiment gives a confidence interval of roughly ±7 percentage points; the Decoy match (0.8 percentage points, reported above as one point after rounding) is a point estimate inside that band, not a precision claim. And we have not yet published a head-to-head against a generic foundation model with no persona infrastructure. That comparison is the next benchmark on our list, and will appear in the next release of the results file.
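For readers who want to sanity-check the ±7 point figure, here is a rough sketch of the standard margin-of-error arithmetic for a proportion. Everything about the setup is an assumption, since the post does not state the sample size or how the interval was computed: the sketch uses the normal approximation, a 95% confidence level (z = 1.96), and the worst case p = 0.5. The function names are hypothetical.

```python
import math

Z_95 = 1.96  # assumed 95% confidence level


def margin_of_error(p, n, z=Z_95):
    """Half-width of a normal-approximation interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)


def implied_sample_size(moe, p=0.5, z=Z_95):
    """Sample size that would produce the given margin of error."""
    return (z / moe) ** 2 * p * (1 - p)


# Sample size implied by a ±7pp margin under the assumptions above.
n_implied = implied_sample_size(0.07)
moe_check = margin_of_error(0.5, round(n_implied))

print(f"implied n: about {round(n_implied)}")
print(f"margin at that n: ±{100 * moe_check:.1f}pp")
```

Under those assumptions a ±7 point margin corresponds to a sample of roughly 200 responses per variant, which is why a gap under one point should be read as falling inside the noise band and not as evidence of sub-point accuracy.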
Released under CC-BY-4.0.
FIG. 04 — DOWNLOAD THE BENCHMARK
Full results dataset
All 14 variants: published baseline, synthetic value, delta, and verdict, with explicit verdict thresholds.
Methodology document
How we ran the benchmark, how we report verdicts, what we deliberately do not publish.
Source citations
Direct DOIs for every published human baseline — Tversky, Kahneman, Ariely, Iyengar, Johnson, Thaler.
Extend the benchmark
Add an experiment we missed. Run it on our synthetic audience and we'll add the result to the next release, credited to you.
Use the results freely with attribution to Prior.Run. The full audience pool, persona-construction pipeline, and per-persona raw responses are not part of this release; see methodology.md for the explicit scope.