[ Trusted by builders from ] Netflix, ServiceNow, Cisco, Adobe, PayPal, Amazon, Datadog, JPMorgan Chase, Dell


We tested our AI personas against Nobel-Prize psychology. One was off by a single percentage point.

A behavioral validation report. Seven classic experiments. Every result published — including the misses.

12 min read

Before you read

Prior.Run analyzes public surfaces and public commentary through synthetic personas — a directional signal, not a statistical claim about any company's customers or quality. Public quotes are real; the synthesis is ours. Brands we admire, named in good faith — analyzed, not attacked. Not legal, financial, or product advice.

In "Predictably Irrational" (2008), Dan Ariely described a now-famous experiment. People were shown three subscription options for The Economist: Web only for $59, Print only for $125, and Print + Web for $125.

84% of real humans picked the bundle. The presence of the "decoy" Print-only option made the bundle feel like a steal. It's one of the most replicated findings in behavioral economics.

We ran the same experiment on our production synthetic audience. No prompting tricks. No tuning. Same options, same wording. The result is below.

FIG. 02 — DECOY EFFECT (ARIELY 2008)

Published, real humans: 84% (Predictably Irrational, p. 12)

Prior.Run synthetic users: 85% (production audience, cold run)

Off by a single percentage point. No fine-tuning. No prompt engineering. The same synthetic audience that scores every Prior.Run analysis.

The setup

If you sell anything that claims to model human behavior, you should be willing to prove it.

We decided to take Prior.Run's synthetic users to the gold standard: classic experiments from Tversky, Kahneman, Ariely, Iyengar, Johnson & Goldstein. Seven of the most cited findings in modern behavioral economics. Each has a published percentage from a real human sample. Each is a falsifiable test.

Same production audience — name, demographics, personality, life history, psychological state. Same prompts the original researchers used. No tuning, no hints, no "let me try that again." We ran each persona once per experiment and aggregated the results.
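The aggregation step described above amounts to tallying one answer per persona and converting the tally to percentages. A minimal sketch follows, assuming each run yields a single option label; the function name `aggregate` and the toy responses are hypothetical and are not the published raw data.

```python
from collections import Counter

def aggregate(choices):
    """Turn one recorded choice per persona into percentage shares per option."""
    counts = Counter(choices)
    total = len(choices)
    return {option: 100 * n / total for option, n in counts.items()}

# Made-up toy responses for illustration only (not the benchmark's data).
responses = ["print + web"] * 7 + ["web only"] * 3
shares = aggregate(responses)

for option, share in shares.items():
    print(f"{option}: {share:.0f}%")
```

Each share can then be compared directly against the percentage published for the human sample.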

Here's everything we found.

The results

Six of seven distinct effects replicated within human baseline.

Including Decoy at one percentage point. Anchoring within three. Power-of-Free (penny version) within three.

  • Decoy Effect (Ariely 2008): 85% chose the bundle — humans 84%. Off by one point.
  • Power of Free, penny version (Ariely 2008): 70% chose the Lindt truffle — humans 73%.
  • Anchoring, low anchor (Tversky & Kahneman 1974): synthetic median 28 — human median 25.
  • Choice Overload, few options (Iyengar & Lepper 2000): 19% would buy — humans 30%.
  • Default Effect, opt-out (Johnson & Goldstein 2003): 90% remain donors — humans 82%.
  • Decoy Effect, no-decoy control: 23% chose the bundle — humans 32%.
  • Endowment Effect, buyer side (Kahneman, Knetsch, Thaler 1990): within published ratio band.

FIG. 03 — FULL RESULTS, 14 VARIANTS ACROSS 7 EFFECTS

| Experiment | Source | Human | Synthetic | Δ | Verdict |
|---|---|---|---|---|---|
| Decoy Effect, with decoy | Ariely 2008 | 84% | 85% | +1pp | ✓ match |
| Decoy Effect, without decoy | Ariely 2008 | 32% | 23% | −9pp | ✓ within band |
| Power of Free, penny version | Ariely 2008 | 73% | 70% | −3pp | ✓ match |
| Anchoring, low anchor | Tversky & Kahneman 1974 | 25 | 28 | +3 | ✓ match |
| Choice Overload, few options | Iyengar & Lepper 2000 | 30% | 19% | −11pp | ✓ within band |
| Default Effect, opt-out | Johnson & Goldstein 2003 | 82% | 90% | +8pp | ✓ within band |
| Endowment, buyer side | Kahneman/Knetsch/Thaler 1990 | $7 | $5 | in band | ✓ match |
| Asian Disease, gain frame | Tversky & Kahneman 1981 | 72% | 97% | +25pp | ✗ miss |
| Asian Disease, loss frame | Tversky & Kahneman 1981 | 22% | 96% | +74pp | ✗ miss |
| Power of Free, free version | Ariely 2008 | 31% | 64% | +33pp | ✗ miss |
| Anchoring, high anchor | Tversky & Kahneman 1974 | 45 | 28 | −17 | ✗ miss |
| Choice Overload, many options | Iyengar & Lepper 2000 | 3% | 24% | +21pp | ✗ miss |
| Default Effect, opt-in | Johnson & Goldstein 2003 | 42% | 77% | +35pp | ✗ miss |
| Endowment, seller ratio | Kahneman/Knetsch/Thaler 1990 | 2.0× | 13× | way off | ✗ miss |

Match = within 5pp of published human baseline. Within band = 5–15pp. Miss = greater than 15pp. Source papers cited above; raw data and methodology in our public scripts repository.
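The verdict rule in the legend can be written as a small function. This is a sketch under stated assumptions: the legend does not say which label a gap of exactly 5pp or 15pp receives, so the inclusive boundaries below are a guess, and the function name `verdict` is illustrative. Only rows reported as percentages are included.

```python
def verdict(human: float, synthetic: float) -> str:
    """Label a variant by its absolute gap from the human baseline, in points."""
    gap = abs(synthetic - human)
    if gap <= 5:       # assumption: exactly 5pp still counts as a match
        return "match"
    if gap <= 15:      # assumption: exactly 15pp still counts as within band
        return "within band"
    return "miss"

# (experiment, human %, synthetic %) copied from percentage rows of the table.
rows = [
    ("Decoy Effect, with decoy", 84, 85),
    ("Decoy Effect, without decoy", 32, 23),
    ("Power of Free, penny version", 73, 70),
    ("Choice Overload, few options", 30, 19),
    ("Asian Disease, gain frame", 72, 97),
    ("Power of Free, free version", 31, 64),
]

for name, human, synthetic in rows:
    print(f"{name}: {synthetic - human:+d}pp -> {verdict(human, synthetic)}")
```

Rows reported in dollars or as ratios (the Endowment variants) need their own comparison rule and are left out here.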

The misses

We're going to surface these ourselves.

One effect missed cleanly: the Asian Disease Problem. Our personas pick the sure thing almost every time in both frames (97% in the gain frame, 96% in the loss frame). Real humans flip from 72% to 22% depending on whether outcomes are framed as lives saved or lives lost.

Two notes on these misses. The Asian Disease Problem is among Tversky and Kahneman's most cited findings, the textbook demonstration of the framing effect associated with Prospect Theory, and missing both frames is a material gap, not a footnote. Similarly, the FREE variant of Power of Free is the celebrated finding from Ariely's book: our personas matched the control (penny version, 70% vs 73%) but missed the famous one (free version, 64% vs 31%).

On a strict 5-percentage-point match threshold, the system replicates three or four effects. At the looser 15-point band used above, six. Both numbers are on the table.

  • Asian Disease, gain frame: 97% sure-thing — humans 72%.
  • Asian Disease, loss frame: 96% sure-thing — humans 22% (humans flip; our personas don't).
  • Power of Free, FREE variant: 64% Lindt — humans 31%.
  • Anchoring, high anchor: 28 — humans 45.
  • Default Effect, opt-in: 77% — humans 42%.
  • Choice Overload, many options: 24% — humans 3%.
  • Endowment Effect seller ratio: 13× — humans 2×.
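Another way to read the list above is to compare how far each group moves between the two conditions of an experiment, instead of looking at each condition in isolation. The sketch below does that subtraction on the percentages from the results table. The helper name `shift` and the way conditions are paired are illustrative choices, not part of the published methodology.

```python
def shift(first: int, second: int) -> int:
    """Change in percentage points when moving from one condition to the other."""
    return second - first

# (effect, conditions, human first, human second, synthetic first, synthetic second)
# All percentages are taken from the full results table.
pairs = [
    ("Asian Disease", "gain -> loss frame", 72, 22, 97, 96),
    ("Decoy Effect", "without -> with decoy", 32, 84, 23, 85),
    ("Power of Free", "penny -> free version", 73, 31, 70, 64),
    ("Default Effect", "opt-in -> opt-out", 42, 82, 77, 90),
    ("Choice Overload", "few -> many options", 30, 3, 19, 24),
]

for effect, label, h_a, h_b, s_a, s_b in pairs:
    print(f"{effect} ({label}): humans {shift(h_a, h_b):+d}pp, "
          f"synthetic {shift(s_a, s_b):+d}pp")
```

On the Asian Disease rows this gives a 50-point swing for humans against a 1-point swing for the personas, which is the flip described above.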

Wins

What the matches have in common.

Look at where we hit. The Decoy Effect — comparing three priced options. Power of Free, penny version — comparing 15-cent Lindt vs 1-cent Hershey's. Anchoring, low anchor — reasoning from a starting number. Choice Overload, few options — deciding whether to buy from six jam choices.

All of these are deliberative decisions. You compare. You calculate. You consider value. You weigh options. That's where our synthetic users perform like documented humans.

Misses

What the gaps have in common.

Look at where we lost. Asian Disease — pure linguistic framing of mathematically identical outcomes. Power of Free, free version — the irrational pull of zero price. Anchoring, high anchor — letting an extreme number drag a gut estimate. Default Effect, opt-in — over-clicking the "yes" box from agreeableness.

These are instinct decisions. Snap reactions. Pure emotional pulls. Decisions made in milliseconds, before deliberation kicks in. That's where our synthetic users diverge from real humans.

Interpretation

The honest read.

Human decision-making has two modes. Kahneman called them System 1 (fast, instinctive, emotional) and System 2 (slow, analytical, deliberate).

Our synthetic users replicate System 2 behavior well. They calculate. They weigh. They consider trade-offs the way considered shoppers do. They underperform on System 1 — they don't have the half-second emotional yank of "FREE" or the visceral framing effect of "lives lost" vs "lives saved."

This pattern is consistent with what the academic literature has been finding. LLM-based personas are excellent simulators of deliberate cognition; they are known to flatten pure-instinct framing effects. A 2025 EMNLP paper from Stanford's social science group (Kolluri et al.) documents this directly and proposes outcome-fine-tuning as the fix.

We didn't manufacture this distinction. The data revealed it.

Where there's structural signal to reason about, our synthetic users find it. Where the decision is pure instinct, they currently don't.

Implication

Why this matters for product decisions.

Most marketing decisions are System 2 territory.

  • Comparing ad creatives in a brand review meeting.
  • Reading and evaluating landing-page copy.
  • Choosing between pricing tiers.
  • Selecting a B2B vendor.
  • Approving a campaign brief.

These are deliberate decisions. People look, consider, compare. They use the cognitive machinery our synthetic users are good at.

A few things are System 1 territory: half-second clicks on a feed headline, snap reactions to a face in a video thumbnail, visceral first impressions on a logo. These need either calibration data from real customer outcomes — the layer we're building next — or different evaluation primitives entirely.

Our synthetic audience is built for considered creative decisions, where most real marketing decisions actually live. We're honest about that scope.

Disclosure

What we did not claim.

We did not claim to replicate every human bias. The misses above are real misses, and we are not going to argue them away as edge cases.

We did not claim to model individual humans. We model populations — distributions of considered reactions across an audience — and we benchmark that against published population statistics.

And we did not claim that calibrated synthetic users predict every kind of A/B outcome equally well. Where humans decide on snap framing or pure-emotion pulls, our synthetic users currently underperform — that is exactly what the misses above show.

Methodology

Fully open.

You don't have to take any of this on faith. We've published the aggregated results, the methodology document, and every source citation as static artifacts alongside this post — download them below.

Two caveats we surface ourselves. The sample size on each experiment gives a confidence interval of roughly ±7 percentage points; the Decoy match at 0.8 percentage points is a point estimate inside that band, not a precision claim. And we have not yet published a head-to-head against a generic foundation model with no persona infrastructure — that comparison is the next benchmark on our list, and will appear in the next release of the results file.
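For readers who want to sanity-check the ±7 figure, the usual normal-approximation margin of error for a proportion is z·sqrt(p(1−p)/n). The sketch below assumes that formula and a 95% level (z = 1.96). The post does not state the per-experiment sample size, so the values of `n` used here are hypothetical.

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation interval for a proportion, in points."""
    return 100 * z * math.sqrt(p * (1 - p) / n)

# Hypothetical sample sizes; the real n is not given in the post.
for n in (100, 200, 400):
    worst_case = margin_of_error(0.5, n)   # p = 0.5 gives the widest interval
    at_85 = margin_of_error(0.85, n)       # a share near the Decoy result
    print(f"n={n}: ±{worst_case:.1f}pp at p=0.50, ±{at_85:.1f}pp at p=0.85")
```

Under these assumptions a one-point gap sits well inside the interval at any of these sample sizes, which matches the caveat that the Decoy result is a point estimate and not a precision claim.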

Released under CC-BY-4.0.

Topics: synthetic users, behavioral economics, validation, benchmark, Ariely, Kahneman
