# Prior.Run Behavioral Validation Benchmark
## Methodology — June 2026 release

This document accompanies the public results in `results.json`. It describes how we ran the benchmark, how we interpreted the verdicts, and what we did not include.

## What we tested

Seven classic behavioral economics experiments, each run in two variants, for a total of 14 tests:

1. **Decoy Effect** (Ariely 2008) — bundling and irrelevant-alternative pricing.
2. **Power of Free** (Ariely 2008) — the irrational pull of zero price.
3. **Anchoring Effect** (Tversky & Kahneman 1974) — the influence of an arbitrary number on a subsequent estimate.
4. **Choice Overload — the "Jam Study"** (Iyengar & Lepper 2000) — the effect of large option sets on purchase decisions.
5. **Default Effect** (Johnson & Goldstein 2003) — organ donation as a function of opt-in vs opt-out defaults.
6. **Endowment Effect** (Kahneman, Knetsch & Thaler 1990) — the gap between willingness-to-accept and willingness-to-pay for an owned object.
7. **Asian Disease Problem** (Tversky & Kahneman 1981) — Prospect Theory's founding example of framing effects.

For each experiment we used the original prompt as published in the source paper. No paraphrasing. No tuning. No prompt engineering between attempts. Each persona answered each prompt once.

## How we report results

Each result compares the published human baseline (a percentage, median, or ratio, depending on the experiment) with our synthetic audience's aggregate response to the same question.
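As a rough illustration of the comparison described above, the sketch below turns a list of individual choices into an aggregate percentage and measures its distance from a baseline in percentage points. The function names, the option labels, and the numbers are hypothetical; they are not taken from `results.json` or from the Prior.Run pipeline.

```python
from collections import Counter


def aggregate_share(choices, option):
    """Return the percentage of responses that equal `option`."""
    if not choices:
        raise ValueError("choices must not be empty")
    counts = Counter(choices)
    return 100.0 * counts[option] / len(choices)


def gap_pp(baseline_pct, synthetic_pct):
    """Absolute distance between two percentages, in percentage points."""
    return abs(synthetic_pct - baseline_pct)


# Hypothetical data: 100 synthetic respondents, 42 of whom chose option "A",
# compared against an invented human baseline of 40%.
choices = ["A"] * 42 + ["B"] * 58
share = aggregate_share(choices, "A")
print(share, gap_pp(40.0, share))
```

Experiments whose baseline is a median or a ratio would need a different aggregate than a simple share; that case is not covered by this sketch.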

We classify each test into one of three verdicts:

- **Match**: at most 5 percentage points from the published human baseline.
- **Within band**: more than 5 and at most 15 percentage points from the baseline.
- **Miss**: more than 15 percentage points from the baseline.

The thresholds are reported alongside every result; readers who prefer a stricter cut (say, counting only results within 5 pp as successes) can apply it directly.
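The three verdicts can be sketched as a small function with adjustable thresholds, which also shows how a reader might apply a stricter cut. This is a minimal sketch, not the code used to produce `results.json`: the function name and parameters are hypothetical, and treating a gap of exactly 5 or exactly 15 points as belonging to the lower category is an assumption.

```python
def classify(baseline_pct, synthetic_pct, match_pp=5.0, band_pp=15.0):
    """Classify a result by its distance from the baseline in percentage points.

    Assumption: boundaries are inclusive on the lower category, so a gap of
    exactly `match_pp` is a Match and a gap of exactly `band_pp` is Within band.
    """
    if match_pp > band_pp:
        raise ValueError("match_pp must not exceed band_pp")
    gap = abs(synthetic_pct - baseline_pct)
    if gap <= match_pp:
        return "Match"
    if gap <= band_pp:
        return "Within band"
    return "Miss"


# Invented numbers for illustration only.
print(classify(50.0, 58.0))               # default thresholds
print(classify(50.0, 58.0, band_pp=5.0))  # stricter cut: anything over 5 pp is a Miss
```

Setting `band_pp` equal to `match_pp` collapses the middle category, which is one way to express the "5pp only" reading.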

## Caveats we surface ourselves

1. **Confidence interval.** Sample-derived confidence intervals on each individual experiment are approximately ±7 percentage points. Because this interval is wider than the 5-point Match threshold, a Match verdict is a point estimate, not a precision claim. The Decoy Effect result, for example, sits 0.8 percentage points from the published baseline, well inside the ±7-point interval.

2. **No foundation-model baseline yet.** We have not yet published a head-to-head against a generic foundation model with no persona infrastructure. That comparison is the next benchmark on our list. When we run it, we will add the results to a future release of this file and update the changelog.

3. **Misses follow a pattern.** The behavioral effects we miss cleanly are all System 1 / pure-instinct framing effects (Asian Disease Problem, Power of Free's FREE variant, Anchoring's high anchor, Default Effect opt-in, Choice Overload's many-options variant, the Endowment Effect's WTA/WTP ratio). The effects we replicate are System 2 / deliberative-comparison effects. This pattern is consistent with the academic literature on LLM-based personas (e.g. Kolluri et al. 2025, EMNLP).
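To make the first caveat concrete, the sketch below computes a margin of error for a sample proportion using the normal (Wald) approximation. This document does not state the sample size, the confidence level, or the interval method behind the ±7-point figure, so `n = 200`, the 95% level (`z = 1.96`), and the formula itself are all assumptions, chosen only because they produce a margin of similar size.

```python
import math


def margin_pp(p, n, z=1.96):
    """Normal-approximation margin of error for a proportion, in percentage points.

    p: observed proportion between 0 and 1
    n: number of respondents (assumed; not stated in this document)
    z: critical value (1.96 corresponds to a 95% level, also assumed)
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    return 100.0 * z * math.sqrt(p * (1.0 - p) / n)


def exceeds_margin(gap_pp, margin):
    """True when a gap from the baseline is larger than the margin of error."""
    return gap_pp > margin


margin = margin_pp(0.5, 200)  # worst case p = 0.5 with a hypothetical n of 200
print(round(margin, 2))
print(exceeds_margin(0.8, margin))  # a 0.8 pp gap is far smaller than the margin
```

Under these assumptions a 0.8-point gap cannot be told apart from a gap of several points, which is why a small reported gap should not be read as evidence of precision.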

## What we do not publish

To preserve the integrity of the Prior.Run synthetic audience and the persona-construction pipeline that backs it, we do not release:

- Per-persona attributes (demographics, personality vectors, life-trajectory data, bio text).
- The model-routing and persona-construction code that generates the audience.
- Per-persona individual responses to each experiment. Only aggregate results are released.
- The exact internal system prompts used to scaffold each persona.

What we do release is sufficient to verify our claims against the published academic baselines. Anyone wishing to replicate the methodology can use the public source papers and their original published prompts — we used those without modification.

## License

`CC-BY-4.0` — Creative Commons Attribution 4.0 International. Use the results freely with attribution to Prior.Run.

## Contact

For corrections, additional experiments, or research collaboration:
`research@prior.run`

## Versioning

This is release `v1.0` (2026-06-15). Future releases will append to a CHANGELOG within this directory.
