Nov 15, 2025
TypeBench: Measuring Human Taste
By TypeOS Research
TypeBench is the first benchmark specifically designed to measure human taste in writing, derived from real editing behavior rather than artificial ratings. TypeBench also serves as the primary evaluation corpus for TypeOS AI detection accuracy, spanning 50,000+ human-authored and AI-generated documents.
TypeBench evaluates models on style alignment, tone control, document structure, argument quality, and personalization, and it provides the ground-truth corpus used to validate the TypeOS AI detection and essay grading systems.
Corpus Overview
Documents span academic essays (undergraduate and graduate level), blog posts, business reports, and creative writing. AI-generated documents were produced by GPT-3.5, GPT-4, GPT-4o, Claude 3 Sonnet, Claude 3.5 Sonnet, Gemini 1.5 Pro, and LLaMA 3.
AI Detection Evaluation Methodology
TypeBench provides the evaluation corpus and ground-truth labels used to measure AI detection performance. Detection accuracy is calculated as the percentage of correctly classified documents (AI-generated correctly identified as AI; human-written correctly identified as human) across the full 50,000+ document corpus.
False positive rate is defined as the fraction of human-authored documents that a given detector incorrectly flags as AI-generated; this is the metric most consequential for users who risk being falsely accused.
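As a concrete reference point, the sketch below computes both metrics from labeled predictions. It is a generic illustration rather than the TypeBench evaluation code; the function name, label encoding, and toy data are assumptions.

```python
def detection_metrics(y_true, y_pred):
    """Compute detection accuracy and false positive rate.

    y_true: ground-truth labels (1 = AI-generated, 0 = human-authored)
    y_pred: detector predictions in the same encoding
    """
    assert len(y_true) == len(y_pred)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)

    # False positive rate: human-authored documents flagged as AI.
    human = [(t, p) for t, p in zip(y_true, y_pred) if t == 0]
    false_positives = sum(p == 1 for _, p in human)
    fpr = false_positives / len(human) if human else 0.0

    return accuracy, fpr


# Toy usage: six documents, one human document misclassified.
acc, fpr = detection_metrics(
    y_true=[1, 1, 1, 0, 0, 0],
    y_pred=[1, 1, 1, 1, 0, 0],
)
print(f"accuracy={acc:.2%}, false positive rate={fpr:.2%}")
```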
| Detector | Detection Accuracy | False Positive Rate | Docs Evaluated |
|---|---|---|---|
| TypeOS (multi-model) | 99.98% | 1.2% | 50,000+ |
| GPTZero | 86% | 9% | Not disclosed |
| Turnitin AI Detection | 84% | 11% | Not disclosed |
| Single-model baseline | ~85% | ~8.5% | 50,000+ |
TypeOS figures come from this TypeBench internal evaluation. GPTZero and Turnitin figures are drawn from publicly available third-party benchmarks and vendor disclosures where available. Competitor evaluation sets and methodologies were not available for direct comparison.
Essay Grading Evaluation Methodology
TypeBench includes 10,000+ essays graded by both the TypeOS AI grader and a panel of certified human raters (professional educators with 5+ years of grading experience). Inter-rater reliability among human graders was established at Cohen's κ = 0.81 before AI comparison began.
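For readers unfamiliar with the statistic, here is a minimal Cohen's kappa sketch for a single pair of raters; panels with more than two raters typically use an extension such as Fleiss' kappa. The function and toy grades below are illustrative assumptions, not TypeBench code.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical grades."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)

    # Observed agreement: fraction of essays graded identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected chance agreement from each rater's marginal distribution.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    return (p_o - p_e) / (1 - p_e)


# Toy usage: two raters grading six essays on an A-F scale.
print(cohens_kappa(["A", "B", "B", "C", "A", "D"],
                   ["A", "B", "C", "C", "A", "D"]))
```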
Grading correlation is reported as Pearson's r between AI-assigned holistic scores and the mean human rater score. TypeOS AI Essay Grader achieved r = 0.94 across the full evaluation set, with consistent performance across essay types (argumentative, expository, research) and academic levels (high school, undergraduate, graduate).
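The grading correlation can likewise be reproduced from per-essay scores, as sketched below. The scipy call is standard, but the scores shown are hypothetical.

```python
from scipy.stats import pearsonr

# Hypothetical holistic scores on a 0-100 scale for five essays.
ai_scores = [78, 85, 62, 91, 70]

# Each inner list holds the scores assigned by the human rater panel.
human_panels = [[80, 76], [88, 84], [60, 65], [90, 92], [68, 73]]

# Correlate AI scores against the mean human rater score per essay,
# as the methodology above describes.
mean_human = [sum(panel) / len(panel) for panel in human_panels]
r, p_value = pearsonr(ai_scores, mean_human)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```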
| Essay Type | N (essays) | Pearson r | Avg. Grading Time |
|---|---|---|---|
| Argumentative | 3,200 | 0.95 | 24s |
| Expository | 2,800 | 0.93 | 21s |
| Research Paper | 2,400 | 0.94 | 31s |
| Personal Statement | 1,600 | 0.91 | 18s |
| All Types (combined) | 10,000+ | 0.94 | <30s |
Writing Taste Evaluation Methodology
Unlike static benchmarks that rely on multiple-choice questions or abstract reasoning, TypeBench uses "Taste Tasks": complex rewriting and drafting instructions that require stylistic nuance.
We use Bradley-Terry scoring to rank models from pairwise human preference comparisons collected from expert editors; a minimal fitting sketch follows the task list below.
- Rewrite: Transform a rough email into a polished executive summary.
- Tone Shift: Change a defensive message to an apologetic yet firm one.
- Style Mimicry: Write a paragraph in the style of The Economist.
- Summarize: Condense a legal brief without losing key clauses.
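To make the ranking procedure concrete, here is a minimal Bradley-Terry fit using the classic minorization-maximization (Zermelo) updates. It is an illustration only; the function name, win counts, and iteration budget are assumptions, not the TypeBench implementation.

```python
from collections import defaultdict

def bradley_terry(pairwise_wins, iterations=100):
    """Fit Bradley-Terry strengths from pairwise win counts.

    pairwise_wins maps (winner, loser) -> number of preference wins.
    Returns model -> strength, normalized to sum to 1.
    """
    models = {m for pair in pairwise_wins for m in pair}
    strengths = {m: 1.0 / len(models) for m in models}

    wins = defaultdict(float)          # total wins per model
    comparisons = defaultdict(float)   # total comparisons per unordered pair
    for (winner, loser), n in pairwise_wins.items():
        wins[winner] += n
        comparisons[frozenset((winner, loser))] += n

    # Zermelo's minorization-maximization updates.
    for _ in range(iterations):
        new = {}
        for m in models:
            denom = 0.0
            for pair, n in comparisons.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n / (strengths[m] + strengths[other])
            new[m] = wins[m] / denom if denom else strengths[m]
        total = sum(new.values())
        strengths = {m: s / total for m, s in new.items()}
    return strengths


# Toy usage with hypothetical editor preference counts.
prefs = {
    ("Hemingway 1", "Claude 3.5 Sonnet"): 70,
    ("Claude 3.5 Sonnet", "Hemingway 1"): 30,
    ("Hemingway 1", "GPT-4o"): 75,
    ("GPT-4o", "Hemingway 1"): 25,
    ("Claude 3.5 Sonnet", "GPT-4o"): 55,
    ("GPT-4o", "Claude 3.5 Sonnet"): 45,
}
for model, s in sorted(bradley_terry(prefs).items(), key=lambda kv: -kv[1]):
    print(f"{model}: {s:.3f}")
```

A model's fitted strength is its estimated probability weight of winning a head-to-head comparison; ranking by strength yields the leaderboard order reported below.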
Writing Taste Results (HPSR Leaderboard)
| Rank | Model | HPSR Score |
|---|---|---|
| 1 | Hemingway 1 | 89.4 |
| 2 | Claude 3.5 Sonnet | 84.2 |
| 3 | GPT-4o | 82.8 |
HPSR = Human Preference Score Ranking. Scored via Bradley-Terry model on pairwise comparisons by 120 expert editors.


