TypeBench: Measuring Human Taste

Nov 15, 2025


By TypeOS Research


TypeBench is the first benchmark specifically measuring human taste in writing, derived from real editing behavior rather than artificial ratings. It also serves as the primary evaluation corpus for TypeOS AI detection accuracy, spanning 50,000+ human-authored and AI-generated documents.

TypeBench evaluates models on style alignment, tone control, document structure, argument quality, and personalization. It also provides the ground-truth corpus used to validate the TypeOS AI detection and essay grading systems.

Corpus Overview

Total Documents: 50,000+
Human-Authored: 28,400
AI-Generated: 21,600
AI Models Represented: 7

Documents span academic essays (undergraduate and graduate level), blog posts, business reports, and creative writing. AI-generated documents were produced by GPT-3.5, GPT-4, GPT-4o, Claude 3 Sonnet, Claude 3.5 Sonnet, Gemini 1.5 Pro, and LLaMA 3.

AI Detection Evaluation Methodology

TypeBench provides the evaluation corpus and ground-truth labels used to measure AI detection performance. Detection accuracy is calculated as the percentage of correctly classified documents (AI-generated correctly identified as AI; human-written correctly identified as human) across the full 50,000+ document corpus.

False positive rate is defined as the fraction of human-authored documents that a given detector incorrectly flags as AI-generated — the metric most consequential for users who risk being falsely accused.
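The two metrics above can be made concrete with a minimal sketch. This is an illustrative helper, not TypeOS code; the label convention (1 = AI-generated, 0 = human-authored) is an assumption chosen for the example.

```python
def detection_metrics(y_true, y_pred):
    """Compute detection accuracy and false positive rate.

    Label convention (assumed for this sketch):
      1 = AI-generated, 0 = human-authored.
    Accuracy = fraction of all documents classified correctly.
    FPR      = fraction of human-authored documents flagged as AI.
    """
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)

    human_preds = [p for t, p in zip(y_true, y_pred) if t == 0]
    false_positives = sum(p == 1 for p in human_preds)
    fpr = false_positives / len(human_preds)
    return accuracy, fpr


# Toy run: 5 documents, one missed AI text and one falsely flagged human text.
acc, fpr = detection_metrics([1, 1, 0, 0, 0], [1, 0, 1, 0, 0])
```

On the toy labels above, accuracy is 3/5 and the false positive rate is 1/3, since one of the three human-authored documents was flagged as AI.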

Detector | Detection Accuracy | False Positive Rate | Docs Evaluated
TypeOS (multi-model) | 99.98% | 1.2% | 50,000+
GPTZero | 86% | 9% | Not disclosed
Turnitin AI Detection | 84% | 11% | Not disclosed
Single-model baseline | ~85% | ~8.5% | 50,000+

TypeOS figures come from this internal TypeBench evaluation. GPTZero and Turnitin figures are drawn from publicly available third-party benchmarks and vendor disclosures where available; their evaluation sets and methodologies were not available for direct comparison.

Essay Grading Evaluation Methodology

TypeBench includes 10,000+ essays graded by both the TypeOS AI grader and a panel of certified human raters (professional educators with 5+ years of grading experience). Inter-rater reliability among human graders was established at Cohen's κ = 0.81 before AI comparison began.

Grading correlation is reported as Pearson's r between AI-assigned holistic scores and the mean human rater score. TypeOS AI Essay Grader achieved r = 0.94 across the full evaluation set, with consistent performance across essay types (argumentative, expository, research) and academic levels (high school, undergraduate, graduate).
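Both reliability statistics used here, Cohen's κ for human inter-rater agreement and Pearson's r for AI-versus-human score correlation, have short closed-form definitions. The sketch below is illustrative and independent of the TypeOS grading pipeline.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two score lists of equal length."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def cohens_kappa(r1, r2):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(r1)
    labels = sorted(set(r1) | set(r2))
    p_o = sum(a == b for a, b in zip(r1, r2)) / n          # observed agreement
    p_e = sum((r1.count(l) / n) * (r2.count(l) / n)        # chance agreement
              for l in labels)
    return (p_o - p_e) / (1 - p_e)
```

For production use, `scipy.stats.pearsonr` and `sklearn.metrics.cohen_kappa_score` implement the same statistics with better numerical handling.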

Essay Type | N (essays) | Pearson r | Avg. Grading Time
Argumentative | 3,200 | 0.95 | 24s
Expository | 2,800 | 0.93 | 21s
Research Paper | 2,400 | 0.94 | 31s
Personal Statement | 1,600 | 0.91 | 18s
All Types (combined) | 10,000+ | 0.94 | <30s

Writing Taste Evaluation Methodology

Unlike static benchmarks that rely on multiple-choice questions or abstract reasoning, TypeBench uses "Taste Tasks": complex rewriting and drafting instructions that require stylistic nuance.

We use Bradley-Terry scoring to rank models based on pairwise human preference comparisons from expert editors.

  • Rewrite: Transform a rough email into a polished executive summary.
  • Tone Shift: Change a defensive message to an apologetic yet firm one.
  • Style Mimicry: Write a paragraph in the style of The Economist.
  • Summarize: Condense a legal brief without losing key clauses.
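Bradley-Terry scoring of this kind can be fit with a short iterative (minorization-maximization) update. This is a generic sketch of the standard algorithm, not the TypeBench implementation; the win-matrix format is an assumption for illustration.

```python
def bradley_terry(wins, n_models, iters=200):
    """Fit Bradley-Terry strengths from pairwise preference counts.

    wins[i][j] = number of times editors preferred model i over model j.
    Under the model, P(i beats j) = p[i] / (p[i] + p[j]).
    Uses the standard MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j),
    where W_i is i's total wins and n_ij the total i-vs-j comparisons.
    """
    p = [1.0] * n_models
    for _ in range(iters):
        new_p = []
        for i in range(n_models):
            total_wins = sum(wins[i][j] for j in range(n_models) if j != i)
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n_models) if j != i)
            new_p.append(total_wins / denom if denom else p[i])
        s = sum(new_p)
        p = [x / s for x in new_p]  # normalize so strengths sum to 1
    return p


# Toy run: model 0 wins 3 of 4 comparisons against model 1.
strengths = bradley_terry([[0, 3], [1, 0]], n_models=2)
```

With a 3-of-4 win record the fitted strengths converge to roughly 0.75 vs 0.25, matching the empirical preference rate. Leaderboard scores such as HPSR would then be a rescaling of these strengths.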

Writing Taste Results (HPSR Leaderboard)

Rank | Model | HPSR Score
1 | Hemingway 1 | 89.4
2 | Claude 3.5 Sonnet | 84.2
3 | GPT-4o | 82.8

HPSR = Human Preference Score Ranking. Scored via Bradley-Terry model on pairwise comparisons by 120 expert editors.