Nov 15, 2025
TypeBench: Measuring Human Taste
By TypeOS Research
TypeBench is the first benchmark specifically designed to measure human taste in writing, derived from real editing behavior rather than artificial ratings. TypeBench also serves as the primary evaluation corpus for TypeOS AI detection accuracy, spanning 50,000+ human-authored and AI-generated documents.
TypeBench evaluates models on style alignment, tone control, document structure, argument quality, and personalization, and it provides the ground-truth corpus used to validate the TypeOS AI detection and essay grading systems.
Corpus Overview
Documents span academic essays (undergraduate and graduate level), blog posts, business reports, and creative writing. AI-generated documents were produced by GPT-3.5, GPT-4, GPT-4o, Claude 3 Sonnet, Claude 3.5 Sonnet, Gemini 1.5 Pro, and LLaMA 3.
AI Detection Evaluation Methodology
TypeBench provides the evaluation corpus and ground-truth labels used to measure AI detection performance. Detection accuracy is calculated as the percentage of correctly classified documents (AI-generated correctly identified as AI; human-written correctly identified as human) across the full 50,000+ document corpus.
False positive rate is defined as the fraction of human-authored documents that a given detector incorrectly flags as AI-generated; this is the metric most consequential for users who risk being falsely accused.
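As a concrete reference point, the sketch below computes both metrics from labeled predictions. It is a generic illustration rather than the TypeBench evaluation code; the function name, label encoding, and toy data are assumptions.

```python
def detection_metrics(y_true, y_pred):
    """Compute detection accuracy and false positive rate.

    y_true: ground-truth labels (1 = AI-generated, 0 = human-authored)
    y_pred: detector predictions in the same encoding
    """
    assert len(y_true) == len(y_pred)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)

    # False positive rate: human-authored documents flagged as AI.
    human = [(t, p) for t, p in zip(y_true, y_pred) if t == 0]
    false_positives = sum(p == 1 for _, p in human)
    fpr = false_positives / len(human) if human else 0.0

    return accuracy, fpr


# Toy usage: six documents, one human document misclassified.
acc, fpr = detection_metrics(
    y_true=[1, 1, 1, 0, 0, 0],
    y_pred=[1, 1, 1, 1, 0, 0],
)
print(f"accuracy={acc:.2%}, false positive rate={fpr:.2%}")
```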
| Detector | Detection Accuracy | False Positive Rate | Docs Evaluated |
|---|---|---|---|
| TypeOS (multi-model) | 99.98% | 1.2% | 50,000+ |
| GPTZero | 86% | 9% | Not disclosed |
| Turnitin AI Detection | 84% | 11% | Not disclosed |
| Single-model baseline | ~85% | ~8.5% | 50,000+ |
TypeOS figures come from this TypeBench internal evaluation. GPTZero and Turnitin figures are drawn from publicly available third-party benchmarks and vendor disclosures where available. Competitor evaluation sets and methodologies were not available for direct comparison.
Essay Grading Evaluation Methodology
TypeBench includes 10,000+ essays graded by both the TypeOS AI grader and a panel of certified human raters (professional educators with 5+ years of grading experience). Inter-rater reliability among human graders was established at Cohen's κ = 0.81 before AI comparison began.
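For readers unfamiliar with the statistic, here is a minimal Cohen's kappa sketch for a single pair of raters; panels with more than two raters typically use an extension such as Fleiss' kappa. The function and toy grades below are illustrative assumptions, not TypeBench code.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical grades."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)

    # Observed agreement: fraction of essays graded identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected chance agreement from each rater's marginal distribution.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    return (p_o - p_e) / (1 - p_e)


# Toy usage: two raters grading six essays on an A-F scale.
print(cohens_kappa(["A", "B", "B", "C", "A", "D"],
                   ["A", "B", "C", "C", "A", "D"]))
```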
Grading correlation is reported as Pearson's r between AI-assigned holistic scores and the mean human rater score. TypeOS AI Essay Grader achieved r = 0.94 across the full evaluation set, with consistent performance across essay types (argumentative, expository, research) and academic levels (high school, undergraduate, graduate).
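The grading correlation can likewise be reproduced from per-essay scores, as sketched below. The scipy call is standard, but the scores shown are hypothetical.

```python
from scipy.stats import pearsonr

# Hypothetical holistic scores on a 0-100 scale for five essays.
ai_scores = [78, 85, 62, 91, 70]

# Each inner list holds the scores assigned by the human rater panel.
human_panels = [[80, 76], [88, 84], [60, 65], [90, 92], [68, 73]]

# Correlate AI scores against the mean human rater score per essay,
# as the methodology above describes.
mean_human = [sum(panel) / len(panel) for panel in human_panels]
r, p_value = pearsonr(ai_scores, mean_human)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```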
| Essay Type | N (essays) | Pearson r | Avg. Grading Time |
|---|---|---|---|
| Argumentative | 3,200 | 0.95 | 24s |
| Expository | 2,800 | 0.93 | 21s |
| Research Paper | 2,400 | 0.94 | 31s |
| Personal Statement | 1,600 | 0.91 | 18s |
| All Types (combined) | 10,000+ | 0.94 | <30s |
Writing Taste Evaluation Methodology
Unlike static benchmarks that rely on multiple-choice questions or abstract reasoning, TypeBench uses "Taste Tasks": complex rewriting and drafting instructions that require stylistic nuance.
We use Bradley-Terry scoring to rank models from pairwise human preference comparisons collected from expert editors; a minimal fitting sketch follows the task list below.
- Rewrite: Transform a rough email into a polished executive summary.
- Tone Shift: Change a defensive message to an apologetic yet firm one.
- Style Mimicry: Write a paragraph in the style of The Economist.
- Summarize: Condense a legal brief without losing key clauses.
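To make the ranking procedure concrete, here is a minimal Bradley-Terry fit using the classic minorization-maximization (Zermelo) updates. It is an illustration only; the function name, win counts, and iteration budget are assumptions, not the TypeBench implementation.

```python
from collections import defaultdict

def bradley_terry(pairwise_wins, iterations=100):
    """Fit Bradley-Terry strengths from pairwise win counts.

    pairwise_wins maps (winner, loser) -> number of preference wins.
    Returns model -> strength, normalized to sum to 1.
    """
    models = {m for pair in pairwise_wins for m in pair}
    strengths = {m: 1.0 / len(models) for m in models}

    wins = defaultdict(float)          # total wins per model
    comparisons = defaultdict(float)   # total comparisons per unordered pair
    for (winner, loser), n in pairwise_wins.items():
        wins[winner] += n
        comparisons[frozenset((winner, loser))] += n

    # Zermelo's minorization-maximization updates.
    for _ in range(iterations):
        new = {}
        for m in models:
            denom = 0.0
            for pair, n in comparisons.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n / (strengths[m] + strengths[other])
            new[m] = wins[m] / denom if denom else strengths[m]
        total = sum(new.values())
        strengths = {m: s / total for m, s in new.items()}
    return strengths


# Toy usage with hypothetical editor preference counts.
prefs = {
    ("Hemingway 1", "Claude 3.5 Sonnet"): 70,
    ("Claude 3.5 Sonnet", "Hemingway 1"): 30,
    ("Hemingway 1", "GPT-4o"): 75,
    ("GPT-4o", "Hemingway 1"): 25,
    ("Claude 3.5 Sonnet", "GPT-4o"): 55,
    ("GPT-4o", "Claude 3.5 Sonnet"): 45,
}
for model, s in sorted(bradley_terry(prefs).items(), key=lambda kv: -kv[1]):
    print(f"{model}: {s:.3f}")
```

A model's fitted strength is its estimated probability weight of winning a head-to-head comparison; ranking by strength yields the leaderboard order reported below.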
Writing Taste Results (HPSR Leaderboard)
| Rank | Model | HPSR Score |
|---|---|---|
| 1 | Hemingway 1 | 89.4 |
| 2 | Claude 3.5 Sonnet | 84.2 |
| 3 | GPT-4o | 82.8 |
HPSR = Human Preference Score Ranking. Scored via Bradley-Terry model on pairwise comparisons by 120 expert editors.


