Large language models are increasingly consulted for medical information, yet no widely adopted benchmark evaluates their performance on women's health. WHBench introduces 47 expert-crafted clinical scenarios across 10 topics and evaluates 22 models with a 23-criterion safety-weighted rubric. Across 3,100 scored responses, no model's mean score exceeds 75% (top: 72.1%), and the frontier tier remains tightly clustered, suggesting a capability ceiling not visible in saturated benchmarks. Performance is also uneven: even the best model achieves only 35.5% fully correct responses, and harm rates vary substantially across otherwise strong systems. Inter-rater reliability is modest at the final label level (kappa = 0.238) but strong for model ranking (rho = 0.916), supporting stable system-level comparison with expert oversight.
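The two agreement statistics reported above can be sketched in a few lines. This is an illustrative, self-contained implementation (the sample ratings and the tie-free Spearman formula are assumptions for demonstration, not the paper's data or exact procedure):

```python
def cohen_kappa(a, b):
    """Cohen's kappa: label-level agreement between two raters,
    corrected for chance agreement."""
    n = len(a)
    labels = set(a) | set(b)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    p_exp = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_obs - p_exp) / (1 - p_exp)

def spearman_rho(x, y):
    """Spearman rank correlation between two score lists
    (simplified formula assuming no tied values)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical example: two raters' labels and two raters' model scores.
labels_a = ["correct", "partial", "partial", "incorrect"]
labels_b = ["correct", "partial", "incorrect", "incorrect"]
kappa = cohen_kappa(labels_a, labels_b)
rho = spearman_rho([72.1, 66.8, 63.6, 60.2], [70.0, 65.0, 64.0, 58.0])
```

The intuition behind the abstract's numbers: raters may disagree on the final label of individual responses (low kappa) while still ranking models in nearly the same order (high rho).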
Paper Figures
Mean normalized score (%) with 95% bootstrap confidence intervals (n=10,000). The dashed line marks the 80% threshold for “Correct” classification.
Key finding: Claude Opus 4.6 leads at 72.1%, but no model crosses the 80% “Correct” threshold. Most frontier models cluster in the low-to-mid 60s, suggesting a capability ceiling not visible in saturated benchmarks.
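The bootstrap confidence intervals in this figure can be sketched as follows. This is a minimal percentile-bootstrap implementation over synthetic per-scenario scores (the `scores` list and the seed are illustrative assumptions, not the benchmark's data):

```python
import random

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample the scores with replacement
    n_boot times, take the mean of each resample, and read the CI
    off the sorted distribution of means."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(scores) / n, (lo, hi)

# Hypothetical per-scenario normalized scores (0-100) for one model.
scores = [72, 65, 80, 58, 70, 74, 61, 68, 77, 63]
mean, (lo, hi) = bootstrap_ci(scores)
```

With n=10,000 resamples, as in the figure, the interval endpoints are stable to well under a percentage point for samples of this size.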
Overall normalized score (%) versus safety category mean pass rate (%). Dashed lines mark the median overall score and median safety.
Key finding: Only two models — Claude Opus 4.6 and Claude Sonnet 4.6 — sit in the high-score, high-safety quadrant. The remaining frontier models cluster in the 80–90% safety band, while open-source models fall below 65%.
Mean normalized score (%) across 10 clinical topics. Darker shading indicates higher scores.
Key finding: Contraception is the most challenging topic overall (lowest cross-model mean), while Hormonal Health/HRT shows the largest cross-model spread. Cancer Screening and Pregnancy also exhibit substantial variance.
Supplementary Analysis
Proportion of fully correct, partially correct, and incorrect responses for each model.
Key finding: Even the best model (Claude Opus 4.6) achieved only 35.5% fully correct responses. Most frontier models produce predominantly partial responses, while lower-ranked models see incorrect rates exceeding 70%.
Percentage of responses flagged for potential clinical harm across all 22 models.
Critical gap: Harm rates varied substantially across otherwise strong systems. Among the top 5 models, harm ranges from 12.8% (Claude Opus 4.6) to 47.5% (GPT-5.4), despite only an 8-point gap in overall score. Lower-ranked and open-source models reach harm rates above 80%, peaking at 90.8% (Gemini 2.5 Pro).
Relationship between overall WHBench score and clinical harm rate across all 22 models, grouped by model type.
Key insight: High overall scores do not guarantee low harm. GPT-5.4 (#3, 66.8%) and OpenAI o3 (#5, 63.6%) have harm rates of 47.5% and 38.6% respectively, while the similarly scored Mistral Large (#8, 60.2%) achieves a lower harm rate of 30.5%. This uneven relationship underscores the need for safety-specific evaluation beyond aggregate scoring.
The rubric combines detailed criterion scoring with quality controls to assess clinical safety, reasoning, completeness, communication, and equity in women's health scenarios.
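A safety-weighted normalized score of this kind can be sketched as a weighted mean over per-criterion scores, scaled to 0–100. The weights and the [0, 1] per-criterion scale below are illustrative assumptions, not the benchmark's actual rubric values:

```python
def normalized_score(criterion_scores, weights):
    """Weighted mean of per-criterion scores in [0, 1], scaled to 0-100.
    Safety criteria would carry larger weights than, e.g., style criteria."""
    total = sum(w * s for w, s in zip(weights, criterion_scores))
    max_total = sum(weights)  # score of 1.0 on every criterion
    return 100 * total / max_total

# Hypothetical: three criteria where safety (weight 3) dominates.
score = normalized_score([0.5, 1.0, 1.0], weights=[3, 1, 1])
```

Under this weighting, a miss on a safety criterion drags the overall score down far more than a miss on a low-weight criterion, which is the behavior a safety-weighted rubric is meant to produce.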
Failure mode taxonomy: In addition to numeric scoring, each response is checked for high-risk error patterns across six categories: missing critical information, factually incorrect or outdated clinical information, health equity gaps, incorrect or harmful treatment recommendations, contraindication or dosage errors, and other clinically significant errors. This provides a structured safety signal that complements the overall score.
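The six failure categories can be represented as a simple enumeration. The `harm_flagged` roll-up rule below is an illustrative assumption about how category flags could feed the harm rate, not the benchmark's exact definition:

```python
from enum import Enum

class FailureMode(Enum):
    """Six high-risk error categories checked alongside numeric scoring."""
    MISSING_CRITICAL_INFO = "missing critical information"
    FACTUAL_OR_OUTDATED = "factually incorrect or outdated clinical information"
    EQUITY_GAP = "health equity gap"
    HARMFUL_RECOMMENDATION = "incorrect or harmful treatment recommendation"
    CONTRAINDICATION_DOSAGE = "contraindication or dosage error"
    OTHER_SIGNIFICANT = "other clinically significant error"

def harm_flagged(flags):
    """Illustrative rule: a response counts toward the harm rate
    if any high-risk failure mode was detected."""
    return len(flags) > 0
```

Keeping the taxonomy as structured flags, separate from the numeric score, is what makes findings like "high score, high harm" expressible at all.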
How scoring is applied, validated, and quality-checked.
Six categories of high-risk error patterns detected alongside scoring.
Category-level view of all 8 rubric dimensions and their 23 criteria.