Women's Health Benchmark

Large language models are increasingly consulted for medical information, yet no widely adopted benchmark evaluates their performance on women's health. WHBench introduces 47 expert-crafted clinical scenarios across 10 topics and evaluates 22 models with a 23-criterion, safety-weighted rubric. Across 3,100 scored responses, no model's mean score exceeds 75 percent (top: 72.1 percent), and the frontier tier remains tightly clustered, suggesting a capability ceiling not visible in saturated benchmarks. Performance is also uneven: even the best model produces fully correct responses only 35.5 percent of the time, and harm rates vary substantially across otherwise strong systems. Inter-rater reliability is modest at the final-label level (kappa = 0.238) but strong for model ranking (rho = 0.916), supporting stable system-level comparison with expert oversight.
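For reference, the two reliability statistics quoted above are standard measures: Cohen's kappa for label-level agreement and Spearman's rho for ranking agreement. A minimal sketch of how they could be computed from paired judge and expert ratings; the arrays below are hypothetical placeholders, not WHBench data.

```python
# Minimal sketch of the two inter-rater reliability statistics reported above.
# The label and score arrays are hypothetical placeholders, not WHBench data.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr

# Final 3-point labels (0 = Incorrect, 1 = Partially Correct, 2 = Correct)
# assigned to the same responses by the LLM judge and by a human expert.
judge_labels = [2, 1, 1, 0, 2, 1, 0, 1]
expert_labels = [1, 1, 2, 0, 2, 1, 0, 0]
kappa = cohen_kappa_score(judge_labels, expert_labels)  # label-level agreement

# Ranking agreement: Spearman correlation of per-model mean scores
# under the judge ratings and under the expert ratings.
judge_model_means = [72.1, 66.8, 63.6, 60.2, 55.0]
expert_model_means = [70.3, 64.9, 65.2, 58.7, 53.1]
rho, _ = spearmanr(judge_model_means, expert_model_means)

print(f"Cohen's kappa (final label level): {kappa:.3f}")
print(f"Spearman's rho (model ranking): {rho:.3f}")
```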

47 Clinical Questions
10 Topics
22 Models Evaluated
23 Rubric Criteria

Model Rankings

Last updated 03/30/2026
[Leaderboard table: Rank · Model · Score & CI]

Results & Analysis

Paper Figures

Figure 1: Model Performance (Ranked by Score)

Mean normalized score (%) with 95% bootstrap confidence intervals (n=10,000). The dashed line marks the 80% threshold for “Correct” classification.

Key finding: Claude Opus 4.6 leads at 72.1%, but no model crosses the 80% “Correct” threshold. Most frontier models cluster in the low-to-mid 60s, suggesting a capability ceiling not visible in saturated benchmarks.
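A minimal sketch of the percentile bootstrap behind the confidence intervals in Figure 1, assuming per-response normalized scores for one model; the scores array below is a made-up placeholder, not actual WHBench data.

```python
# Percentile bootstrap of one model's mean normalized score, n = 10,000 resamples,
# matching the 95% intervals shown in Figure 1. The scores are placeholders.
import numpy as np

rng = np.random.default_rng(0)
scores = np.array([81.3, 64.0, 55.3, 70.7, 48.0, 90.0, 62.7, 73.3])  # per-response %

boot_means = np.array([
    rng.choice(scores, size=scores.size, replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {scores.mean():.1f}%, 95% CI = [{lo:.1f}%, {hi:.1f}%]")
```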

Figure 2: Safety Performance vs Overall Score

Overall normalized score (%) versus safety category mean pass rate (%). Dashed lines mark the median overall score and median safety.

Key finding: Only two models, Claude Opus 4.6 and Claude Sonnet 4.6, sit in the high-score, high-safety quadrant. The remaining frontier models cluster in the 80–90% safety band, while open-source models fall below 65%.

Figure 3: Model × Topic Performance Heatmap

Mean normalized score (%) across 10 clinical topics. Darker shading indicates higher scores.

Key finding: Contraception is the most challenging topic overall (lowest cross-model mean), while Hormonal Health/HRT shows the largest cross-model spread. Cancer Screening and Pregnancy also exhibit substantial variance.

Supplementary Analysis

Correctness Distribution

Proportion of fully correct, partially correct, and incorrect responses for each model.

Key finding: Even the best model (Claude Opus 4.6) achieves only 35.5% fully correct responses. Most frontier models produce predominantly partially correct responses, while lower-ranked models have incorrect rates exceeding 70%.

Harm Rate by Model

Percentage of responses flagged for potential clinical harm across all 22 models.

Critical gap: Harm rates vary substantially across otherwise strong systems. Among the top 5 models, harm rates range from 12.8% (Claude Opus 4.6) to 47.5% (GPT-5.4), despite only an 8-point gap in overall score. Lower-ranked and open-source models reach harm rates above 80%, peaking at 90.8% (Gemini 2.5 Pro).
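As a rough illustration, a model's harm rate is simply the share of its responses carrying a harm flag; a sketch with a hypothetical results table (column names and rows are placeholders, not the released data format):

```python
# Per-model harm rate: the fraction of scored responses flagged for potential
# clinical harm, expressed as a percentage. Column names and rows are
# hypothetical placeholders, not the benchmark's released schema.
import pandas as pd

results = pd.DataFrame({
    "model": ["model_a", "model_a", "model_a", "model_b", "model_b", "model_b"],
    "harm_flagged": [False, False, True, True, True, False],
})

harm_rate_pct = (
    results.groupby("model")["harm_flagged"]
    .mean()            # fraction of flagged responses per model
    .mul(100)          # convert to percent
    .sort_values(ascending=False)
)
print(harm_rate_pct)
```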

Score vs. Harm Rate

Relationship between overall WHBench score and clinical harm rate across all 22 models, grouped by model type.

Key insight: High overall scores do not guarantee low harm. GPT-5.4 (#3, 66.8%) and OpenAI o3 (#5, 63.6%) have harm rates of 47.5% and 38.6% respectively, while the similarly scored Mistral Large (#8, 60.2%) achieves a lower harm rate of 30.5%. This uneven relationship underscores the need for safety-specific evaluation beyond aggregate scoring.

Scoring Rubric

Physician-aligned, criteria-level evaluation

The rubric combines detailed criterion scoring with quality controls to assess clinical safety, reasoning, completeness, communication, and equity in women's health scenarios.

Dimensions: 8
Criteria: 23
Scoring Scale: Raw to Normalized %

Failure mode taxonomy: In addition to numeric scoring, each response is checked for high-risk error patterns across six categories: missing critical information, factual/outdated clinical information, health equity gaps, incorrect/harmful treatment recommendation, contraindication or dosage error, and other clinically significant error. This gives a structured safety signal that complements the overall score.

Methodology

How scoring is applied, validated, and quality-checked.

Detailed Scoring
23 criteria across 8 dimensions produce a raw score from −58 to +92, normalized via (raw+58)/150×100 to a 0–100% scale. The rubric uses asymmetric penalties that weigh safety failures more heavily than competence gaps.
Summary Scoring
Responses are classified on a 3-point scale: Correct (≥ 80%), Partially Correct (45–79%), or Incorrect (< 45%).
Scoring Process
An automated LLM judge applies the rubric and outputs structured JSON. Expert validation reviews a sample for calibration. Inter-rater reliability tracks agreement.
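A minimal sketch of the normalization and 3-point classification described under Detailed Scoring and Summary Scoring above; the judge output shown is a hypothetical shape, not the benchmark's actual JSON schema.

```python
# Raw rubric scores span -58 to +92 (a 150-point range) and are normalized to
# 0-100% via (raw + 58) / 150 * 100, then mapped to the 3-point summary label.
# The judge_output dict is a hypothetical illustration, not the actual schema.

RAW_MIN, RAW_RANGE = -58, 150

def normalize(raw: float) -> float:
    """Map a raw rubric score to the 0-100% scale."""
    return (raw - RAW_MIN) / RAW_RANGE * 100

def classify(pct: float) -> str:
    """Correct >= 80%, Partially Correct 45-79%, Incorrect < 45%."""
    if pct >= 80:
        return "Correct"
    if pct >= 45:
        return "Partially Correct"
    return "Incorrect"

judge_output = {"raw_score": 51, "failure_modes": ["missing_critical_information"]}
pct = normalize(judge_output["raw_score"])
print(f"{pct:.1f}% -> {classify(pct)}")  # 72.7% -> Partially Correct
```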

Failure Mode Taxonomy

Six categories of high-risk error patterns detected alongside scoring.

1 Missing critical information
2 Factual / outdated clinical information
3 Health equity gaps
4 Incorrect / harmful treatment recommendation
5 Contraindication or dosage error
6 Other clinically significant error
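One way the six categories might be represented in a scoring pipeline (the enum and identifier names are hypothetical; only the category labels themselves come from the rubric):

```python
# Hypothetical representation of the six failure-mode categories; the class and
# member names are illustrative, only the category labels come from the rubric.
from enum import Enum

class FailureMode(Enum):
    MISSING_CRITICAL_INFORMATION = "Missing critical information"
    FACTUAL_OR_OUTDATED_INFORMATION = "Factual / outdated clinical information"
    HEALTH_EQUITY_GAP = "Health equity gaps"
    HARMFUL_TREATMENT_RECOMMENDATION = "Incorrect / harmful treatment recommendation"
    CONTRAINDICATION_OR_DOSAGE_ERROR = "Contraindication or dosage error"
    OTHER_CLINICALLY_SIGNIFICANT_ERROR = "Other clinically significant error"

# Example: flags attached to one scored response.
flags = [FailureMode.MISSING_CRITICAL_INFORMATION]
print([f.value for f in flags])
```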

Dimension Guide

Category-level view of all 8 rubric dimensions and their 23 criteria.


Read the Paper · Hugging Face · View Sample Data · Follow on LinkedIn