Large language models are increasingly consulted for medical information, yet no widely adopted benchmark evaluates their performance on women's health. WHBench introduces 47 expert-crafted clinical scenarios across 10 topics and evaluates 22 models with a 23-criterion safety-weighted rubric. Across 3,100 scored responses, no model's mean score exceeds 75% (top: 72.1%), and the frontier tier remains tightly clustered, suggesting a capability ceiling not visible in saturated benchmarks. Performance is also uneven: even the best model achieves only 35.5% fully correct responses, and harm rates vary substantially across otherwise strong systems. Inter-rater reliability is modest at the final label level (kappa = 0.238) but strong for model ranking (rho = 0.916), supporting stable system-level comparison with expert oversight.
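The two agreement statistics reported above can be sketched in a few lines. This is an illustrative, self-contained implementation (the sample ratings and the tie-free Spearman formula are assumptions for demonstration, not the paper's data or exact procedure):

```python
def cohen_kappa(a, b):
    """Cohen's kappa: label-level agreement between two raters,
    corrected for chance agreement."""
    n = len(a)
    labels = set(a) | set(b)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    p_exp = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_obs - p_exp) / (1 - p_exp)

def spearman_rho(x, y):
    """Spearman rank correlation between two score lists
    (simplified formula assuming no tied values)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical example: two raters' labels and two raters' model scores.
labels_a = ["correct", "partial", "partial", "incorrect"]
labels_b = ["correct", "partial", "incorrect", "incorrect"]
kappa = cohen_kappa(labels_a, labels_b)
rho = spearman_rho([72.1, 66.8, 63.6, 60.2], [70.0, 65.0, 64.0, 58.0])
```

The intuition behind the abstract's numbers: raters may disagree on the final label of individual responses (low kappa) while still ranking models in nearly the same order (high rho).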
Paper Figures
Mean normalized score (%) with 95% bootstrap confidence intervals (n=10,000). The dashed line marks the 80% threshold for “Correct” classification.
Key finding: Claude Opus 4.6 leads at 72.1%, but no model crosses the 80% “Correct” threshold. Most frontier models cluster in the low-to-mid 60s, suggesting a capability ceiling not visible in saturated benchmarks.
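The bootstrap confidence intervals in this figure can be sketched as follows. This is a minimal percentile-bootstrap implementation over synthetic per-scenario scores (the `scores` list and the seed are illustrative assumptions, not the benchmark's data):

```python
import random

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample the scores with replacement
    n_boot times, take the mean of each resample, and read the CI
    off the sorted distribution of means."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(scores) / n, (lo, hi)

# Hypothetical per-scenario normalized scores (0-100) for one model.
scores = [72, 65, 80, 58, 70, 74, 61, 68, 77, 63]
mean, (lo, hi) = bootstrap_ci(scores)
```

With n=10,000 resamples, as in the figure, the interval endpoints are stable to well under a percentage point for samples of this size.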
Overall normalized score (%) versus safety category mean pass rate (%). Dashed lines mark the median overall score and median safety.
Key finding: Only two models — Claude Opus 4.6 and Claude Sonnet 4.6 — sit in the high-score, high-safety quadrant. The remaining frontier models cluster in the 80–90% safety band, while open-source models fall below 65%.
Mean normalized score (%) across 10 clinical topics. Darker shading indicates higher scores.
Key finding: Contraception is the most challenging topic overall (lowest cross-model mean), while Hormonal Health/HRT shows the largest cross-model spread. Cancer Screening and Pregnancy also exhibit substantial variance.
Supplementary Analysis
Proportion of fully correct, partially correct, and incorrect responses for each model.
Key finding: Even the best model (Claude Opus 4.6) achieved only 35.5% fully correct responses. Most frontier models produce predominantly partial responses, while lower-ranked models see incorrect rates exceeding 70%.
Percentage of responses flagged for potential clinical harm across all 22 models.
Critical gap: Harm rates varied substantially across otherwise strong systems. Among the top 5 models, harm ranges from 12.8% (Claude Opus 4.6) to 47.5% (GPT-5.4), despite only an 8-point gap in overall score. Lower-ranked and open-source models reach harm rates above 80%, peaking at 90.8% (Gemini 2.5 Pro).
Relationship between overall WHBench score and clinical harm rate across all 22 models, grouped by model type.
Key insight: High overall scores do not guarantee low harm. GPT-5.4 (#3, 66.8%) and OpenAI o3 (#5, 63.6%) have harm rates of 47.5% and 38.6% respectively, while the similarly scored Mistral Large (#8, 60.2%) achieves a lower harm rate of 30.5%. This uneven relationship underscores the need for safety-specific evaluation beyond aggregate scoring.
The rubric combines detailed criterion scoring with quality controls to assess clinical safety, reasoning, completeness, communication, and equity in women's health scenarios.
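A safety-weighted normalized score of this kind can be sketched as a weighted mean over per-criterion scores, scaled to 0–100. The weights and the [0, 1] per-criterion scale below are illustrative assumptions, not the benchmark's actual rubric values:

```python
def normalized_score(criterion_scores, weights):
    """Weighted mean of per-criterion scores in [0, 1], scaled to 0-100.
    Safety criteria would carry larger weights than, e.g., style criteria."""
    total = sum(w * s for w, s in zip(weights, criterion_scores))
    max_total = sum(weights)  # score of 1.0 on every criterion
    return 100 * total / max_total

# Hypothetical: three criteria where safety (weight 3) dominates.
score = normalized_score([0.5, 1.0, 1.0], weights=[3, 1, 1])
```

Under this weighting, a miss on a safety criterion drags the overall score down far more than a miss on a low-weight criterion, which is the behavior a safety-weighted rubric is meant to produce.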
Failure mode taxonomy: In addition to numeric scoring, each response is checked for high-risk error patterns across six categories: missing critical information, factually incorrect or outdated clinical information, health equity gaps, incorrect or harmful treatment recommendations, contraindication or dosage errors, and other clinically significant errors. This provides a structured safety signal that complements the overall score.
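The six failure categories can be represented as a simple enumeration. The `harm_flagged` roll-up rule below is an illustrative assumption about how category flags could feed the harm rate, not the benchmark's exact definition:

```python
from enum import Enum

class FailureMode(Enum):
    """Six high-risk error categories checked alongside numeric scoring."""
    MISSING_CRITICAL_INFO = "missing critical information"
    FACTUAL_OR_OUTDATED = "factually incorrect or outdated clinical information"
    EQUITY_GAP = "health equity gap"
    HARMFUL_RECOMMENDATION = "incorrect or harmful treatment recommendation"
    CONTRAINDICATION_DOSAGE = "contraindication or dosage error"
    OTHER_SIGNIFICANT = "other clinically significant error"

def harm_flagged(flags):
    """Illustrative rule: a response counts toward the harm rate
    if any high-risk failure mode was detected."""
    return len(flags) > 0
```

Keeping the taxonomy as structured flags, separate from the numeric score, is what makes findings like "high score, high harm" expressible at all.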
How scoring is applied, validated, and quality-checked.
Six categories of high-risk error patterns detected alongside scoring.
Category-level view of all 8 rubric dimensions and their 23 criteria.