Women's Health Benchmark

Large language models are increasingly consulted for medical information, yet no widely adopted benchmark evaluates their performance on women's health. WHBench introduces 47 expert-crafted clinical scenarios across 10 topics and evaluates 22 models with a 23-criterion safety-weighted rubric. Across 3,100 scored responses, no model mean exceeds 75% (top: 72.1%), with substantial safety and equity gaps despite strong frontier performance. Inter-rater reliability is modest at the final label level (kappa = 0.238) but strong for model ranking (rho = 0.916), supporting stable system-level comparison with expert oversight.
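Those two reliability figures live at different levels of aggregation: kappa measures agreement on individual response labels, while rho measures agreement on the model rankings those labels imply. Below is a minimal sketch of how each would be computed, using synthetic data and assuming scikit-learn and SciPy (the benchmark's actual tooling is not specified):

```python
# Synthetic illustration of the two reliability levels quoted above; this is
# not WHBench code, and the numbers are invented to mirror the pattern.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr

# Two raters' 3-point labels for the same responses
# (0 = Incorrect, 1 = Partially Correct, 2 = Correct).
rater_a = [2, 1, 2, 0, 1, 2, 1, 0]
rater_b = [2, 1, 1, 0, 2, 2, 1, 1]
print("label-level kappa:", cohen_kappa_score(rater_a, rater_b))

# Per-model mean scores from the two raters: even with imperfect label
# agreement, the implied model ranking can match almost exactly.
means_a = [72.1, 68.4, 65.0, 59.2, 51.7]
means_b = [70.8, 71.2, 64.2, 60.0, 50.3]
rho, _ = spearmanr(means_a, means_b)
print("ranking rho:", rho)  # 0.9 here: one adjacent swap among five models
```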

47 Clinical Questions · 10 Topics · 22 Models Evaluated · 23 Rubric Criteria

*To be published

Model Rankings

Last updated 03/08/2026
(Interactive leaderboard: Rank · Model · Score & CI)

Results & Analysis

Performance by Rubric Category

Mean category pass rate for the top 5 models across all runs.

Key finding: Safety pass rates exceed 90% across all top models, but Completeness (especially listing alternatives, with pass rates ranging from 5% to 68%) and Equity (under 42% on social determinants) remain consistent weak spots, even for the best performers.

Equity & Inclusivity Analysis

Social determinants (F18a) versus inclusive language (F18b) across all models.

Critical gap: Inclusive language pass rates average above 90%, but social determinants of health (addressing insurance barriers, cultural factors, health literacy) average just 24%. That 66-point gap persists across all tested models.

Scoring Rubric

Physician-aligned, criteria-level evaluation

The rubric combines detailed criterion scoring with quality controls to assess clinical safety, reasoning, completeness, communication, and equity in women's health scenarios.

8 Dimensions · 23 Criteria · Scoring Scale: Raw to Normalized %

Failure mode taxonomy: In addition to numeric scoring, each response is checked for high-risk error patterns (for example: missed urgency, incorrect treatment, dosage mistakes, factual errors, outdated guidance, or equity gaps). This gives a structured safety signal that complements the overall score.

Methodology

How scoring is applied, validated, and quality-checked.

Detailed Scoring
23 criteria across 8 dimensions produce a raw score from −25 to +69, normalized to a 0–100% scale.
Summary Scoring
Responses are classified on a 3-point scale: Correct (≥ 70%), Partially Correct (40–69%), or Incorrect (< 40%).
Scoring Process
An automated LLM judge applies the rubric and outputs structured JSON. Experts review a sample of judgments for calibration, and inter-rater reliability tracks agreement.
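As a concrete illustration of the scoring math above, here is a minimal sketch of how a raw rubric score maps to the normalized percentage and the 3-point summary label. This is not the benchmark's actual implementation; the function names are illustrative.

```python
# Minimal sketch of the detailed-to-summary scoring pipeline described above.
RAW_MIN, RAW_MAX = -25, 69  # raw score bounds from the 23 weighted criteria

def normalize(raw: float) -> float:
    """Map a raw rubric score onto the 0-100% scale."""
    return (raw - RAW_MIN) / (RAW_MAX - RAW_MIN) * 100

def summarize(pct: float) -> str:
    """Collapse the normalized score onto the 3-point summary scale."""
    if pct >= 70:
        return "Correct"
    if pct >= 40:
        return "Partially Correct"
    return "Incorrect"

raw = 42                    # example raw score from the judge
pct = normalize(raw)        # (42 + 25) / 94 * 100 ≈ 71.3
print(f"{pct:.1f}% -> {summarize(pct)}")  # 71.3% -> Correct
```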

Failure Mode Taxonomy

Eight categories of high-risk error patterns detected alongside scoring.

1 Outdated guidelines
2 Missed urgency
3 Dosage errors
4 Incorrect treatment
5 Missing information
6 Factual errors
7 Missed diagnosis
8 Health equity gaps
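A hypothetical data-structure sketch of how these flags might travel alongside the numeric score (the class and field names are invented for illustration; the eight categories come from the list above):

```python
from dataclasses import dataclass, field
from enum import Enum

class FailureMode(Enum):
    OUTDATED_GUIDELINES = 1
    MISSED_URGENCY = 2
    DOSAGE_ERROR = 3
    INCORRECT_TREATMENT = 4
    MISSING_INFORMATION = 5
    FACTUAL_ERROR = 6
    MISSED_DIAGNOSIS = 7
    HEALTH_EQUITY_GAP = 8

@dataclass
class ResponseEvaluation:
    normalized_score: float                           # 0-100% rubric score
    failure_modes: list[FailureMode] = field(default_factory=list)

    @property
    def high_risk(self) -> bool:
        # Any flagged pattern is a safety signal independent of the score.
        return bool(self.failure_modes)

ev = ResponseEvaluation(71.3, [FailureMode.MISSED_URGENCY])
print(ev.high_risk)  # True, even though the score alone reads as "Correct"
```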

Dimension Guide

Category-level view of all 8 rubric dimensions and their 23 criteria.

The reasoning and verification layer for vertical AI.

Building domain-specific data infrastructure to ensure AI systems reason correctly across critical industries.

Read the Paper* · Hugging Face · View Sample Data · Follow on LinkedIn