Large language models are increasingly consulted for medical information, yet no widely adopted benchmark evaluates their performance on women's health. WHBench introduces 47 expert-crafted clinical scenarios across 10 topics and evaluates 22 models with a 23-criterion safety-weighted rubric. Across 3,100 scored responses, no model mean exceeds 75 percent (top: 72.1 percent), with substantial safety and equity gaps despite strong frontier performance. Inter-rater reliability is modest at the final label level (kappa = 0.238) but strong for model ranking (rho = 0.916), supporting stable system-level comparison with expert oversight.
*To be published
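The contrast between label-level agreement (kappa) and ranking agreement (rho) in the abstract can be illustrated on toy data. The sketch below is a minimal illustration, assuming Cohen's kappa for two raters and Spearman's rho over per-model mean scores. The source does not state which kappa variant was used, and every label, score, and variable name here is hypothetical.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two raters' categorical labels on the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(rx, ry))
    sx = sum((p - mx) ** 2 for p in rx) ** 0.5
    sy = sum((q - my) ** 2 for q in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical pass/fail labels from two raters on the same eight responses.
rater_a = ["P", "P", "P", "F", "F", "P", "F", "P"]
rater_b = ["P", "F", "P", "F", "P", "P", "F", "F"]

# Hypothetical per-model mean scores derived from each rater's labels.
means_a = [70, 65, 60, 55, 50]
means_b = [62, 60, 51, 52, 40]

print("kappa:", cohens_kappa(rater_a, rater_b))
print("rho:  ", spearman_rho(means_a, means_b))
```

In this toy case the raters disagree on three of eight individual labels, yet they order the models almost identically. That is the pattern the abstract describes: noisy item-level labels can still yield a stable system-level ranking.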
Mean category pass rate for the top 5 models across all runs.
Key finding: Safety pass rates exceed 90% across all top models, but Completeness (especially listing alternatives, at 5 to 68%) and Equity (under 42% on social determinants) remain consistent weak spots, even for the best performers.
Social determinants (F18a) versus inclusive language (F18b) across all models.
Critical gap: Inclusive language pass rates average above 90%, but social determinants of health (addressing insurance barriers, cultural factors, health literacy) average just 24%. That gap of roughly 66 points persists across all tested models.
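A per-criterion pass rate and the gap between two criteria can be computed as in the sketch below. The criterion IDs F18a and F18b come from the figure caption. The record layout, the function names, and the four example rows are assumptions made for illustration and are not benchmark data.

```python
def pass_rate(results, criterion_id):
    """Percentage of scored responses that passed the given criterion."""
    labels = [r[criterion_id] for r in results if criterion_id in r]
    if not labels:
        raise ValueError(f"no responses scored on {criterion_id}")
    return 100.0 * sum(labels) / len(labels)


def criterion_gap(results, high_id, low_id):
    """Percentage-point difference between two criteria's pass rates."""
    return pass_rate(results, high_id) - pass_rate(results, low_id)


# Hypothetical scored responses: True means the criterion was met.
toy_results = [
    {"F18a": True, "F18b": True},
    {"F18a": False, "F18b": True},
    {"F18a": False, "F18b": True},
    {"F18a": False, "F18b": True},
]

print(pass_rate(toy_results, "F18a"))              # social determinants
print(pass_rate(toy_results, "F18b"))              # inclusive language
print(criterion_gap(toy_results, "F18b", "F18a"))  # gap in percentage points
```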
The rubric combines detailed criterion scoring with quality controls to assess clinical safety, reasoning, completeness, communication, and equity in women's health scenarios.
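One plausible reading of "safety-weighted" is that safety criteria count for more than other criteria when pass/fail labels are combined into a percentage. The sketch below shows that idea only. The actual WHBench weights and aggregation rule are not given in this section, so the `Criterion` type, the criterion IDs, and the weights are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    cid: str        # hypothetical criterion identifier
    category: str   # e.g. "safety", "completeness", "equity"
    weight: float   # assumed weight; safety criteria weighted higher


def weighted_score(criteria, passed):
    """Weighted percentage of criteria met; missing labels count as failed."""
    total = sum(c.weight for c in criteria)
    earned = sum(c.weight for c in criteria if passed.get(c.cid, False))
    return 100.0 * earned / total


# Hypothetical three-criterion rubric (the real rubric has 23 criteria).
toy_rubric = [
    Criterion("S1", "safety", 3.0),
    Criterion("C1", "completeness", 1.0),
    Criterion("E1", "equity", 1.0),
]

# One response that meets the safety and equity criteria but not completeness.
toy_labels = {"S1": True, "C1": False, "E1": True}

print(weighted_score(toy_rubric, toy_labels))
```

Under these assumed weights, failing the single safety criterion would cost three times as much as failing either of the others, which is the effect a safety-weighted rubric is meant to have.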
Failure mode taxonomy: In addition to numeric scoring, each response is checked for high-risk error patterns (for example: missed urgency, incorrect treatment, dosage mistakes, factual errors, outdated guidance, or equity gaps). This gives a structured safety signal that complements the overall score.
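The taxonomy can be pictured as a set of flags stored next to each response's numeric score. The sketch below is an assumption about how such records might be structured, not the benchmark's own schema. It includes only the six failure modes named above, and the class and field names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class FailureMode(Enum):
    # Only the six patterns named in the text; others are not listed here.
    MISSED_URGENCY = "missed urgency"
    INCORRECT_TREATMENT = "incorrect treatment"
    DOSAGE_MISTAKE = "dosage mistake"
    FACTUAL_ERROR = "factual error"
    OUTDATED_GUIDANCE = "outdated guidance"
    EQUITY_GAP = "equity gap"


@dataclass
class ScoredResponse:
    model: str                      # hypothetical model name
    score: float                    # overall rubric score, 0 to 100
    failure_modes: set = field(default_factory=set)


def failure_counts(responses):
    """Count how often each failure mode was flagged across responses."""
    counts = Counter()
    for response in responses:
        counts.update(response.failure_modes)
    return counts


# Hypothetical scored responses.
toy_responses = [
    ScoredResponse("model-a", 71.0, {FailureMode.EQUITY_GAP}),
    ScoredResponse("model-a", 64.0, {FailureMode.EQUITY_GAP, FailureMode.MISSED_URGENCY}),
    ScoredResponse("model-b", 80.0),
]

for mode, count in failure_counts(toy_responses).most_common():
    print(mode.value, count)
```

Keeping the flags separate from the score means a response with a high numeric score can still be surfaced for review if it triggers a high-risk pattern.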
How scoring is applied, validated, and quality-checked.
Eight categories of high-risk error patterns detected alongside scoring.
Category-level view of all 8 rubric dimensions and their 23 criteria.