Model Evaluation Leaderboard
Domain-grounded reliability benchmarks for regulated medical and financial workflows. All metrics are reported with 95% bootstrapped confidence intervals (1,000 resamples). Grounding is verified deterministically against source documents before LLM-judge scoring.
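To illustrate how a 95% bootstrapped interval over 1,000 resamples can be computed, here is a minimal sketch using the percentile method on per-item pass/fail outcomes. The function name `bootstrap_ci`, the percentile method, and the fixed seed are assumptions made for illustration; the leaderboard's actual procedure is not documented here.

```python
import random


def bootstrap_ci(outcomes, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-item scores (e.g. 0/1).

    Hypothetical helper: returns (point_estimate, lower, upper).
    """
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(outcomes)

    # Resample the items with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )

    # Pick the alpha/2 and 1 - alpha/2 percentiles of the resampled means.
    lo_idx = max(0, round(alpha / 2 * n_resamples))
    hi_idx = min(n_resamples - 1, round((1 - alpha / 2) * n_resamples) - 1)
    return sum(outcomes) / n, means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    # Illustrative data only: 170 correct answers out of 200 items.
    point, lower, upper = bootstrap_ci([1] * 170 + [0] * 30)
    print(f"accuracy={point:.3f}  95% CI=[{lower:.3f}, {upper:.3f}]")
```

Fixing the seed keeps the reported interval stable across reruns, which helps when the numbers serve as audit evidence.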
RESULTS — ALL DOMAIN
4 runs
COST–QUALITY PARETO FRONTIER
X: cost per 1K calls (USD) · Y: accuracy (%) · Pareto-optimal models highlighted
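As a rough sketch of how the Pareto-optimal models on such a chart can be identified: a run is on the frontier if no other run is at least as cheap and at least as accurate, and strictly better on one of the two. The `Run` type, its field names, and the sample values below are hypothetical and are not leaderboard data.

```python
from typing import NamedTuple


class Run(NamedTuple):
    name: str
    cost_per_1k_usd: float  # X axis: cost per 1K calls (USD)
    accuracy_pct: float     # Y axis: accuracy (%)


def pareto_frontier(runs):
    """Return the non-dominated runs, ordered from cheapest to most expensive."""
    frontier = []
    best_accuracy = float("-inf")
    # Sweep from cheapest to most expensive; on cost ties, try the most
    # accurate run first so it dominates its equally priced peers.
    for run in sorted(runs, key=lambda r: (r.cost_per_1k_usd, -r.accuracy_pct)):
        # Keep a run only if it beats every cheaper-or-equal run on accuracy.
        # Exact duplicates (same cost and accuracy) are kept once.
        if run.accuracy_pct > best_accuracy:
            frontier.append(run)
            best_accuracy = run.accuracy_pct
    return frontier


if __name__ == "__main__":
    # Made-up values for illustration only.
    runs = [
        Run("model_a", 1.0, 80.0),
        Run("model_b", 2.0, 78.0),  # dominated by model_a
        Run("model_c", 3.0, 90.0),
        Run("model_d", 0.5, 70.0),
    ]
    print([r.name for r in pareto_frontier(runs)])
```

This sketch compares point estimates only; a stricter variant might also account for overlapping confidence intervals before declaring one model dominated.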
Canadian federally regulated financial institutions must treat generative AI outputs as model outputs under OSFI Guideline E-23 (in force May 2027). This framework addresses the core validation challenge: non-deterministic systems cannot be validated with deterministic tests alone, because repeated runs on the same input can produce different outputs. Bootstrapped confidence intervals with statistical significance gates, combined with deterministic, mechanistic grounding checks, provide the traceable, auditable validation evidence that OSFI E-23 requires.
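One way a statistical significance gate of this kind could work is a paired bootstrap on the per-item score difference between a candidate and a baseline, passing the gate only when the 95% interval for the difference excludes zero. The sketch below assumes that design; the function name `paired_bootstrap_gate`, the returned fields, and the pass criterion are illustrative assumptions, not the framework's documented rule.

```python
import random


def paired_bootstrap_gate(baseline, candidate, n_resamples=1000,
                          alpha=0.05, seed=0):
    """Paired bootstrap on per-item score differences (candidate - baseline).

    Hypothetical gate: 'significant' is True when the CI excludes zero.
    Both lists must score the same evaluation items in the same order.
    """
    if not baseline or len(baseline) != len(candidate):
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed for a reproducible audit trail
    n = len(baseline)
    diffs = [c - b for b, c in zip(baseline, candidate)]

    # Resample the paired differences with replacement.
    boot = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )
    lower = boot[max(0, round(alpha / 2 * n_resamples))]
    upper = boot[min(n_resamples - 1,
                     round((1 - alpha / 2) * n_resamples) - 1)]
    return {
        "delta": sum(diffs) / n,
        "ci": (lower, upper),
        "significant": lower > 0 or upper < 0,
    }


if __name__ == "__main__":
    # Illustrative 0/1 scores on the same 10 items; not real results.
    base = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
    cand = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
    print(paired_bootstrap_gate(base, cand))
```

Pairing the scores by item removes item-difficulty variance from the comparison, which generally tightens the interval relative to comparing two independent intervals.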