GENAI RELIABILITY FRAMEWORK
OSFI E-23 ALIGNED

Model Evaluation Leaderboard

Domain-grounded reliability benchmarks for regulated medical and financial workflows. All metrics include 95% bootstrapped confidence intervals (n=1,000 resamples). Grounding is verified deterministically against source documents before LLM-judge scoring.
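
As an illustration of the interval reported above, here is a minimal sketch of a percentile bootstrap over per-case pass/fail outcomes. The function name `bootstrap_ci`, the percentile method, and the sample data are assumptions made for this example; the framework's own resampling procedure may differ.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(
    outcomes: Sequence[int],
    n_resamples: int = 1000,
    level: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (accuracy, ci_lower, ci_upper) for 0/1 case outcomes.

    Hypothetical helper: uses the percentile bootstrap, which is an
    assumption here, not a confirmed detail of the framework.
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        # Resample the cases with replacement and record the accuracy.
        sample = rng.choices(outcomes, k=n)
        means.append(sum(sample) / n)
    means.sort()
    alpha = (1.0 - level) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return sum(outcomes) / n, lower, upper


# Illustrative data only (not taken from the leaderboard): 28 of 30 cases correct.
example_outcomes = [1] * 28 + [0] * 2
accuracy, ci_lower, ci_upper = bootstrap_ci(example_outcomes)
print(f"accuracy={accuracy:.3f} CI=[{ci_lower:.3f}, {ci_upper:.3f}]")
```

Fixing the seed keeps the reported interval stable across reruns, which matters when the interval feeds a pass/fail gate.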

Best accuracy: 93.5%, CI [89.2%, 97.0%]
Best model: GPT-4o Mini (finance)
Cost / call: $0.0002 (avg, generation only)
Grounding score: 100% (deterministic gate)
Cases evaluated: 120 (4 model runs)

RESULTS: ALL DOMAINS

4 runs
MODEL              DOMAIN   ACCURACY  95% CI          HALLUCINATION RATE  GROUNDING  COST/CALL  P95 LATENCY  CI GATE
GPT-4o Mini        medical  93.3%     [90.8%, 96.0%]  50.0%               100%       $0.0002    8.2s         ✓ PASS
GPT-4o Mini        finance  93.5%     [89.2%, 97.0%]  35.0%               100%       $0.0002    6.4s         ✓ PASS
Claude Sonnet 4.6  medical  90.7%     [82.7%, 97.3%]  33.3%               95%        $0.0072    11.1s        ✓ PASS
Claude Sonnet 4.6  medical  90.7%     [82.7%, 97.3%]  33.3%               95%        $0.0072    11.1s        ✓ PASS

COST–QUALITY PARETO FRONTIER

X: cost per 1K calls (USD) · Y: accuracy (%) · Pareto-optimal models highlighted
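
The Pareto-optimal set described above can be sketched as follows. The `Run` type and both function names are hypothetical, introduced only for this example. The inputs are the cost and accuracy values from the results table, with cost per call multiplied by 1,000. Comparing runs across domains on a single frontier is an assumption about how the chart is built.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    label: str
    cost_per_1k: float  # USD per 1,000 calls
    accuracy: float     # percent


def dominates(a: Run, b: Run) -> bool:
    """True if `a` is no worse than `b` on both axes and strictly better on one."""
    no_worse = a.cost_per_1k <= b.cost_per_1k and a.accuracy >= b.accuracy
    strictly_better = a.cost_per_1k < b.cost_per_1k or a.accuracy > b.accuracy
    return no_worse and strictly_better


def pareto_frontier(runs: List[Run]) -> List[Run]:
    """Keep only the runs that no other run dominates."""
    return [r for r in runs if not any(dominates(o, r) for o in runs)]


# Values from the results table; $0.0002/call = $0.20 per 1K calls.
runs = [
    Run("GPT-4o Mini (medical)", 0.20, 93.3),
    Run("GPT-4o Mini (finance)", 0.20, 93.5),
    Run("Claude Sonnet 4.6 (medical)", 7.20, 90.7),
]
for run in pareto_frontier(runs):
    print(run.label)
```

With these inputs only the GPT-4o Mini finance run is left on the frontier, since it matches or beats every other run on both cost and accuracy.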

OSFI E-23 — MODEL RISK MANAGEMENT ALIGNMENT

Canadian federally regulated financial institutions must treat generative AI outputs as model outputs under OSFI Guideline E-23 (in force May 2027). This framework addresses the core validation challenge: non-deterministic systems cannot be validated with deterministic tests alone. Bootstrapped confidence intervals with statistical significance gates, combined with deterministic grounding checks against source documents, provide the traceable, auditable validation evidence that OSFI E-23 requires.
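
To make the two validation layers concrete, the sketch below pairs a deterministic grounding check with a gate on the lower confidence bound. Everything in it is assumed for illustration: the verbatim-span rule, the function names `is_grounded` and `ci_gate`, and the 0.80 threshold are hypothetical and are not the framework's documented criteria.

```python
import re
from typing import Sequence


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so matching is deterministic."""
    return re.sub(r"\s+", " ", text).strip().lower()


def is_grounded(quoted_spans: Sequence[str], source_text: str) -> bool:
    """Hypothetical grounding rule: every quoted span must appear in the source.

    An answer that cites no spans is treated as ungrounded.
    """
    source = normalize(source_text)
    return bool(quoted_spans) and all(normalize(s) in source for s in quoted_spans)


def ci_gate(ci_lower: float, min_lower_bound: float = 0.80) -> bool:
    """Hypothetical gate: pass only if the CI lower bound clears the threshold.

    The 0.80 default is an assumed value, not one stated by the framework.
    """
    return ci_lower >= min_lower_bound


source = "In Q2, net income rose 4% year over year."
print(is_grounded(["Net income  rose 4%"], source))  # span found after normalization
print(ci_gate(0.827))                                # lower bound from the results table
```

Gating on the lower bound rather than the point estimate means a run with a wide interval can fail even when its headline accuracy looks high, and each decision can be logged and audited.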