Model Evaluation Leaderboard

Domain-grounded reliability benchmarks for regulated medical and financial workflows. All metrics include 95% bootstrapped confidence intervals (1,000 resamples). Grounding is verified deterministically against source documents before LLM-judge scoring.
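As a rough illustration of how a 95% bootstrapped confidence interval on accuracy can be computed, here is a minimal percentile-bootstrap sketch. The function name `bootstrap_accuracy_ci`, the fixed seed, and the example outcomes are assumptions made for illustration; the leaderboard does not state which bootstrap variant it uses, so this should not be read as its actual implementation.

```python
import random


def bootstrap_accuracy_ci(outcomes, n_resamples=1000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for accuracy.

    outcomes: list of 0/1 values, one per evaluated case (1 = correct).
    Returns (point_estimate, lower_bound, upper_bound) as fractions in [0, 1].
    """
    if not outcomes:
        raise ValueError("outcomes must not be empty")

    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(outcomes)

    # Resample the cases with replacement and record each resample's accuracy.
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(outcomes, k=n)
        means.append(sum(sample) / n)
    means.sort()

    # Take the empirical quantiles of the resampled accuracies.
    alpha = 1.0 - confidence
    lower_idx = int((alpha / 2) * n_resamples)
    upper_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))

    point = sum(outcomes) / n
    return point, means[lower_idx], means[upper_idx]


# Hypothetical data: 28 correct answers out of 30 cases.
example_outcomes = [1] * 28 + [0] * 2
point, lower, upper = bootstrap_accuracy_ci(example_outcomes)
print(f"accuracy={point:.1%}, 95% CI=[{lower:.1%}, {upper:.1%}]")
```

Fixing the seed keeps the reported interval stable from run to run, which helps when the interval itself is used as audit evidence.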

Best accuracy: 93.5% (95% CI [89.2%, 97.0%])

Best model: GPT-4o Mini (finance)

Cost / call: $0.0002 (average, generation only)

Grounding score: 100% (deterministic gate)

Cases evaluated: 120 (4 model runs)

RESULTS: ALL DOMAINS

4 runs

MODEL              DOMAIN   ACCURACY (95% CI)    HALLUCINATION RATE  GROUNDING  COST/CALL  P95 LATENCY  CI GATE
GPT-4o Mini        medical  93.3% [90.8–96.0]    50.0%               100%       $0.0002    8.2s         ✓ PASS
GPT-4o Mini        finance  93.5% [89.2–97.0]    35.0%               100%       $0.0002    6.4s         ✓ PASS
Claude Sonnet 4.6  medical  90.7% [82.7–97.3]    33.3%               95%        $0.0072    11.1s        ✓ PASS
Claude Sonnet 4.6  medical  90.7% [82.7–97.3]    33.3%               95%        $0.0072    11.1s        ✓ PASS

COST–QUALITY PARETO FRONTIER

X: cost per 1K calls (USD) · Y: accuracy (%) · Pareto-optimal models highlighted
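To make the chart's axes concrete, the sketch below computes a cost-quality Pareto frontier from the three distinct rows in the results table, using cost per 1K calls (cost per call times 1,000) and accuracy. The function name `pareto_frontier` and the choice to pool rows across domains are assumptions for illustration; the dashboard may well compute the frontier separately for each domain.

```python
def pareto_frontier(points):
    """Return the points that no other point dominates.

    points: list of (label, cost_per_1k_calls_usd, accuracy_pct).
    A point is dominated when another point is at least as cheap and at least
    as accurate, and strictly better on at least one of the two axes.
    """
    frontier = []
    for label, cost, acc in points:
        dominated = any(
            (c <= cost and a >= acc) and (c < cost or a > acc)
            for _, c, a in points
        )
        if not dominated:
            frontier.append((label, cost, acc))
    return frontier


# Values from the results table; cost per 1K calls = cost per call x 1,000.
runs = [
    ("GPT-4o Mini / medical", 0.2, 93.3),        # $0.0002 per call
    ("GPT-4o Mini / finance", 0.2, 93.5),        # $0.0002 per call
    ("Claude Sonnet 4.6 / medical", 7.2, 90.7),  # $0.0072 per call
]

for label, cost, acc in pareto_frontier(runs):
    print(f"{label}: ${cost:.2f} per 1K calls, {acc:.1f}% accuracy")
```

The pairwise comparison is quadratic in the number of runs, which is fine for a leaderboard with a handful of rows.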

OSFI E-23 — MODEL RISK MANAGEMENT ALIGNMENT

Canadian federally regulated financial institutions must treat generative AI outputs as model outputs under OSFI Guideline E-23 (in force May 2027). This framework addresses the core validation challenge: non-deterministic systems cannot be validated with deterministic tests alone. Bootstrapped confidence intervals with statistical significance gates, combined with deterministic grounding checks against source documents, provide the traceable, auditable validation evidence that OSFI E-23 requires.
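One way to picture a gate that combines a confidence interval with a deterministic grounding check is sketched below. The rule itself is an assumption: the source does not state how the CI gate is decided. The sketch passes a run only when the lower bound of its accuracy interval and its grounding score both clear a floor. The names `RunResult` and `ci_gate` and both threshold values (80% and 95%) are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    model: str
    domain: str
    ci_lower: float   # lower bound of the 95% accuracy CI, in percent
    ci_upper: float   # upper bound of the 95% accuracy CI, in percent
    grounding: float  # deterministic grounding score, in percent


def ci_gate(run, min_accuracy=80.0, min_grounding=95.0):
    """Hypothetical gate: judge the run on the pessimistic end of its CI.

    Both thresholds are illustrative assumptions, not values from the source.
    """
    return run.ci_lower >= min_accuracy and run.grounding >= min_grounding


# Interval bounds and grounding scores taken from the results table.
runs = [
    RunResult("GPT-4o Mini", "medical", 90.8, 96.0, 100.0),
    RunResult("GPT-4o Mini", "finance", 89.2, 97.0, 100.0),
    RunResult("Claude Sonnet 4.6", "medical", 82.7, 97.3, 95.0),
]

for run in runs:
    status = "PASS" if ci_gate(run) else "FAIL"
    print(f"{run.model} ({run.domain}): {status}")
```

Gating on the lower bound rather than the point estimate means a run with a high score but a wide interval can still fail, which is the behaviour an auditor would usually want from a non-deterministic system.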