Accountability surface

Accuracy, reported honestly

Three numbers govern how ContentRX evaluates its own calibration. They are kept separate on purpose: a single “accuracy score” would obscure the self-drift ceiling and misrepresent what the measurement can actually say. This follows the Model Cards guidance (Mitchell et al., 2019) on honest metric reporting with intervals and disaggregation.

Built 2026-04-25. Snapshot from 2026-04-24.

Measured system κ

System vs Robo’s held-out golden verdicts

pending

no standards have completed the weekly kappa series

Measured self-drift κ

Robo vs past-Robo (quarterly blind re-label)

pending

Session 7 drift panel awaiting blind re-label + score

Design target κ

A design assumption, not a measurement

0.900

Design assumption · stated separately from measurements

Graduation ladder

Every standard starts at robo_labels. It graduates when (a) its measured weekly κ stays above the threshold derived from the self-drift ceiling and (b) enough novel counterparts have been seen across moments and content types. Thresholds adjust automatically when the ceiling re-measures; see the graduation dashboard for the mechanics.

robo_labels: 43
batch_approval: 0
autonomous: 0

autonomous threshold κ ≥ 0.846 · batch_approval threshold κ ≥ 0.747
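
The ladder rule can be sketched in a few lines. This is a minimal illustration, not ContentRX's implementation: the function name `graduation_level` is hypothetical, the two thresholds are the values displayed on this page (they change when the ceiling re-measures), the counterpart minimum of 12 comes from the coverage note under Known failure modes, and "stays above" is read here as "every weekly value in the series meets the threshold", which is an assumption.

```python
from typing import Sequence

# Thresholds as displayed on this page; they adjust when the
# self-drift ceiling is re-measured, so treat them as a snapshot.
AUTONOMOUS_KAPPA = 0.846
BATCH_APPROVAL_KAPPA = 0.747
# Minimum counterpart cases, taken from the coverage note on this page.
MIN_COUNTERPARTS = 12


def graduation_level(weekly_kappas: Sequence[float], counterparts: int) -> str:
    """Return the ladder level for one standard (hypothetical helper).

    Assumption: "stays above the threshold" means the lowest weekly
    kappa in the series still meets the threshold.
    """
    # A pending (empty) series or thin counterpart coverage keeps the
    # standard at the starting level, regardless of kappa.
    if not weekly_kappas or counterparts < MIN_COUNTERPARTS:
        return "robo_labels"
    floor = min(weekly_kappas)
    if floor >= AUTONOMOUS_KAPPA:
        return "autonomous"
    if floor >= BATCH_APPROVAL_KAPPA:
        return "batch_approval"
    return "robo_labels"
```

Under this reading, a standard with an empty series stays at robo_labels no matter how many counterparts it has, which matches the current 43 / 0 / 0 split.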

Per-standard measurements

43 standards tracked. Each cell shows the per-standard κ alongside a sparkline of the most recent weekly measurements. “Pending” means the weekly κ series hasn't been populated yet; it is never shown as zero and never filled from the design target.

Standard	Level	κ (95% CI)	n
ACC-01	robo_labels	pending	—
ACC-02	robo_labels	pending	—
ACC-05	robo_labels	pending	—
ACC-07	robo_labels	pending	—
ACT-01	robo_labels	pending	—
ACT-02	robo_labels	pending	—
ACT-03	robo_labels	pending	—
ACT-04	robo_labels	pending	—
CLR-01	robo_labels	pending	—
CLR-02	robo_labels	pending	—
CLR-03	robo_labels	pending	—
CLR-04	robo_labels	pending	—
CLR-05	robo_labels	pending	—
CON-01	robo_labels	pending	—
CON-02	robo_labels	pending	—
CON-03	robo_labels	pending	—
CON-04	robo_labels	pending	—
GRM-01	robo_labels	pending	—
GRM-02	robo_labels	pending	—
GRM-03	robo_labels	pending	—
GRM-04	robo_labels	pending	—
GRM-05	robo_labels	pending	—
GRM-06	robo_labels	pending	—
PRF-01	robo_labels	pending	—
PRF-03	robo_labels	pending	—
PRF-04	robo_labels	pending	—
PRF-07	robo_labels	pending	—
PRF-09	robo_labels	pending	—
PRF-10	robo_labels	pending	—
PRF-11	robo_labels	pending	—
STR-01	robo_labels	pending	—
STR-03	robo_labels	pending	—
STR-04	robo_labels	pending	—
TRN-01	robo_labels	pending	—
TRN-02	robo_labels	pending	—
TRN-04	robo_labels	pending	—
TRN-06	robo_labels	pending	—
TRN-07	robo_labels	pending	—
VT-01	robo_labels	pending	—
VT-02	robo_labels	pending	—
VT-03	robo_labels	pending	—
VT-04	robo_labels	pending	—
VT-05	robo_labels	pending	—
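
One way a κ (95% CI) cell like those in the table above could be produced is Cohen's kappa with a percentile bootstrap interval. The sketch below is an assumption about method, not a description of ContentRX's pipeline: the page does not say how the interval is computed, and the names `cohen_kappa`, `kappa_with_ci`, and `format_cell` are hypothetical. It returns `None` rather than a number when there is nothing to measure, which maps onto the "pending" display.

```python
import random
from collections import Counter
from typing import Optional, Sequence, Tuple


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> Optional[float]:
    """Cohen's kappa for two equal-length label sequences; None if undefined."""
    n = len(a)
    if n == 0 or n != len(b):
        return None
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return None  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def kappa_with_ci(
    a: Sequence[str], b: Sequence[str], n_boot: int = 2000, seed: int = 0
) -> Optional[Tuple[float, float, float]]:
    """Point estimate plus a 95% percentile-bootstrap interval (a sketch)."""
    point = cohen_kappa(a, b)
    if point is None:
        return None
    rng = random.Random(seed)
    n = len(a)
    samples = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        k = cohen_kappa([a[i] for i in idx], [b[i] for i in idx])
        if k is not None:  # skip degenerate resamples
            samples.append(k)
    if not samples:
        return None
    samples.sort()
    m = len(samples)
    return point, samples[int(0.025 * (m - 1))], samples[int(0.975 * (m - 1))]


def format_cell(result: Optional[Tuple[float, float, float]]) -> str:
    """Render a table cell; a missing measurement is 'pending', never 0."""
    if result is None:
        return "pending"
    point, lo, hi = result
    return f"{point:.3f} ({lo:.3f}, {hi:.3f})"
```

Returning `None` instead of `0.0` for an empty series keeps "not yet measured" distinct from "measured as chance-level agreement".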

Known failure modes

Things ContentRX doesn't reliably catch yet, or reporting choices that might otherwise look like bugs.

Composite accuracy score not reported
The three kappa numbers are kept separate by design. Combining them into one headline number would obscure the self-drift ceiling and misrepresent the measurement.
Known since 2026-04-23

Pre-measurement: weekly kappa series not yet populated
The Session 7 quarterly panel exists (evals/drift/panels), but the blind re-label pass hasn't been scored yet. Per-standard kappa cells will be populated as annotations land. Until then, cells show 'pending', never zero and never a guess.
Known since 2026-04-23

Novel-counterpart coverage is uneven
Every standard needs ≥12 counterpart cases (within-moment, cross-content-type, cross-moment) before it can graduate past robo_labels. Several standards still carry 'no counterparts provided' in the readiness report. For those standards, graduation is blocked on counterpart acquisition, not on the underlying κ.

Prevalence-driven MCC supplementation
Standards with observed prevalence below 5% trigger MCC (Matthews correlation coefficient) supplementation so imbalanced labels don't distort kappa. The threshold is reported per-standard when it fires.
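
The MCC supplementation rule can be illustrated with binary labels. This is a sketch under stated assumptions: "prevalence" is taken to mean the share of positive labels in the reference set, the 5% floor is the figure quoted above, and the helper names `needs_mcc` and `mcc` are hypothetical rather than ContentRX identifiers.

```python
import math
from typing import Optional, Sequence

# From the text above: prevalence below 5% triggers MCC supplementation.
PREVALENCE_FLOOR = 0.05


def needs_mcc(truth: Sequence[bool]) -> bool:
    """True when the positive class is rare enough to warrant MCC."""
    if not truth:
        return False
    return sum(truth) / len(truth) < PREVALENCE_FLOOR


def mcc(truth: Sequence[bool], pred: Sequence[bool]) -> Optional[float]:
    """Matthews correlation coefficient for binary labels; None if undefined."""
    tp = sum(t and p for t, p in zip(truth, pred))
    tn = sum((not t) and (not p) for t, p in zip(truth, pred))
    fp = sum((not t) and p for t, p in zip(truth, pred))
    fn = sum(t and (not p) for t, p in zip(truth, pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return None  # a row or column of the confusion matrix is empty
    return (tp * tn - fp * fn) / denom
```

For example, with 2 positives in 100 reference labels, `needs_mcc` returns `True`, and `mcc` would then be reported next to kappa for that standard rather than replacing it.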

Review queue phase

No standards have populated kappa series yet. The review queue is in its seeding phase — Robo is annotating the industry corpus.

Current phase: early