Accountability surface
Three numbers govern how ContentRX evaluates its own calibration. They are kept separate on purpose — a single “accuracy score” would obscure the self-drift ceiling and misrepresent what the measurement can actually say. This follows Model Cards (Mitchell et al., 2019) guidance on honest metric reporting with intervals and disaggregation.
Built 2026-04-25. Snapshot from 2026-04-24.
Measured system κ
System vs Robo’s held-out golden verdicts
pending
no standards have completed the weekly kappa series
Measured self-drift κ
Robo vs past-Robo (quarterly blind re-label)
pending
Session 7 drift panel awaiting blind re-label + score
Design target κ
A design assumption, not a measurement
0.900
Design assumption · stated separately from measurements
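For readers unfamiliar with the statistic, here is a minimal sketch of how a Cohen's κ between two sets of verdicts could be computed. It is illustrative only and not ContentRX's scoring code: the function name `cohen_kappa` and the `"pass"`/`"fail"` verdict labels are assumptions. It returns `None` instead of zero when there is nothing to score, mirroring the "pending" cells on this page.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> Optional[float]:
    """Cohen's kappa for two raters labelling the same items.

    Returns None (shown as "pending") when no score can be computed,
    so an empty series is never mistaken for a measured zero.
    """
    if len(a) != len(b):
        raise ValueError("both raters must label the same items")
    n = len(a)
    if n == 0:
        return None

    # Observed agreement: fraction of items where the two raters match.
    observed = sum(x == y for x, y in zip(a, b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)

    if expected == 1.0:
        return None  # kappa is undefined when both raters use one identical label
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts: system output vs held-out golden labels.
system = ["pass", "pass", "fail", "fail"]
golden = ["pass", "fail", "fail", "fail"]
kappa = cohen_kappa(system, golden)
```

The same function would serve for the self-drift number by passing the current labels and the blind re-labels of the same items.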
Every standard starts at robo_labels. It graduates when (a) its measured weekly κ stays above the threshold derived from the self-drift ceiling and (b) enough novel counterparts have been seen across moments and content types. Thresholds adjust automatically when the ceiling re-measures; see the graduation dashboard for the mechanics.
Levels: robo_labels · batch_approval · autonomous
Thresholds: autonomous κ ≥ 0.846 · batch_approval κ ≥ 0.747
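The graduation rule can be sketched roughly as follows. This is a hedged illustration, not the dashboard's actual logic: the function `graduation_level` and the four-week window `MIN_WEEKS` are assumptions (the page does not say how long κ must stay above the threshold), and the two thresholds are copied from the legend as fixed constants even though the real values adjust when the ceiling re-measures.

```python
from typing import Optional, Sequence

AUTONOMOUS_KAPPA = 0.846      # from the legend above
BATCH_APPROVAL_KAPPA = 0.747  # from the legend above
MIN_COUNTERPARTS = 12         # counterpart minimum stated elsewhere on this page
MIN_WEEKS = 4                 # ASSUMPTION: window length is not given in the source


def graduation_level(
    weekly_kappa: Sequence[Optional[float]],
    counterparts: int,
    min_weeks: int = MIN_WEEKS,
) -> str:
    """Return the level a standard qualifies for under the sketched rule."""
    recent = list(weekly_kappa)[-min_weeks:]

    # A missing or pending week is never treated as zero or as a pass.
    if len(recent) < min_weeks or any(k is None for k in recent):
        return "robo_labels"

    # Condition (b): enough novel counterparts have been seen.
    if counterparts < MIN_COUNTERPARTS:
        return "robo_labels"

    # Condition (a): kappa must stay at or above the threshold all window long,
    # so the lowest week in the window decides the level.
    floor = min(recent)
    if floor >= AUTONOMOUS_KAPPA:
        return "autonomous"
    if floor >= BATCH_APPROVAL_KAPPA:
        return "batch_approval"
    return "robo_labels"


# Hypothetical series: one weaker week caps the standard at batch_approval.
level = graduation_level([0.90, 0.80, 0.90, 0.90], counterparts=12)
```

Using the minimum over the window is one reading of "stays above"; a rule based on a confidence-interval lower bound would be an equally plausible alternative.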
43 standards tracked. Cells show the per-standard κ alongside a sparkline of the last weekly measurements. “Pending” means the weekly κ series hasn't been populated yet — never zero, never filled from the design target.
| Standard | Level | κ (95% CI) | n | Trend |
|---|---|---|---|---|
| ACC-01 | robo_labels | pending | — | |
| ACC-02 | robo_labels | pending | — | |
| ACC-05 | robo_labels | pending | — | |
| ACC-07 | robo_labels | pending | — | |
| ACT-01 | robo_labels | pending | — | |
| ACT-02 | robo_labels | pending | — | |
| ACT-03 | robo_labels | pending | — | |
| ACT-04 | robo_labels | pending | — | |
| CLR-01 | robo_labels | pending | — | |
| CLR-02 | robo_labels | pending | — | |
| CLR-03 | robo_labels | pending | — | |
| CLR-04 | robo_labels | pending | — | |
| CLR-05 | robo_labels | pending | — | |
| CON-01 | robo_labels | pending | — | |
| CON-02 | robo_labels | pending | — | |
| CON-03 | robo_labels | pending | — | |
| CON-04 | robo_labels | pending | — | |
| GRM-01 | robo_labels | pending | — | |
| GRM-02 | robo_labels | pending | — | |
| GRM-03 | robo_labels | pending | — | |
| GRM-04 | robo_labels | pending | — | |
| GRM-05 | robo_labels | pending | — | |
| GRM-06 | robo_labels | pending | — | |
| PRF-01 | robo_labels | pending | — | |
| PRF-03 | robo_labels | pending | — | |
| PRF-04 | robo_labels | pending | — | |
| PRF-07 | robo_labels | pending | — | |
| PRF-09 | robo_labels | pending | — | |
| PRF-10 | robo_labels | pending | — | |
| PRF-11 | robo_labels | pending | — | |
| STR-01 | robo_labels | pending | — | |
| STR-03 | robo_labels | pending | — | |
| STR-04 | robo_labels | pending | — | |
| TRN-01 | robo_labels | pending | — | |
| TRN-02 | robo_labels | pending | — | |
| TRN-04 | robo_labels | pending | — | |
| TRN-06 | robo_labels | pending | — | |
| TRN-07 | robo_labels | pending | — | |
| VT-01 | robo_labels | pending | — | |
| VT-02 | robo_labels | pending | — | |
| VT-03 | robo_labels | pending | — | |
| VT-04 | robo_labels | pending | — | |
| VT-05 | robo_labels | pending | — | |
Things ContentRX doesn't reliably catch yet, or reporting choices that might otherwise look like bugs.
Composite accuracy score not reported
The three kappa numbers are kept separate by design. Combining them into one headline number would obscure the self-drift ceiling and misrepresent the measurement.
Known since 2026-04-23
Pre-measurement: weekly kappa series not yet populated
The Session 7 quarterly panel exists (evals/drift/panels) but the blind re-label pass hasn't been scored yet. Per-standard kappa cells will be populated as annotations land. Cells show 'pending' until then — never zero, never a guess.
Known since 2026-04-23
Novel-counterpart coverage is uneven
Every standard needs ≥12 counterpart cases (within-moment, cross-content-type, cross-moment) before it can graduate past robo_labels. Several standards still carry 'no counterparts provided' in the readiness report. For those standards, graduation is blocked on counterpart acquisition, not on the underlying κ.
Prevalence-driven MCC supplementation
Standards with observed prevalence below 5% trigger MCC (Matthews correlation) supplementation so imbalanced labels don't inflate kappa. The threshold is reported per-standard when it fires.
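A minimal sketch of the prevalence check and the MCC computation, under stated assumptions: labels are binary, "prevalence" means the share of positive labels in the reference set, and the names `needs_mcc` and `mcc` are hypothetical. Only the 5% cut-off comes from the text above.

```python
import math
from typing import Optional, Sequence

PREVALENCE_FLOOR = 0.05  # from the text: supplementation fires below 5%


def needs_mcc(reference: Sequence[bool]) -> bool:
    """True when the positive class is rare enough to trigger MCC supplementation."""
    if not reference:
        return False
    prevalence = sum(1 for r in reference if r) / len(reference)
    return prevalence < PREVALENCE_FLOOR


def mcc(predicted: Sequence[bool], reference: Sequence[bool]) -> Optional[float]:
    """Matthews correlation coefficient for binary labels; None if undefined."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    tp = sum(1 for p, r in zip(predicted, reference) if p and r)
    tn = sum(1 for p, r in zip(predicted, reference) if not p and not r)
    fp = sum(1 for p, r in zip(predicted, reference) if p and not r)
    fn = sum(1 for p, r in zip(predicted, reference) if not p and r)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return None  # a confusion-matrix margin is empty, so MCC is undefined
    return (tp * tn - fp * fn) / denom


# Hypothetical rare-positive standard: 1 positive in 40 reference labels (2.5%).
reference = [True] + [False] * 39
predicted = list(reference)
supplement = mcc(predicted, reference) if needs_mcc(reference) else None
```

Returning `None` for the undefined case keeps the behaviour consistent with the "pending, never zero" convention used for κ.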
No standards have populated kappa series yet. The review queue is in its seeding phase — Robo is annotating the industry corpus.
Current phase: early