Files
open-design/specs/current/critique-theater.md
T
Zakaria a46764fb1b
ci / Validate workspace (push) Has been cancelled
landing-page-ci / Validate landing page (push) Has been cancelled
landing-page-deploy / Deploy landing page (push) Has been cancelled
github-metrics / Generate repository metrics SVG (push) Has been cancelled
refresh-contributors-wall / Refresh contributors wall cache bust (push) Waiting to run
first-commit
2026-05-04 14:58:14 -04:00

39 KiB
Raw Blame History

Critique Theater

Naming

  • Internal codename (engineering, code, prompts, env vars, modules, SSE event prefix, telemetry): Critique Theater. Used in apps/daemon/src/critique/, apps/web/src/components/Theater/, OD_CRITIQUE_*, critique.* SSE events.
  • User-facing label (UI, settings, docs, README, marketing copy): Design Jury. The string is sourced from a single i18n key critiqueTheater.userFacingName so the label can be swapped without touching code.

The split is deliberate. Engineers reason about the system; users hire a jury.

Purpose

Critique Theater turns Open Design's single-pass artifact generation into a panel-tempered, scored, replayable process. Every artifact is born through a visible five-person Design Jury (Designer, Critic, Brand, A11y, Copy) running inside one CLI session, with auto-converging rounds bounded by a configurable score threshold. By default, no artifact ships under 8.0/10.

This spec is normative for the v1 implementation and protocol. It is the source of truth that the prompt template, the daemon parser, the SSE event schema, the SQLite columns, the Theater UI components, and the adapter conformance suite all derive from.

Non-goals

  • Not a new skill protocol. Existing skills under skills/ work unchanged.
  • Not a new agent runtime. The daemon spawns the same single CLI per artifact it spawns today.
  • Not a parallel-process architecture. There is exactly one CLI invocation per artifact lifecycle.
  • Not a new transport. SSE on the existing project event stream carries every new event variant.
  • Not a configurable cast in v1. The five panelists are fixed. Cast extension is reserved for v2.

Why each non-goal is excluded

The non-goals above are intentional, not arbitrary. Future readers should know the tradeoff that led to each.

  • No parallel processes. A single CLI session keeps the agent's full context coherent: the Designer's draft and every later panelist's notes share one model context, so the Critic can see the actual hierarchy values the Designer chose, the Brand panelist can read the same DESIGN.md the Designer was passed, and the Copy panelist can pick at the verbs the Designer wrote. Splitting this across processes would require a cross-process artifact handoff and a way to replay prior-panelist notes into each one's context, which adds an estimated two to three weeks to the v1 timeline and turns a debugging session into a multi-process trace correlation problem.
  • No new agent runtime. The same on-PATH CLI Open Design already detects (Claude Code, Codex, Cursor Agent, Gemini CLI, etc.) is how this feature reaches users. Building a runtime would replicate work that the existing daemon already does well, would force users onto a model we picked rather than the one they signed up for, and would lose BYOK at every layer.
  • No new SSE transport. SSE is already plumbed through the daemon, the web app, the Electron shell, and the desktop sidecar IPC. A second transport would force every consumer to learn a second connection lifecycle, which is exactly the kind of accidental complexity that takes a feature out of v1.
  • No configurable cast in v1. A fixed five-panelist roster lets the composite formula and weight defaults stay constant across every artifact. A configurable cast adds UX surface (per-skill picker, override storage, settings sync) and requires the score formula to redistribute weight on the fly. We commit to v2 once we have data on which roles users actually drop or add.
  • No new skill protocol. Critique Theater layers on top of the skill loader; it does not change what a skill is. This means the 31 existing skills, the 129 design systems, and any future contributions inherit the panel without per-skill migration work.

High-level architecture

[ apps/web ]                           [ apps/daemon ]                    [ active CLI ]
  brief form         POST brief          spawnAgent(skill, brief, cfg)      role-plays
  Theater stage   ─────────────────►     compose prompt:                     5 panelists
  score badge                              base skill prompt                 across up to
  transcript                              + design system DESIGN.md          3 rounds in
  replay                                  + panel.ts protocol addendum       one session
                                          + critique config                  emits XML-ish
                  ◄─── SSE events ───────  parse stdout incrementally        tagged stream
  reducer +                                emit panelEvent / roundEvent
  selectors                                / shipEvent / degradedEvent
                                           persist transcript ndjson +
                                           critique columns in SQLite

There are exactly three new modules in the daemon:

Module Responsibility Inputs Outputs
apps/daemon/src/critique/parser.ts Streaming tokenizer for <PANELIST>, <ROUND_END>, <SHIP> blocks. Handles partial chunks, malformed input, recovery. AsyncIterable<string> (CLI stdout) AsyncIterable<PanelEvent>
apps/daemon/src/critique/scoreboard.ts Pure state machine. Consumes PanelEvents, buffers per-round, decides ship vs continue using injected config. No I/O. PanelEvent, CritiqueConfig ScoreboardEvent (delta-driven)
apps/daemon/src/critique/orchestrator.ts Wires parser and scoreboard to the SSE bus and SQLite. Owns interrupt cascade, persistence, and degraded fallback. spawn handle, project id, artifact id side effects: SSE + DB writes

Two new web component groups:

Component group Responsibility
apps/web/src/components/Theater/ Live theater stage, collapsed score badge, transcript replay, interrupted state, degraded banner, interrupt button. Driven entirely by a pure reducer over the SSE event stream.
apps/web/src/prompts/panel.ts The protocol addendum injected into every artifact-generating prompt. Exports PROTOCOL_VERSION.

One new contract package extension:

File Purpose
packages/contracts/src/critique.ts CritiqueConfig zod schema, PANELIST_ROLES constants, PanelEvent discriminated union, CritiqueSseEvent extension to the existing SSE event union.

Configuration

All thresholds, timeouts, and policies are config-driven. There are no magic numbers in business logic. Defaults live in a single defaults.ts next to the schema and are overridable via environment variables read by the existing daemon env layer.

// packages/contracts/src/critique.ts
export const PANELIST_ROLES = ['designer', 'critic', 'brand', 'a11y', 'copy'] as const;
export type PanelistRole = typeof PANELIST_ROLES[number];

export interface CritiqueConfig {
  enabled: boolean;
  cast: PanelistRole[];
  maxRounds: number;
  scoreThreshold: number;
  scoreScale: number;
  weights: Record<PanelistRole, number>;
  perRoundTimeoutMs: number;
  totalTimeoutMs: number;
  parserMaxBlockBytes: number;
  fallbackPolicy: 'ship_best' | 'ship_last' | 'fail';
  protocolVersion: number;
  maxConcurrentRuns: number;
}
Env var Default Notes
OD_CRITIQUE_ENABLED false at M0, true from M3 Master switch. False = legacy generation.
OD_CRITIQUE_MAX_ROUNDS 3 Hard upper bound on rounds.
OD_CRITIQUE_SCORE_THRESHOLD 8.0 Ship gate. Composite below this continues.
OD_CRITIQUE_SCORE_SCALE 10 Score range upper bound. Lower bound is always 0.
OD_CRITIQUE_ROUND_TIMEOUT_MS 90000 Per-round wall clock cap.
OD_CRITIQUE_TOTAL_TIMEOUT_MS 240000 Total run wall clock cap.
OD_CRITIQUE_PARSER_MAX_BLOCK_BYTES 262144 Hard cap on bytes between matched tags. Prevents unbounded buffering.
OD_CRITIQUE_FALLBACK_POLICY ship_best When threshold never met, which round to keep.
OD_CRITIQUE_MAX_CONCURRENT_RUNS os.cpus().length Daemon-wide cap. Excess requests queue per project FIFO.

The default weights for composite score are { designer: 0, critic: 0.40, brand: 0.20, a11y: 0.20, copy: 0.20 }. Designer is omitted from the composite because Designer drafts; Designer does not score. If a panelist's score is missing in a round, weight redistributes proportionally across present panelists. The weights object is a single source of truth; the formula lives only in scoreboard.ts.

Wire protocol (PROTOCOL_VERSION = 1)

The format is XML-ish tagged regions, not JSON. Tagged regions tolerate streaming, partial chunks, and model variability where JSON does not. Existing OD prompt files (directions.ts, discovery.ts) already use this style.

<CRITIQUE_RUN version="1" maxRounds="3" threshold="8.0" scale="10">

  <ROUND n="1">
    <PANELIST role="designer">
      <NOTES>One sentence stating the design intent for v1.</NOTES>
      <ARTIFACT mime="text/html"><![CDATA[
        ... self-contained artifact for round 1 ...
      ]]></ARTIFACT>
    </PANELIST>

    <PANELIST role="critic" score="6.4" must_fix="3">
      <DIM name="hierarchy" score="6">CTA competes with logo at top-left.</DIM>
      <DIM name="type"      score="7">H1 64px reads as poster, not landing.</DIM>
      <DIM name="contrast"  score="4">CTA 3.9:1, fails AA.</DIM>
      <DIM name="rhythm"    score="6">Vertical gaps 12/16/24/40, no system.</DIM>
      <DIM name="space"     score="5">Hero padding asymmetric L vs R.</DIM>
      <MUST_FIX>Push CTA background L by 8% to clear AA.</MUST_FIX>
      <MUST_FIX>Pick a 4px or 8px vertical rhythm.</MUST_FIX>
      <MUST_FIX>Symmetrize hero horizontal padding.</MUST_FIX>
    </PANELIST>

    <PANELIST role="brand" score="7.5"> ... </PANELIST>
    <PANELIST role="a11y" score="5.0"> ... </PANELIST>
    <PANELIST role="copy" score="6.0"> ... </PANELIST>

    <ROUND_END n="1" composite="6.18" must_fix="7" decision="continue">
      <REASON>Composite below threshold 8.0; 7 must-fix open.</REASON>
    </ROUND_END>
  </ROUND>

  <ROUND n="2"> ... </ROUND>
  <ROUND n="3"> ... </ROUND>

  <SHIP round="3" composite="8.62" status="shipped">
    <ARTIFACT mime="text/html"><![CDATA[
      ... final shipped artifact ...
    ]]></ARTIFACT>
    <SUMMARY>One paragraph describing what changed across rounds and why.</SUMMARY>
  </SHIP>

</CRITIQUE_RUN>

Parser invariants

Invariant Enforcement
Tags well-formed and balanced Streaming parser. Malformed input raises MalformedBlockError.
role value is in PANELIST_ROLES Unknown role drops the block, emits warning, run continues.
score is 0 <= n <= scoreScale with one decimal Out of range clamps to bounds and emits warning.
composite matches recomputed mean within ±0.05 Mismatch logs warning; daemon's recomputed value wins.
Each round contains every cast role exactly once Missing role contributes score 0 for that role; must-fix counter advances.
Designer round 1 contains exactly one <ARTIFACT> Missing artifact raises MissingArtifactError and triggers degraded fallback.
<SHIP> appears exactly once or never Duplicates: first wins, rest dropped, warning emitted.
decision is continue or ship Unknown defaults to continue (safe).
Total bytes between matched open and close tags ≤ parserMaxBlockBytes Excess raises OversizeBlockError and triggers degraded fallback.
CDATA appears only inside <ARTIFACT> and <NOTES> Lexer state machine enforces.

Composite score formula

// Pure function in apps/daemon/src/critique/scoreboard.ts.
// weights come from CritiqueConfig.weights, never hardcoded in business logic.
const present = roles.filter(r => panelistScore[r] != null);
const totalWeight = present.reduce((s, r) => s + weights[r], 0);
const composite = present.reduce((s, r) => s + (weights[r] / totalWeight) * panelistScore[r], 0);

Protocol versioning

  • <CRITIQUE_RUN version="1"> is canonical for v1.
  • Parser dispatches to parsers/v1.ts. Future v2 lands in parsers/v2.ts alongside v1; the daemon supports both indefinitely.
  • Each artifact row stores critique_protocol_version. Old artifacts replay through their original parser, never the new one.
  • The prompt template exports PROTOCOL_VERSION as a TypeScript constant; CI fails if this constant is bumped without a new parsers/v{n}.ts file.

Disagreement requirement

Every non-final round must contain at least two panelists with diverging MUST_FIX directives. The Critic and at least one of {Brand, A11y, Copy} must each emit at least one MUST_FIX whose target subsystem differs from the others. The protocol enforces this in the prompt template; the parser emits a WeakDebate warning if the round closes without disagreement, which the orchestrator counts toward parser_errors_total{kind="weak_debate"} for observability but does not fail the run on.

Convergence rule

A round closes with decision="ship" when both:

  • composite >= scoreThreshold
  • Sum of open must_fix counts across panelists is 0

Otherwise the round closes with decision="continue" and the next round begins. After maxRounds, the orchestrator applies fallbackPolicy:

  • ship_best: choose the round with highest composite. Default.
  • ship_last: choose the last completed round.
  • fail: persist the run as below_threshold with no shipped artifact; UI shows recovery actions.

Self-bound budget

Round n+1 transcript bytes must be less than round n transcript bytes. Final shipped artifact must be production-ready: no TODO comments, no Lorem Ipsum, no broken links. The prompt template enforces; the orchestrator runs a final lint pass before persisting the artifact.

SSE event protocol

The existing /api/projects/:id/events SSE stream carries new event variants. All event names live in packages/contracts/src/sse.ts as a discriminated union extension, never as string literals in handlers.

Event name Payload fields Meaning
critique.run_started runId, protocolVersion, cast, maxRounds, threshold, scale Theater handshake. UI initializes lanes from cast, lays out rounds from maxRounds.
critique.panelist_open runId, round, role Tag opened in the stream. UI activates lane caret.
critique.panelist_dim runId, round, role, dimName, dimScore, dimNote Per-dim score line. UI animates the corresponding bar.
critique.panelist_must_fix runId, round, role, text Must-fix directive. UI appends to the lane.
critique.panelist_close runId, round, role, score Tag closed. UI deactivates caret, locks score.
critique.round_end runId, round, composite, mustFix, decision, reason Round closed. UI advances ScoreTicker, RoundDivider increments.
critique.ship runId, round, composite, status, artifactRef, summary Run shipped. UI collapses theater into score badge.
critique.degraded runId, reason, adapter Parser fallback fired. UI replaces theater with banner.
critique.interrupted runId, bestRound, composite User interrupt. UI shows interrupted state.
critique.failed runId, cause Unrecoverable error. UI shows failed state with retry.
critique.parser_warning runId, kind, position Non-fatal parser recovery. UI surfaces to log only, never to user, unless kind == "weak_debate".

The artifactRef payload is a logical reference ({ projectId, artifactId }), not the artifact body. The UI fetches the body through the existing artifact endpoint and renders in the existing sandboxed iframe.

Persistence

A new SQLite migration adds critique columns to the existing artifacts table. The migration is additive and reversible.

-- 00XX_critique_rounds.up.sql
ALTER TABLE artifacts ADD COLUMN critique_score REAL;
ALTER TABLE artifacts ADD COLUMN critique_rounds_json TEXT;
ALTER TABLE artifacts ADD COLUMN critique_transcript_path TEXT;
ALTER TABLE artifacts ADD COLUMN critique_status TEXT
  CHECK (critique_status IN ('shipped','below_threshold','timed_out','interrupted','degraded','failed','legacy'));
ALTER TABLE artifacts ADD COLUMN critique_protocol_version INTEGER;
CREATE INDEX IF NOT EXISTS idx_artifacts_critique_status ON artifacts(critique_status);
-- 00XX_critique_rounds.down.sql
DROP INDEX IF EXISTS idx_artifacts_critique_status;
ALTER TABLE artifacts DROP COLUMN critique_protocol_version;
ALTER TABLE artifacts DROP COLUMN critique_status;
ALTER TABLE artifacts DROP COLUMN critique_transcript_path;
ALTER TABLE artifacts DROP COLUMN critique_rounds_json;
ALTER TABLE artifacts DROP COLUMN critique_score;
Column Format Notes
critique_score REAL Final composite. NULL for legacy artifacts.
critique_rounds_json TEXT Compact summary per round: [{n, composite, mustFix, decision}]. Bounded; full transcript on disk.
critique_transcript_path TEXT Relative path under .od/artifacts/<artifactId>/. The stored value is the relative path; absolute resolution is daemon-side only.
critique_status TEXT Constrained by CHECK clause. legacy marks artifacts produced before the feature shipped.
critique_protocol_version INTEGER Pin which parsers/v{n}.ts to use for replay.

Transcripts are written to .od/artifacts/<artifactId>/transcript.ndjson (one PanelEvent per line). Files larger than 256 KiB are gzipped to transcript.ndjson.gz. The orchestrator chooses the format at write time; the replay path detects the format by extension.

UI surface

All Theater components live under apps/web/src/components/Theater/. Each file is under 200 lines.

Theater/
  index.ts
  TheaterStage.tsx
  PanelistLane.tsx
  ScoreTicker.tsx
  RoundDivider.tsx
  TheaterCollapsed.tsx
  TheaterTranscript.tsx
  TheaterDegraded.tsx
  InterruptButton.tsx
  hooks/
    useCritiqueStream.ts
    useCritiqueReplay.ts
  state/
    reducer.ts
    selectors.ts
  __tests__/
    reducer.test.ts
    TheaterStage.test.tsx

State machine

type CritiqueState =
  | { phase: 'idle' }
  | { phase: 'running'; runId: string; rounds: Round[]; activeRound: number; activePanelist: PanelistRole | null }
  | { phase: 'shipped'; runId: string; rounds: Round[]; final: ShipPayload }
  | { phase: 'degraded'; reason: DegradedReason }
  | { phase: 'interrupted'; runId: string; rounds: Round[]; bestRound: number }
  | { phase: 'failed'; runId: string; cause: FailedCause };

The reducer is pure. Every transition is covered by a vitest case backed by golden-file SSE replays from the adapter conformance suite. No state lives outside the reducer except UI-only ephemera (lane scroll position, badge expanded vs collapsed).

Visual specification

  • Lanes stack vertically inside the right rail with gap: 12px. At viewport widths under 720 px the rail itself becomes a full-width drawer.
  • Per-dim scores render as horizontal bars whose width animates from 0 to (score / scaleMax) * 100% with transition: width 600ms cubic-bezier(.2,.8,.2,1).
  • Animation respects prefers-reduced-motion: reduce; bars snap to final width and the score-ticker stops easing.
  • Lane colors come from CSS custom properties keyed by role. The five role inks are themed against the active design system's OKLch token palette. No hex literals in TSX.
  • Score badge in collapsed mode renders four colored dots labeled C B A W (Critic, Brand, A11y, copy/Word) plus the composite number.

Lane density (compact-by-default, expand-on-demand)

The right rail is dense by nature (5 panelists, multiple dims each, must-fix lists, round dividers). To keep it readable for normal users without losing detail for power users, the panel ships three explicit modes, switchable by a segmented control above the lanes.

Mode Default? Behavior
Smart yes The active panelist (whichever lane is currently streaming, or, post-ship, the panelist with the lowest score) is expanded. All others are collapsed: head + 1-line summary only.
Active only no Only the active panelist is rendered; others are head-only with no summary, used for the most compact view (e.g., embedded in a small workspace tile).
Expand all no Every lane is fully expanded. Power users who want to see every dim and every must-fix at once. Persisted as a per-user preference once chosen explicitly.

Lane heads are clickable to toggle that single lane's collapsed state, independent of the segmented control. The segmented control is the bulk operation; the lane head is the per-lane override.

The collapsed-lane summary is derived, not authored: it is the most prominent must-fix on that panelist, falling back to the highest-impact dim note if there are no must-fixes (panelist already happy). Truncation at one line with ellipsis. Selectors are pure and unit-tested.

"Why this matters" explainer

Immediately under the composite score card, a single rule line states the ship contract in plain terms:

Ships when composite score ≥ scoreThreshold and open must-fix == 0. Otherwise refine, up to maxRounds rounds.

Three rendering variants:

Phase Variant
running "Ships when composite score ≥ 8.0 and open must-fix == 0. Otherwise refine, up to 3 rounds."
shipped "Shipped: composite 8.6 ≥ threshold 8.0 with 0 open must-fix. Consensus N of 5 panelists."
interrupted "Did not meet ship rule (8.0 + 0 must-fix), but you stopped early. Best-of-N was kept, transcript stays available."

Numeric values come from CritiqueConfig, never hardcoded. The rule line uses a soft dashed border to read as explanatory chrome, not as actionable UI.

Label sizing (production, not mockup)

Element Production size Notes
Composite score (big) 36px / mono / 700 the headline number
Per-panelist score 14px / mono / 700 colored to match lane ink
Dim names 13px / mono left column, secondary fg color
Dim numeric scores 13px / mono / 600 right column, primary fg color
Dim notes (sentence) 13px / sans / 1.55 line-height sentence-level explanation
Must-fix body 13px / sans / 1.5 line-height red ink on tinted background
Lane summary (collapsed) 12px / sans / 1.4 line-height derived single line, ellipsis at width
Lane name 14px / sans / 600 role label inside the head
Role tag 11px / mono / uppercase colored badge, fixed-width pill
Round divider 11px / mono / uppercase / letter-spacing 0.1em thin separator
Ship rule explainer 12px / mono / 1.55 line-height dashed-border explanatory block

Label sizes ≤ 11px are reserved for tags, badges, and uppercase chrome only. Body text is never below 12px in production. This rule is enforced by a CI lint that grep-fails any new font-size: <= 11px rule outside the explicit chrome class allowlist (role-tag, dim-dot, round-divider, meta-pill).

Demo-only chrome (must not ship)

The visual companion mockup includes affordances that are not part of the product:

  • The "demo states" tab strip at the top of the rail.
  • Any footer pill labeled "demo · click tabs to walk every state".
  • The state-walking keyboard shortcut.

These exist purely to let reviewers walk every state in one page. The production Theater renders exactly one phase at a time, driven by the SSE stream. CI fails if any string matching data-demo or class demo-tabs appears in production bundles. The mockup HTML lives under .superpowers/brainstorm/ (gitignored) and never ships.

Accessibility

  • Each lane has role="region" with aria-labelledby referencing its title.
  • A single offscreen status node carries aria-live="polite". It announces only round_end and ship events, never per-dim.
  • Keyboard map: Tab cycles lanes, Enter toggles dim detail, Esc triggers Interrupt with confirm, [ / ] step through rounds in collapsed and replay mode.
  • Color is never the only signal: must-fix counts use both color and a numeric badge; dim bars carry both color and width; scores always show as text.
  • The Theater UI is itself audited by a CI test that pipes its rendered DOM through the same WCAG AA rule set the A11y panelist applies.

Performance budget

Metric Budget Enforcement
Stage component bundle ≤ 18 KiB gzipped size-limit in CI
Reducer hot path p99 ≤ 2 ms per event vitest bench in CI
First lane visible from first SSE event ≤ 200 ms Playwright trace in e2e/critique-theater.spec.ts
Transcript scrub at 1 MiB transcript 60 fps, no dropped frames Playwright Page.metrics

Interrupt semantics

Pressing the Interrupt button or Esc while phase === 'running':

  1. Reducer transitions optimistically to a transient interrupting state. The button disables and labels itself "Interrupting…".
  2. Web posts POST /api/projects/:id/critique/:runId/interrupt.
  3. Daemon cascades SIGTERM to the spawned CLI. Orchestrator drains the parser, flushes the scoreboard, applies fallbackPolicy to the rounds completed so far, persists, and emits critique.interrupted.
  4. Reducer advances to phase: 'interrupted'. UI shows the best-completed round's artifact plus a tag indicating which round shipped and the score.

If interrupt fires before any round closes, daemon ships nothing and persists the artifact row as interrupted with no final. The UI shows an empty state offering one-click retry with the original brief.

Replay

Reopening an artifact loads the badge from SQLite columns. Clicking expand triggers useCritiqueReplay, which streams transcript.ndjson (or .gz) line by line into the same reducer. The same component code path renders both live and replay. Replay supports 1×, 4×, and instant playback speeds and a click-to-scrub timeline.

Failure modes

Failure Detection Behavior
Model emits malformed block parser sees opening tag with no close after parserMaxBlockBytes or before close tag parser raises MalformedBlockError, orchestrator falls back to legacy_generation mode for this run, emits critique.degraded with reason. Artifact still ships through the legacy path.
Score never crosses threshold scoreboard reaches maxRounds without consensus apply fallbackPolicy. Score badge tagged below_threshold.
Per-round timeout round wall clock exceeds perRoundTimeoutMs abort current round, scoreboard ships best-so-far, badge tagged timed_out.
Total timeout total wall clock exceeds totalTimeoutMs SIGTERM CLI, ship best-so-far, transcript marked partial.
User Interrupt POST /api/projects/:id/critique/:runId/interrupt cascade SIGTERM, persist partial state, ship best-so-far if any round closed, otherwise mark interrupted with no final.
CLI process crash spawn handle exits non-zero before <SHIP> persist partial transcript, mark failed with rounds_json.cause, emit critique.failed, never silently retry.
Daemon restart mid-run next boot scans SQLite for rows in state running older than totalTimeoutMs mark interrupted with rounds_json.recoveryReason = "daemon_restart". Never auto-resume.
Adapter unsupported adapter fails conformance test in nightly CI adapter marked critique:degraded in adapter registry with 24h TTL. UI shows the degraded banner once per session per adapter.

Failure-mode rate targets and recovery

Each failure mode above has an empirical rate target, a deterministic recovery path, and a Prometheus signal so the Phase 12 dashboard and the Phase 11 e2e suite can both pin sane thresholds. Targets are starting values; we tune them with the first 1000 production runs.

Failure Target rate Recovery Prometheus signal Alert threshold
malformed_block < 0.5% of runs per adapter emit critique.degraded, fall through to legacy single-pass generation. No retry. open_design_critique_degraded_total{reason="malformed_block",adapter="..."} sustained > 2% over 1h on any single adapter
oversize_block < 0.1% of runs same as malformed_block plus the parser's position is logged for postmortem open_design_critique_degraded_total{reason="oversize_block"} any non-zero sustained rate is treated as a model regression and pages
missing_artifact < 0.2% of runs degraded fallback, prompt template flagged for review (it should make this impossible) open_design_critique_degraded_total{reason="missing_artifact"} > 1% over 24h
Score never crosses threshold < 5% of runs at M3 default apply fallbackPolicy (default ship_best); badge tagged below_threshold; user can re-run open_design_critique_runs_total{status="below_threshold"} > 15% sustained over 24h is a quality regression
Per-round timeout < 1% of runs abort current round, ship best-so-far, badge tagged timed_out open_design_critique_runs_total{status="timed_out"} > 3% over 1h on any adapter
Total timeout < 0.5% of runs SIGTERM CLI, ship best-so-far, transcript marked partial same series, distinguished by lifecycle attribute shares the per-round timeout alert
User Interrupt not an error; signal of latency or unwanted direction preserve transcript, ship best-so-far if any round closed open_design_critique_interrupted_total{adapter="..."} > 10% over 24h is a UX problem worth investigating
CLI process crash < 0.05% of runs fail loud with critique.failed, no silent retry, daemon process surfaces the cause open_design_critique_runs_total{status="failed",cause="cli_exit_nonzero"} any non-zero sustained rate pages
Daemon restart mid-run unbounded (depends on operator action) persist interrupted with recoveryReason="daemon_restart", never auto-resume open_design_critique_runs_total{status="interrupted",cause="daemon_restart"} informational, no alert
Adapter unsupported adapter-specific; surfaces during conformance adapter marked critique:degraded in registry with 24h TTL; degraded banner once per session open_design_critique_degraded_total{reason="adapter_unsupported",adapter="..."} one alert per adapter on first hit, suppressed during TTL

A run can satisfy multiple labels (a timed_out run that also below_thresholds, for instance). The status label is a single canonical value derived by the orchestrator at run end; downstream histograms attach cause and decision as separate labels.

The Phase 11 e2e suite synthesizes each row above against a stub adapter and asserts the recovery actually fires. Phase 12 imports the alert thresholds into the Grafana dashboard JSON committed at tools/dev/dashboards/critique.json so an operator never has to guess.

Concurrency and scalability

  • Orchestrator instances are per-project. Daemon-wide concurrency cap via OD_CRITIQUE_MAX_CONCURRENT_RUNS. Excess requests queue with project-level FIFO.
  • Streaming backpressure: parser is AsyncIterable-based. When the SSE consumer is slow, daemon's bounded mailbox (256 events) applies backpressure to the parser, which applies it to CLI stdout via the existing sidecar transport. No unbounded buffers anywhere.
  • Transcripts larger than 1 MiB stream to disk only; the SQLite row stores a path, never a blob.
  • Cross-skill reuse: every existing skill receives the panel for free; no per-skill code change is required.
  • Adapter neutrality: the panel prompt is plain text with no CLI-specific tokens. Every adapter is tested via the conformance harness.

Skills protocol extension

Skills opt out via a new optional frontmatter field in SKILL.md:

od:
  critique:
    policy: always | on-demand | off

Default is always once OD_CRITIQUE_ENABLED flips global at M3. Per-skill overrides ship in the same PR as M2.

Adapter conformance

For each of the 12 CLI adapters and the BYOK proxy, the conformance suite runs a deterministic mini-brief at temperature=0 (where supported) and asserts:

  • The parser consumes the stream without MalformedBlockError.
  • All required tags appear in the canonical order.
  • Composite score matches recomputed mean within ±0.05.
  • <SHIP> artifact body parses as well-formed HTML.

Adapters that fail are marked critique:degraded in the adapter registry. The daemon falls back to legacy generation for that adapter and shows the degraded banner once per session. We never pretend feature parity that does not exist.

Observability

Metric Type Labels Purpose
open_design_critique_runs_total counter status, adapter, skill Run volume by terminal state.
open_design_critique_rounds_total counter adapter, skill Average rounds per artifact.
open_design_critique_round_duration_ms histogram quantile, adapter, skill, round Round latency distribution.
open_design_critique_composite_score histogram quantile, adapter, skill Output quality distribution.
open_design_critique_must_fix_total counter panelist, dim, adapter, skill Where the panel finds problems most often.
open_design_critique_degraded_total counter reason, adapter Adapter health proxy.
open_design_critique_interrupted_total counter adapter User abandonment signal.
open_design_critique_parser_errors_total counter kind, adapter Parser robustness.
open_design_critique_protocol_version gauge version Active protocol versions in use.

Structured logs use the existing daemon logger with namespace critique. Required events: run_started, round_closed, run_shipped, degraded, parser_recover, run_failed. OpenTelemetry traces wrap each run with spans critique.run, critique.round.<n>, critique.parse_chunk, critique.scoreboard_eval, critique.persist_round, critique.ship.persist.

A Grafana dashboard ships in tools/dev/dashboards/critique.json with three default views: fleet quality (composite p50/p90/p99 over time per adapter), adapter health (degraded ratio plus parser-error rate), and brief throughput (runs per hour, rounds per run, time-to-ship).

Security

Surface Threat Mitigation
<ARTIFACT> body XSS in shipped HTML Existing sandboxed iframe pattern with sandbox and CSP headers. No new surface.
Transcript on disk Path traversal via stored path The SQLite column stores a relative path; the daemon resolves under .od/artifacts/<artifactId>/; no user-supplied component reaches the resolver.
Brand DESIGN.md content in prompt Prompt injection from a malicious system DESIGN.md is wrapped in a <BRAND_SOURCE> block whose framing instructs the agent to treat it as data. Same defence as the existing skill loader.
Score and must-fix from agent stdout Log injection (newlines, ANSI) Parser strips ANSI; logs JSON-encode every value; UI renders as text only.
Pathological agent output DoS via unbounded buffers parserMaxBlockBytes, perRoundTimeoutMs, totalTimeoutMs are bounded and config-driven. Orchestrator enforces hard kill.
BYOK proxy invocation SSRF / internal-IP exfil Existing /api/proxy/stream blocks internal IPs at the daemon edge. Unchanged.

A separate security review pass runs through the code-reviewer agent before merge.

Testing

Layer Tool Gate
Pure unit vitest 95% line and 100% branch on critique/parser.ts, scoreboard.ts, reducer.ts.
Golden-file fixtures vitest with __fixtures__/critique/v1/*.txt Each adapter has at least one happy and two malformed transcripts on disk.
Component RTL + jsdom Every reducer phase rendered at least once.
Integration vitest + sqlite memory + http mock End-to-end happy path plus five failure modes.
Adapter conformance nightly e2e against live adapters Each of 12 CLIs plus BYOK proxy must pass canonical brief.
Playwright e2e e2e/critique-theater.spec.ts Theater renders within 200 ms, Esc triggers Interrupt, replay scrub at 60 fps.
Visual regression Playwright toHaveScreenshot() Each Theater state captured at 375 / 768 / 1280 viewports.
A11y self-test axe-playwright Theater UI passes WCAG AA.
Performance size-limit + vitest bench Bundle ≤ 18 KiB gz; reducer p99 ≤ 2 ms.
Dead-code ts-prune scoped to critique/ and Theater/ Zero unreferenced exports.
i18n existing duplicate-key check plus new missing-key check All 6 locales present for every Theater string.
Coverage walker new pnpm check:critique-coverage Each CritiqueConfig field, PanelEvent variant, SSE event, SQLite column, protocol grammar element, and i18n key has at least one production reference and one test.

Rollout

Milestone Scope Default Reversibility
M0 Code lands behind OD_CRITIQUE_ENABLED=false. All tests run. No user-visible change. off flip env var
M1 Settings UI toggle "Critique Theater (beta)". Adapter conformance grades published in README. 24h TTL on degraded markings. off flip toggle
M2 Default-on for skills that benefit most: magazine-poster, saas-landing, dashboard, finance-report, hr-onboarding, kanban-board. Lightweight skills (weekly-update, simple-deck) remain opt-in. Per-skill od.critique.policy introduced in SKILL.md. per-skill per-skill flag
M3 After 14 consecutive days at ≥ 90% adapter conformance across the fleet, flip the global default to true. Opt-out remains per-run and per-skill. on env var or per-skill

Rollback at any milestone flips the master env var. SQLite columns are additive, never required for reads. Existing artifacts keep their badge; new artifacts go through legacy generation. No data migration is needed in either direction.

Documentation deliverables

Every item is a CI gate. Missing documentation fails the build.

  • docs/critique-theater.md, user-facing how-it-works with screenshots of all five states and an adapter compatibility table.
  • docs/spec.md, adds a "Critique Theater protocol v1" section with the full wire grammar.
  • docs/architecture.md, adds the apps/daemon/src/critique/ module diagram.
  • docs/skills-protocol.md, adds od.critique.policy field documentation and extension points.
  • docs/agent-adapters.md, adds the conformance test contract every adapter must satisfy.
  • docs/roadmap.md, adds future panelist extensions (Perf, i18n, Motion, Cost) as future work, not committed scope.
  • apps/daemon/src/critique/AGENTS.md, module-level guide per the existing AGENTS.md convention.
  • apps/web/src/components/Theater/AGENTS.md, same.
  • README.md, single line in the "What you get" table: Critique Theater, every artifact panel-tempered, scored, replayable.
  • All six locales (DE, JA, KO, zh-CN, zh-TW, EN) gain new strings in the same PR. Existing duplicate-key i18n CI gate covers consistency.

Open questions

None. All foundational decisions are locked in this spec. Future work is reserved for v2 (configurable cast, additional panelist roles, multi-CLI parallel debate, score-driven prompt-stack feedback loop) and is out of v1 scope.