open-design/specs/current/run.md
Zakaria a46764fb1b
Some checks failed
ci / Validate workspace (push) Has been cancelled
landing-page-ci / Validate landing page (push) Has been cancelled
landing-page-deploy / Deploy landing page (push) Has been cancelled
github-metrics / Generate repository metrics SVG (push) Has been cancelled
first-commit
2026-05-04 14:58:14 -04:00

9.4 KiB

Run Model and Recovery Flow

Purpose

A run is one daemon-owned background execution instance for a user request. It lets the daemon keep an agent task alive across web page refreshes, tab closes, route changes, and temporary SSE disconnects.

The frontend owns presentation state. The daemon owns execution state. SSE owns live subscription and replay.

Concept Model

A project is the top-level design workspace. It contains conversations, owns artifacts, and provides the daemon working directory for agent execution.

A conversation is a thread inside a project. It contains ordered messages and provides the UI context for multi-turn work.

A message is user-visible conversation content. A user message records the request. An assistant message records the generated response and can be backed by one run while generation is active or recoverable.

A run is a daemon-owned execution instance. It belongs to one project and one conversation, and it targets one assistant message. The run starts and supervises one agent process, records execution status, and stores replayable SSE events.

The intended cardinality is:

  • One project contains many conversations.
  • One conversation contains many messages.
  • One project can have many runs.
  • One conversation can have many runs.
  • One assistant message can have zero or one run.
  • One run belongs to one project, one conversation, and one assistant message.
  • One run can start one agent process during active execution.

The recovery path follows the user-visible hierarchy: open a project, load a conversation, find assistant messages with active run metadata, then reattach to the daemon run.

Concept Responsibilities

Project

A project is the design workspace. It provides:

  • project metadata, such as skill, design system, and fidelity;
  • the daemon working directory, usually .od/projects/<projectId>/;
  • artifact ownership;
  • the top-level scope for conversations and runs.

Conversation

A conversation is a thread inside a project. It provides:

  • the ordered message history;
  • the UI context for multi-turn work;
  • the grouping key for active run recovery.

Message

A message is user-visible conversation content. An assistant message is also the durable UI container for a run result. It should store:

  • runId: the daemon execution backing this assistant response;
  • runStatus: the latest known run state;
  • lastRunEventId: the latest applied SSE event ID;
  • partial generated content, persisted during streaming.

Run

A run is a daemon-managed execution instance. It provides:

  • agent process startup;
  • execution status, such as queued, running, succeeded, failed, or canceled;
  • replayable SSE events;
  • reconnect support through events?after=<lastRunEventId>;
  • explicit cancellation through the cancel endpoint.

Each run should carry projectId, conversationId, and assistantMessageId. These fields let the daemon recover active work for a reopened project page and let the frontend attach output to the correct assistant message.

Primary Communication Flow

sequenceDiagram
  participant Web as Web UI
  participant API as Web API Proxy
  participant Daemon as Daemon
  participant Agent as Agent Process
  participant Store as Message Store

  Web->>Web: Generate assistantMessageId and clientRequestId
  Web->>Store: Persist user message
  Web->>Store: Persist empty assistant message with runStatus=running

  Web->>API: POST /api/runs with projectId, conversationId, assistantMessageId, clientRequestId, request payload
  API->>Daemon: Forward POST /api/runs
  Daemon->>Daemon: Create run with status=queued
  Daemon->>Agent: Spawn agent in project workspace
  Daemon->>Daemon: Mark run status=running
  Daemon-->>API: 202 runId
  API-->>Web: 202 runId

  Web->>Store: Persist runId on assistant message
  Web->>API: GET run events
  API->>Daemon: Attach SSE client
  Daemon-->>Web: SSE start event with id

  loop Streaming output
    Agent-->>Daemon: stdout / structured event
    Daemon->>Daemon: Append event to run buffer
    Daemon-->>Web: SSE stdout / agent event with id
    Web->>Web: Apply event to assistant message
    Web->>Store: Throttled persist content and lastRunEventId
  end

  Agent-->>Daemon: Process close
  Daemon->>Daemon: Mark terminal status
  Daemon-->>Web: SSE end event with status
  Web->>Store: Persist final content, runStatus, and lastRunEventId

Refresh and Reattach Flow

sequenceDiagram
  participant Web1 as Web UI Before Refresh
  participant Web2 as Web UI After Refresh
  participant API as Web API Proxy
  participant Daemon as Daemon
  participant Agent as Agent Process
  participant Store as Message Store

  Web1->>API: GET run events
  API->>Daemon: Attach SSE client
  Daemon-->>Web1: SSE events

  Web1-xAPI: Page refresh closes browser subscription
  Note over Daemon,Agent: Run continues in daemon and agent process keeps running
  Agent-->>Daemon: More output while page is unavailable
  Daemon->>Daemon: Buffer events with increasing IDs

  Web2->>Store: Load project conversations and messages
  Store-->>Web2: Assistant message with runId, runStatus, lastRunEventId
  Web2->>API: GET run status
  API->>Daemon: Fetch run status
  Daemon-->>Web2: Run status

  alt Run is active
    Web2->>API: GET run events after lastRunEventId
    API->>Daemon: Reattach SSE client after last applied event
    Daemon-->>Web2: Replay missed events
    Daemon-->>Web2: Continue live SSE events
    Web2->>Store: Persist resumed content and lastRunEventId
  else Run is terminal
    Web2->>API: GET run events after lastRunEventId
    API->>Daemon: Request remaining buffered events
    Daemon-->>Web2: Replay remaining events and end
    Web2->>Store: Persist terminal runStatus
  end

Active Run Fallback Flow

The frontend should persist runId on the assistant message immediately after run creation. A small failure window still exists between daemon run creation and message update. The daemon should also support an active run list endpoint as a recovery fallback.

sequenceDiagram
  participant Web as Web UI
  participant API as Web API Proxy
  participant Daemon as Daemon
  participant Store as Message Store

  Web->>Store: Load messages for project and conversation
  Store-->>Web: Assistant message with runStatus=running and no runId
  Web->>API: GET active runs for project and conversation
  API->>Daemon: Query active runs
  Daemon-->>Web: Active runs with assistantMessageId
  Web->>Web: Match run.assistantMessageId to assistant message ID
  Web->>Store: Persist recovered runId on assistant message
  Web->>API: GET run events after lastRunEventId
  API->>Daemon: Reattach SSE client
  Daemon-->>Web: Replay and live events

Explicit Cancel Flow

Browser subscription lifetime and daemon run lifetime are separate. Refresh, tab close, and route changes close the local subscription only. The daemon receives a cancel request only when the user explicitly clicks Stop.

sequenceDiagram
  participant Web as Web UI
  participant API as Web API Proxy
  participant Daemon as Daemon
  participant Agent as Agent Process
  participant Store as Message Store

  Web->>Web: User clicks Stop
  Web->>API: POST cancel run
  API->>Daemon: Forward cancel request
  Daemon->>Daemon: Mark cancelRequested=true
  Daemon->>Agent: Send SIGTERM
  Agent-->>Daemon: Process closes
  Daemon->>Daemon: Mark run status=canceled
  Daemon-->>Web: SSE end event with status=canceled
  Web->>Store: Persist runStatus=canceled

API Surface

Recommended run APIs:

POST /api/runs
GET  /api/runs/:id
GET  /api/runs/:id/events?after=<lastRunEventId>
GET  /api/runs?projectId=<projectId>&conversationId=<conversationId>&status=active
POST /api/runs/:id/cancel

POST /api/runs should accept correlation fields:

interface ChatRunCreateRequest {
  projectId: string;
  conversationId: string;
  assistantMessageId: string;
  clientRequestId: string;
  agentId: string;
  message: string;
  model?: string | null;
  reasoning?: string | null;
}

GET /api/runs/:id should return enough state for recovery:

interface ChatRunStatusResponse {
  id: string;
  projectId: string;
  conversationId: string;
  assistantMessageId: string;
  agentId: string;
  status: 'queued' | 'running' | 'succeeded' | 'failed' | 'canceled';
  createdAt: number;
  updatedAt: number;
  exitCode?: number | null;
  signal?: string | null;
}

Persistence Phases

Phase 1: Refresh and Tab Close Survival

  • Keep daemon runs in memory.
  • Persist runId, runStatus, lastRunEventId, and partial assistant content in the message store.
  • Reattach after refresh while the daemon process is still alive.
  • Keep terminal run metadata and event buffers long enough for short-term UI recovery.

Phase 2: Daemon Restart Visibility

  • Persist chat_runs and chat_run_events in daemon storage.
  • Mark active runs as interrupted after daemon restart because the child process exits with the daemon.
  • Preserve terminal status and buffered output for user-facing history.

Implementation Rules

  • A browser fetch abort should close only the local SSE subscription.
  • The Stop button is the only UI action that should call /api/runs/:id/cancel.
  • The frontend should persist runId immediately after POST /api/runs succeeds.
  • The frontend should process SSE events idempotently using lastRunEventId.
  • The daemon should allow multiple simultaneous SSE clients for one run.
  • The daemon should expose active runs by project and conversation for fallback recovery.