open-design/specs/current/maintainability-roadmap.md
Zakaria a46764fb1b
2026-05-04 14:58:14 -04:00

Maintainability Roadmap

Purpose

This document captures the maintainability risks in the current apps/web + apps/daemon architecture and the recommended optimization path.

The architectural boundary stays unchanged:

  • apps/web: Next.js frontend and thin BFF/proxy layer.
  • apps/daemon: local runtime/backend for SQLite, .od filesystem state, AI agent CLI processes, and SSE streaming.

The first-principles maintainability goals are:

  • Understandability: engineers can locate behavior quickly and reason about data flow.
  • Changeability: common changes can be made with bounded blast radius.
  • Verifiability: contracts, tests, and types catch regressions early.
  • Isolation: high-risk capabilities are contained behind explicit boundaries.
  • Recoverability: failures produce actionable state, logs, and cleanup behavior.

Priority Scale

| Priority | Meaning |
| --- | --- |
| P0 | Blocks safe evolution or creates high-risk runtime/security failure modes. |
| P1 | Major maintainability risk that increases regression and debugging cost. |
| P2 | Medium-term risk that affects reliability, portability, or architecture clarity. |
| P3 | Supporting documentation/process improvement. |

Risk List and Optimization Plan

| ID | Priority | Risk | Evidence | Impact | Optimization Plan |
| --- | --- | --- | --- | --- | --- |
| R1 | P0 | Daemon lacks TypeScript type checking. | apps/daemon is mostly JavaScript while handling API payloads, SQLite rows, filesystem paths, child processes, and SSE events. | API payloads, DB rows, agent events, and task states can drift silently; refactors are riskier. | Add gradual TypeScript support with allowJs; write new daemon modules in .ts; first type API payloads, SSE events, task lifecycle, DB rows, and agent definitions. |
| R2 | P0 | Web/daemon API contract is implicit. | apps/web calls daemon through /api/* rewrites; web has TypeScript types, daemon returns manually shaped JSON. | Field mismatches surface at runtime; API evolution is fragile. | Create packages/api-contract or an equivalent shared contract layer for request, response, error, and SSE event types. |
| R3 | P0 | Runtime validation is incomplete at the daemon boundary. | Daemon requests can trigger local filesystem access, SQLite writes, and child_process.spawn(). | Type correctness alone cannot protect against malformed runtime input, path traversal, invalid agent IDs, or unsafe args. | Add schema validation at HTTP boundaries with Zod/TypeBox; centralize validation for workspace paths, task IDs, agent IDs, models, reasoning options, uploaded files, and command arguments. |
| R4 | P0 | Local capability security boundary needs explicit rules. | Daemon owns high-permission capabilities: local files, .od, project workspaces, agent CLIs, and logs. | Unsafe path handling, broad command execution, token leakage, and unintended workspace access become possible failure modes. | Treat daemon as a capability server: bind to localhost, use workspace/path allowlists, normalize and jail paths, allowlist agent commands, and redact sensitive output. |
| R5 | P0 | Agent process lifecycle needs a first-class manager. | /api/chat spawns multiple agent runtimes and streams output to the frontend. | Zombie processes, cancellation gaps, orphaned tasks, inconsistent exit handling, and concurrent process conflicts. | Introduce a process/task manager with task state machine, cancellation, timeout, cleanup, exit code capture, signal handling, and concurrency limits. |
| R6 | P1 | server.ts is too monolithic. | apps/daemon/src/server.ts contains many routes plus orchestration, filesystem logic, streaming, uploads, and artifact handling. | Harder to understand, test, and change; unrelated edits share the same file and increase regression risk. | Split into thin routes plus services/adapters: routes/, services/, agents/, db/, fs/, streams/, artifacts/. |
| R7 | P1 | Error handling is inconsistent. | Handlers commonly use local try/catch and return ad hoc JSON errors. | UI receives inconsistent failures; logs lose context; task state can stall after partial failures. | Define a unified error model with code, message, details, retryable, and requestId/taskId; add centralized Express error middleware and adapter-level error mapping. |
| R8 | P1 | SSE protocol is under-specified. | Daemon manually writes text/event-stream events for agent output and status. | Frontend parsing is fragile; disconnect, heartbeat, terminal events, and error semantics can drift. | Version the SSE event contract and define canonical events such as task.started, task.output, task.error, task.completed, task.cancelled, and heartbeat. |
| R9 | P1 | SQLite schema and migration lifecycle need stronger guarantees. | apps/daemon/src/db.ts owns local better-sqlite3 tables and migrations. | Local user data upgrades can fail unpredictably; schema drift is hard to diagnose and recover. | Add explicit migration table, ordered forward migrations, startup migration checks, schema version logging, backup-before-migrate strategy, and migration tests. |
| R10 | P1 | Test coverage is thin around daemon behavior. | Existing daemon tests focus on stream parsing and artifact manifest behavior; HTTP/DB/spawn flows have limited coverage. | Changes are validated by manual testing; regressions in filesystem, SQLite, SSE, or agent mocks can ship. | Build layered tests: shared contract tests, route integration tests, service unit tests, SQLite migration tests, SSE parser tests, and agent mock integration tests. |
| R11 | P1 | Logging and observability are insufficient for local runtime debugging. | Agent execution involves long-lived tasks, subprocess output, filesystem state, and frontend SSE consumption. | User issues are hard to reproduce; failures lack correlated context. | Add structured logs with requestId, taskId, agentId, workspace, exit code, and duration; separate app logs from agent output; redact secrets. |
| R12 | P2 | Configuration, port, and health behavior can become fragile. | Web proxies /api/* to daemon; dev startup coordinates Next.js and daemon ports. | Port conflicts, daemon-not-ready states, and mismatched environment variables can break startup or distribution. | Centralize config resolution; expose /health; add daemon readiness checks; make port selection and UI fallback deterministic. |
| R13 | P2 | Cross-platform behavior is a recurring risk. | Daemon uses filesystem paths, SQLite native bindings, shell/process behavior, and signals. | macOS, Linux, and Windows/WSL can differ in path normalization, quoting, permissions, and process termination. | Use Node path APIs consistently, avoid shell string composition, isolate platform-specific process logic, and add CI coverage for supported platforms. |
| R14 | P2 | Framework migration can distract from core maintainability issues. | Current complexity is concentrated in FS/spawn/SSE/SQLite and module boundaries. | A framework rewrite can consume time while preserving the risky domain logic. | Keep Express for now; revisit Fastify only after TS, contracts, validation, tests, and modularization are in place and Express becomes a clear limiter. |
| R15 | P2 | Web/daemon boundary can erode over time. | Next.js has BFF capability and daemon has backend capability; future edits may blur ownership. | High-permission local runtime logic may leak into apps/web; deployment and security assumptions become unclear. | Document and enforce ownership: web handles UI/BFF/proxy; daemon owns local runtime capabilities; shared code contains contracts and pure logic only. |
| R16 | P3 | Operational documentation is incomplete. | Local-first daemon behavior depends on ports, .od, agent CLIs, runtime logs, and recovery flows. | Onboarding and support costs rise; troubleshooting relies on oral knowledge. | Document daemon architecture, API/SSE contract, task lifecycle, .od data layout, agent dependency checks, and common recovery procedures. |

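To make the R7 and R8 plans more concrete, the following is a minimal sketch of a versioned SSE event union and a unified error shape. The event names and the error fields come from the risk list; everything else (the `v` version field, the payload fields such as `chunk` and `exitCode`, the helper names, and the choice to treat `task.error` as terminal) is an assumption for illustration and is not taken from the actual shared contract package.

```typescript
// Hypothetical contract sketch. Field shapes and helper names are assumptions.
export const SSE_PROTOCOL_VERSION = 1;

// Unified error model with the fields listed under R7.
export interface ApiError {
  code: string;
  message: string;
  details?: unknown;
  retryable: boolean;
  requestId?: string;
  taskId?: string;
}

// Canonical events listed under R8, modeled as a discriminated union.
export type TaskEvent =
  | { type: "task.started"; taskId: string }
  | { type: "task.output"; taskId: string; chunk: string }
  | { type: "task.error"; taskId: string; error: ApiError }
  | { type: "task.completed"; taskId: string; exitCode: number }
  | { type: "task.cancelled"; taskId: string }
  | { type: "heartbeat" };

// Serialize one event as a text/event-stream frame. JSON.stringify escapes
// newlines inside strings, so the payload always fits on a single data line.
export function encodeSseEvent(event: TaskEvent): string {
  const data = JSON.stringify({ v: SSE_PROTOCOL_VERSION, ...event });
  return `event: ${event.type}\ndata: ${data}\n\n`;
}

// Assumption: these three event types end a task stream.
const TERMINAL_TYPES = new Set<TaskEvent["type"]>([
  "task.error",
  "task.completed",
  "task.cancelled",
]);

export function isTerminal(event: TaskEvent): boolean {
  return TERMINAL_TYPES.has(event.type);
}
```

With a union like this, both the daemon writer and the frontend parser can switch on `type` and let the compiler flag unhandled events, which is the kind of drift R8 describes.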

Optimization Dependencies

The optimization work should proceed in dependency order. Some items can run in parallel once their prerequisites are stable.

| Workstream | Status | Optimization | Covers | Depends on | Output |
| --- | --- | --- | --- | --- | --- |
| W1 | Completed | Confirm architecture and capability boundaries | R4, R15 | None | Written ownership rules for web, daemon, shared contracts, and dangerous local capabilities. See specs/current/architecture-boundaries.md. |
| W2 | Completed | Define API, SSE, and error contracts | R2, R7, R8 | W1 | packages/contracts now provides shared request/response types, SSE event unions, and error model helpers consumed by web and daemon. |
| W3 | Completed | Migrate project-owned code to TypeScript | R1 | W2 for highest-value shared types | Daemon, root scripts, and e2e support now use TypeScript sources; daemon compiles to apps/daemon/dist; residual JS is checked by pnpm check:residual-js. |
| W4 | Planned | Add runtime validation at daemon boundaries | R3, R4 | W2 | Schemas for HTTP requests, paths, agents, models, uploads, task IDs, and command args. |
| W5 | Planned | Modularize server.ts | R6 | W2, W3, W4 | Thin route handlers plus services/adapters for agents, DB, FS, streams, and artifacts. |
| W6 | Planned | Introduce agent process/task manager | R5, R8, R11 | W2, W5 | Task state machine, cancellation, timeout, cleanup, exit handling, and concurrency controls. |
| W7 | Planned | Strengthen SQLite migrations | R9 | W5 or a clear DB adapter boundary | Migration table, ordered migrations, startup checks, backup strategy, migration tests. |
| W8 | Planned | Build the daemon test pyramid | R10 | W2, W4, W5 | Contract tests, route integration tests, service unit tests, migration tests, SSE tests, and mocked agent-process tests. |
| W9 | Planned | Add structured logs and observability | R11 | W2, W6 | Correlated request/task logs, sanitized agent output, durations, exit status, and diagnostic context. |
| W10 | Planned | Harden config, port, and readiness behavior | R12 | W1 | Centralized config, /health, readiness checks, deterministic port behavior. |
| W11 | Planned | Harden cross-platform behavior | R13 | W4, W5, W6 | Platform-specific process handling, path normalization rules, supported-platform CI. |
| W12 | Planned | Revisit HTTP framework choice | R14 | W2, W3, W4, W5, W8 | Evidence-based decision on whether Express remains adequate or Fastify provides clear net value. |
| W13 | Planned | Complete operational documentation | R16 | W1 through W11 as sections stabilize | Current-state docs, runbooks, troubleshooting guides, and recovery procedures. |

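The task state machine named as a W6 output could look roughly like the sketch below. The state names and the allowed transitions are assumptions chosen for illustration; the roadmap itself only calls for a state machine with cancellation, timeout, and exit handling.

```typescript
// Hypothetical task lifecycle. State names and transitions are assumptions.
type TaskState =
  | "queued"
  | "running"
  | "completed"
  | "failed"
  | "cancelled"
  | "timed_out";

// Allowed transitions; states with no outgoing edges are terminal.
const TRANSITIONS: Record<TaskState, readonly TaskState[]> = {
  queued: ["running", "cancelled"],
  running: ["completed", "failed", "cancelled", "timed_out"],
  completed: [],
  failed: [],
  cancelled: [],
  timed_out: [],
};

class TaskStateMachine {
  private current: TaskState = "queued";

  get state(): TaskState {
    return this.current;
  }

  get isTerminal(): boolean {
    return TRANSITIONS[this.current].length === 0;
  }

  // Reject illegal moves loudly so a stalled or double-finished task
  // surfaces as an error instead of silently corrupting task state.
  transition(next: TaskState): void {
    if (!TRANSITIONS[this.current].includes(next)) {
      throw new Error(`Illegal task transition: ${this.current} -> ${next}`);
    }
    this.current = next;
  }
}
```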
```text
Phase 1: W1 -> W2 -> W3 -> W4
Phase 2: W5 -> W6 -> W7 -> W8
Phase 3: W9 -> W10 -> W11 -> W13
Phase 4: W12
```
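As a quick sanity check, the phase sequence can be verified against the "Depends on" column. The sketch below encodes both by hand. The function name is hypothetical, and two soft dependencies are simplified as assumptions: W7 is treated as depending on W5, and W13 as depending on all of W1 through W11.

```typescript
// Dependencies transcribed from the workstream table (simplified, see lead-in).
const dependsOn: Record<string, string[]> = {
  W1: [],
  W2: ["W1"],
  W3: ["W2"],
  W4: ["W2"],
  W5: ["W2", "W3", "W4"],
  W6: ["W2", "W5"],
  W7: ["W5"],
  W8: ["W2", "W4", "W5"],
  W9: ["W2", "W6"],
  W10: ["W1"],
  W11: ["W4", "W5", "W6"],
  W12: ["W2", "W3", "W4", "W5", "W8"],
  W13: Array.from({ length: 11 }, (_, i) => `W${i + 1}`), // W1..W11
};

// Phases transcribed from the phase list.
const phases: string[][] = [
  ["W1", "W2", "W3", "W4"],
  ["W5", "W6", "W7", "W8"],
  ["W9", "W10", "W11", "W13"],
  ["W12"],
];

// Walk the order once and report every workstream that is scheduled
// before one of its prerequisites has appeared.
function findOrderViolations(
  order: string[],
  deps: Record<string, string[]>,
): string[] {
  const done = new Set<string>();
  const violations: string[] = [];
  for (const workstream of order) {
    for (const dep of deps[workstream] ?? []) {
      if (!done.has(dep)) {
        violations.push(`${workstream} scheduled before ${dep}`);
      }
    }
    done.add(workstream);
  }
  return violations;
}

const violations = findOrderViolations(phases.flat(), dependsOn);
console.log(violations.length === 0 ? "phase order is consistent" : violations);
```

Under these assumptions the listed phases respect every dependency. A check like this only validates a strictly sequential reading, so it does not show which workstreams could safely run in parallel.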

The core principle is to reduce risk before changing framework foundations: establish contracts, types, validation, and module boundaries first; then evaluate whether Express remains the right transport layer.
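As one concrete instance of the validation step named above, the "normalize and jail paths" rule from R4 can be sketched with Node's `path` module. The function name and error message are hypothetical. The sketch is lexical only: it does not resolve symlinks, so a real implementation would likely also need a `realpath`-based check.

```typescript
import * as path from "node:path";

// Hypothetical helper: resolve a caller-supplied path and reject anything
// that would land outside the workspace root.
function resolveInsideWorkspace(workspaceRoot: string, requested: string): string {
  const root = path.resolve(workspaceRoot);
  // An absolute `requested` replaces `root` here; the check below catches it.
  const target = path.resolve(root, requested);
  const relative = path.relative(root, target);

  const escapesRoot =
    relative === ".." ||
    relative.startsWith(".." + path.sep) ||
    path.isAbsolute(relative); // e.g. a different drive on Windows

  if (escapesRoot) {
    throw new Error(`Path escapes workspace: ${requested}`);
  }
  return target;
}
```

Comparing the relative path rather than using a string prefix test avoids accepting sibling directories such as `/work/project-evil` when the root is `/work/project`.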