Maintainability Roadmap
Purpose
This document captures the maintainability risks in the current apps/web + apps/daemon architecture and the recommended optimization path.
The architectural boundary stays unchanged:
- apps/web: Next.js frontend and thin BFF/proxy layer.
- apps/daemon: local runtime/backend for SQLite, .od filesystem state, AI agent CLI processes, and SSE streaming.
The first-principles maintainability goals are:
- Understandability: engineers can locate behavior quickly and reason about data flow.
- Changeability: common changes can be made with bounded blast radius.
- Verifiability: contracts, tests, and types catch regressions early.
- Isolation: high-risk capabilities are contained behind explicit boundaries.
- Recoverability: failures produce actionable state, logs, and cleanup behavior.
Priority Scale
| Priority | Meaning |
|---|---|
| P0 | Blocks safe evolution or creates high-risk runtime/security failure modes. |
| P1 | Major maintainability risk that increases regression and debugging cost. |
| P2 | Medium-term risk that affects reliability, portability, or architecture clarity. |
| P3 | Supporting documentation/process improvement. |
Risk List and Optimization Plan
| ID | Priority | Risk | Evidence | Impact | Optimization Plan |
|---|---|---|---|---|---|
| R1 | P0 | Daemon lacks TypeScript type checking. | apps/daemon is mostly JavaScript while handling API payloads, SQLite rows, filesystem paths, child processes, and SSE events. | API payloads, DB rows, agent events, and task states can drift silently; refactors are riskier. | Add gradual TypeScript support with allowJs; write new daemon modules in .ts; first type API payloads, SSE events, task lifecycle, DB rows, and agent definitions. |
| R2 | P0 | Web/daemon API contract is implicit. | apps/web calls daemon through /api/* rewrites; web has TypeScript types, daemon returns manually shaped JSON. | Field mismatches surface at runtime; API evolution is fragile. | Create packages/api-contract or an equivalent shared contract layer for request, response, error, and SSE event types. |
| R3 | P0 | Runtime validation is incomplete at the daemon boundary. | Daemon requests can trigger local filesystem access, SQLite writes, and child_process.spawn(). | Type correctness alone cannot protect against malformed runtime input, path traversal, invalid agent IDs, or unsafe args. | Add schema validation at HTTP boundaries with Zod/TypeBox; centralize validation for workspace paths, task IDs, agent IDs, models, reasoning options, uploaded files, and command arguments. |
| R4 | P0 | Local capability security boundary needs explicit rules. | Daemon owns high-permission capabilities: local files, .od, project workspaces, agent CLIs, and logs. | Unsafe path handling, broad command execution, token leakage, and unintended workspace access become possible failure modes. | Treat daemon as a capability server: bind to localhost, use workspace/path allowlists, normalize and jail paths, allowlist agent commands, and redact sensitive output. |
| R5 | P0 | Agent process lifecycle needs a first-class manager. | /api/chat spawns multiple agent runtimes and streams output to the frontend. | Zombie processes, cancellation gaps, orphaned tasks, inconsistent exit handling, and concurrent process conflicts. | Introduce a process/task manager with task state machine, cancellation, timeout, cleanup, exit code capture, signal handling, and concurrency limits. |
| R6 | P1 | server.ts is too monolithic. | apps/daemon/src/server.ts contains many routes plus orchestration, filesystem logic, streaming, uploads, and artifact handling. | Harder to understand, test, and change; unrelated edits share the same file and increase regression risk. | Split into thin routes plus services/adapters: routes/, services/, agents/, db/, fs/, streams/, artifacts/. |
| R7 | P1 | Error handling is inconsistent. | Handlers commonly use local try/catch and return ad hoc JSON errors. | UI receives inconsistent failures; logs lose context; task state can stall after partial failures. | Define a unified error model with code, message, details, retryable, and requestId/taskId; add centralized Express error middleware and adapter-level error mapping. |
| R8 | P1 | SSE protocol is under-specified. | Daemon manually writes text/event-stream events for agent output and status. | Frontend parsing is fragile; disconnect, heartbeat, terminal events, and error semantics can drift. | Version the SSE event contract and define canonical events such as task.started, task.output, task.error, task.completed, task.cancelled, and heartbeat. |
| R9 | P1 | SQLite schema and migration lifecycle need stronger guarantees. | apps/daemon/src/db.ts owns local better-sqlite3 tables and migrations. | Local user data upgrades can fail unpredictably; schema drift is hard to diagnose and recover. | Add explicit migration table, ordered forward migrations, startup migration checks, schema version logging, backup-before-migrate strategy, and migration tests. |
| R10 | P1 | Test coverage is thin around daemon behavior. | Existing daemon tests focus on stream parsing and artifact manifest behavior; HTTP/DB/spawn flows have limited coverage. | Changes are validated by manual testing; regressions in filesystem, SQLite, SSE, or agent mocks can ship. | Build layered tests: shared contract tests, route integration tests, service unit tests, SQLite migration tests, SSE parser tests, and agent mock integration tests. |
| R11 | P1 | Logging and observability are insufficient for local runtime debugging. | Agent execution involves long-lived tasks, subprocess output, filesystem state, and frontend SSE consumption. | User issues are hard to reproduce; failures lack correlated context. | Add structured logs with requestId, taskId, agentId, workspace, exit code, and duration; separate app logs from agent output; redact secrets. |
| R12 | P2 | Configuration, port, and health behavior can become fragile. | Web proxies /api/* to daemon; dev startup coordinates Next.js and daemon ports. | Port conflicts, daemon-not-ready states, and mismatched environment variables can break startup or distribution. | Centralize config resolution; expose /health; add daemon readiness checks; make port selection and UI fallback deterministic. |
| R13 | P2 | Cross-platform behavior is a recurring risk. | Daemon uses filesystem paths, SQLite native bindings, shell/process behavior, and signals. | macOS, Linux, and Windows/WSL can differ in path normalization, quoting, permissions, and process termination. | Use Node path APIs consistently, avoid shell string composition, isolate platform-specific process logic, and add CI coverage for supported platforms. |
| R14 | P2 | Framework migration can distract from core maintainability issues. | Current complexity is concentrated in FS/spawn/SSE/SQLite and module boundaries. | A framework rewrite can consume time while preserving the risky domain logic. | Keep Express for now; revisit Fastify only after TS, contracts, validation, tests, and modularization are in place and Express becomes a clear limiter. |
| R15 | P2 | Web/daemon boundary can erode over time. | Next.js has BFF capability and daemon has backend capability; future edits may blur ownership. | High-permission local runtime logic may leak into apps/web; deployment and security assumptions become unclear. | Document and enforce ownership: web handles UI/BFF/proxy; daemon owns local runtime capabilities; shared code contains contracts and pure logic only. |
| R16 | P3 | Operational documentation is incomplete. | Local-first daemon behavior depends on ports, .od, agent CLIs, runtime logs, and recovery flows. | Onboarding and support costs rise; troubleshooting relies on oral knowledge. | Document daemon architecture, API/SSE contract, task lifecycle, .od data layout, agent dependency checks, and common recovery procedures. |
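As an illustration of the R4 plan to "normalize and jail paths", the following is a minimal sketch of a path check built only on Node's path module. The helper name resolveInsideWorkspace and its signature are hypothetical and do not come from the daemon codebase. The sketch is lexical only: it does not resolve symlinks, so a real implementation would likely also compare fs.realpath results.

```typescript
import * as path from "node:path";

/**
 * Hypothetical helper (not from the daemon codebase): resolve `requested`
 * against `workspaceRoot` and reject any result that leaves the workspace.
 * This check is purely lexical and does not follow symlinks.
 */
function resolveInsideWorkspace(workspaceRoot: string, requested: string): string {
  const root = path.resolve(workspaceRoot);
  // An absolute `requested` replaces the root here, so it is caught below.
  const candidate = path.resolve(root, requested);
  const relative = path.relative(root, candidate);

  const escapes =
    relative === ".." ||
    relative.startsWith(".." + path.sep) ||
    path.isAbsolute(relative); // e.g. a different drive on Windows

  if (escapes) {
    throw new Error(`Path escapes workspace: ${requested}`);
  }
  return candidate;
}
```

Under these assumptions, a request for src/a.ts resolves to a path under the workspace root, while ../secret or an absolute path outside the root throws before any filesystem call is made. Centralizing the check in one function is what lets route handlers stay thin.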
Optimization Dependencies
The optimization work should proceed in dependency order. Some items can run in parallel once their prerequisites are stable.
| Workstream | Status | Optimization | Covers | Depends on | Output |
|---|---|---|---|---|---|
| W1 | Completed | Confirm architecture and capability boundaries | R4, R15 | — | Written ownership rules for web, daemon, shared contracts, and dangerous local capabilities. See specs/current/architecture-boundaries.md. |
| W2 | Completed | Define API, SSE, and error contracts | R2, R7, R8 | W1 | packages/contracts now provides shared request/response types, SSE event unions, and error model helpers consumed by web and daemon. |
| W3 | Completed | Migrate project-owned code to TypeScript | R1 | W2 for highest-value shared types | Daemon, root scripts, and e2e support now use TypeScript sources; daemon compiles to apps/daemon/dist; residual JS is checked by pnpm check:residual-js. |
| W4 | Planned | Add runtime validation at daemon boundaries | R3, R4 | W2 | Schemas for HTTP requests, paths, agents, models, uploads, task IDs, and command args. |
| W5 | Planned | Modularize server.ts | R6 | W2, W3, W4 | Thin route handlers plus services/adapters for agents, DB, FS, streams, and artifacts. |
| W6 | Planned | Introduce agent process/task manager | R5, R8, R11 | W2, W5 | Task state machine, cancellation, timeout, cleanup, exit handling, and concurrency controls. |
| W7 | Planned | Strengthen SQLite migrations | R9 | W5 or a clear DB adapter boundary | Migration table, ordered migrations, startup checks, backup strategy, migration tests. |
| W8 | Planned | Build the daemon test pyramid | R10 | W2, W4, W5 | Contract tests, route integration tests, service unit tests, migration tests, SSE tests, and mocked agent-process tests. |
| W9 | Planned | Add structured logs and observability | R11 | W2, W6 | Correlated request/task logs, sanitized agent output, durations, exit status, and diagnostic context. |
| W10 | Planned | Harden config, port, and readiness behavior | R12 | W1 | Centralized config, /health, readiness checks, deterministic port behavior. |
| W11 | Planned | Harden cross-platform behavior | R13 | W4, W6, W5 | Platform-specific process handling, path normalization rules, supported-platform CI. |
| W12 | Planned | Revisit HTTP framework choice | R14 | W2, W3, W4, W5, W8 | Evidence-based decision on whether Express remains adequate or Fastify provides clear net value. |
| W13 | Planned | Complete operational documentation | R16 | W1 through W11 as sections stabilize | Current-state docs, runbooks, troubleshooting guides, and recovery procedures. |
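To make the W6 output more concrete, here is a minimal sketch of a task state machine as a transition table. The state names (queued, running, completed, failed, cancelled, timed_out) and the function names are assumptions chosen for illustration; the roadmap does not fix them, and the real task manager may need more states.

```typescript
// Hypothetical task states; the roadmap does not prescribe these names.
type TaskState =
  | "queued"
  | "running"
  | "completed"
  | "failed"
  | "cancelled"
  | "timed_out";

// Allowed transitions. Terminal states have no outgoing edges.
const TRANSITIONS: Record<TaskState, readonly TaskState[]> = {
  queued: ["running", "cancelled"],
  running: ["completed", "failed", "cancelled", "timed_out"],
  completed: [],
  failed: [],
  cancelled: [],
  timed_out: [],
};

// A state is terminal when nothing can follow it.
function isTerminal(state: TaskState): boolean {
  return TRANSITIONS[state].length === 0;
}

// Return the next state, or throw if the transition is not in the table.
function transition(current: TaskState, next: TaskState): TaskState {
  if (!TRANSITIONS[current].includes(next)) {
    throw new Error(`Illegal task transition: ${current} -> ${next}`);
  }
  return next;
}
```

With a table like this, a late exit event for a task that was already cancelled is rejected explicitly instead of silently overwriting the terminal state, which is one way to address the "inconsistent exit handling" impact listed under R5.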
Recommended Execution Order
Phase 1: W1 -> W2 -> W3 -> W4
Phase 2: W5 -> W6 -> W7 -> W8
Phase 3: W9 -> W10 -> W11 -> W13
Phase 4: W12
The core principle is to reduce risk before changing framework foundations: establish contracts, types, validation, and module boundaries first; then evaluate whether Express remains the right transport layer.
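The recommended order can be checked mechanically against the "Depends on" column. The sketch below transcribes the dependencies from the workstream table and flattens the four phases into one sequence. The function name findOrderViolations is hypothetical, and two entries are simplified as assumptions: W7 is modeled as depending on W5 only, and "W1 through W11" is expanded for W13.

```typescript
// Dependencies transcribed from the workstream table (simplified as noted above).
const dependsOn: Record<string, string[]> = {
  W1: [],
  W2: ["W1"],
  W3: ["W2"],
  W4: ["W2"],
  W5: ["W2", "W3", "W4"],
  W6: ["W2", "W5"],
  W7: ["W5"], // "W5 or a clear DB adapter boundary"
  W8: ["W2", "W4", "W5"],
  W9: ["W2", "W6"],
  W10: ["W1"],
  W11: ["W4", "W5", "W6"],
  W12: ["W2", "W3", "W4", "W5", "W8"],
  W13: ["W1", "W2", "W3", "W4", "W5", "W6", "W7", "W8", "W9", "W10", "W11"],
};

// Phases 1 through 4 flattened into a single sequence.
const executionOrder: string[] = [
  "W1", "W2", "W3", "W4",
  "W5", "W6", "W7", "W8",
  "W9", "W10", "W11", "W13",
  "W12",
];

// Hypothetical helper: list every workstream that is not preceded by a dependency.
function findOrderViolations(
  order: string[],
  deps: Record<string, string[]>,
): string[] {
  const position = new Map<string, number>();
  order.forEach((id, index) => position.set(id, index));

  const violations: string[] = [];
  order.forEach((id, index) => {
    for (const dep of deps[id] ?? []) {
      const depIndex = position.get(dep);
      if (depIndex === undefined || depIndex > index) {
        violations.push(`${id} is not preceded by its dependency ${dep}`);
      }
    }
  });
  return violations;
}
```

Calling findOrderViolations(executionOrder, dependsOn) returns an empty list for the order above, so the phases are consistent with the stated dependencies. The same check could run in CI if the table is ever kept in machine-readable form.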