Requirements — AI Employee Platform
Version: 1.4 (2026-06-16) · Status:
source of truth for what the app must do. v1.1 repositioned the
platform as RunStack-owned, sold to clients (Makro = first client), one
single-tenant deployment per client hosting multiple AI Employees, two
portal roles (RunStack admin + single client role). v1.2 pulled
file-attachment ingestion into v1, made adapters onboarding-defined and
shareable across employees, added run-recovery via a DB-backed job
queue, persona (SOUL) config, a skills registry, action confidence, and
the client onboarding artifact pack (client-artifacts/).
v1.3 relaxes provider pinning to a per-deployment provider
allowlist with redaction as the measured primary privacy
control (ADR-007 revised), adds per-employee eval
packs, and mandates the employee-package repo
layout (shared engine + self-contained employee folders). v1.4
adds FR-39a (staged autonomy ramp: a distinct Shadow
stage + per-action-type graduation) and records a post-12.8
implementation-status reconciliation (below).
Implementation-status reconciliation (2026-06-16). A post-Phase-12.8 audit (3 analysts + direct code verification) found several MUST requirements whose mechanism is built and unit-tested but not yet triggered in the deployed system — they are not met until wired (guardrail G-32). These are tracked in the prioritized backlog
OPEN-ITEMS.md (P0): FR-29/FR-32 (email delivery — outbox has no email transport; BL-04), FR-19/FR-30 (degraded-mode auto-engage + breaker/DLQ exist but the error-budget evaluator is unscheduled and the breaker/DLQ are unwired; BL-02/BL-06), FR-11a (Payload Jobs Queue has no runner; BL-01), and the approval-expiry/escalation path (zero callers; BL-03). This doc states the requirement; the backlog tracks the wiring. No requirement is being weakened — the gap is liveness, not intent. Companions: design in superpowers/specs/2026-06-06-ai-employee-platform-design.md · decisions in ARCHITECTURE-REVIEW.md (18 ADRs) · review checklist in REVIEW-CRITERIA.md · development invariants in ARCHITECTURE-GUARDRAILS.md.
This document states requirements, not implementation. Each is testable and carries a priority: MUST (v1 blocker), SHOULD (v1 if affordable), DEFER (explicitly out of v1, with a trigger). Requirement IDs are stable references for the plan, reviews, and tests.
1. Vision
A RunStack-owned platform that deploys autonomous AI Employees to client businesses — each employee a role with responsibilities, KPIs, a scorecard, and a command center — owning an operational outcome end-to-end rather than answering ad-hoc queries. RunStack sells this to small and medium businesses that want autonomous AI employees; each client gets their own single-tenant deployment on a separate server, and one deployment can host multiple AI Employees for that client (e.g., an Ops Analyst today, a Project Manager later).
The engagement model is like hiring an employee: during onboarding, RunStack works with the client to define the role's KPIs, scorecard parameters, workflows, and training (directives); the client then monitors the employee's performance through the command center the way they would review a human employee.
The first client is Makro Agency; the first employee is the Ops Analyst: every working day it reads the agency's operational and financial systems, scores the business against a defined scorecard, flags what needs attention, and reports — escalating to humans only when its rules say to. It also answers questions conversationally in Slack. It learns from its own grounded observations over time, within human-set guardrails.
The product is privacy-first (the client's business and end-customer PII), single-tenant by design (one deployment per client), and built so a second AI Employee in the same deployment — or a new client deployment — is a configuration/onboarding exercise, not a rewrite.
2. Users
Two portal roles only: RunStack operates the platform; the client gets a single role (no multi-role client RBAC in v1). Other people at the client (CEO, leads) are report recipients via Slack/email, not portal users.
| Persona | Who | Needs from the platform |
|---|---|---|
| RunStack admin (super admin) | RunStack platform operator (Rahul) | Operate the deployment: system config, kill switch, onboarding setup (scorecard, KPIs, workflows, directives), cross-employee oversight. |
| Tenant admin (single client-side role) | The client's designated owner (for Makro: Mike or a designate) | Log in to the command center: see scorecards and employee performance, monitoring/observability, approve/reject actions, review and govern the agent's memory. |
| Report recipients | Client CEO, ops/delivery leads | A trustworthy daily read on financial + delivery health in Slack/email; targeted flags routed to them; ability to ask follow-ups in Slack. Not portal users. |
| The AI Employee | The agent itself | Read its memory + data, compute + reason, propose actions, write grounded insights, escalate. |
3. Scope
In for v1: one client deployment (Makro), one AI Employee (Ops Analyst) — with the architecture supporting additional employees in the same deployment as config; daily cron analysis + Slack conversational mode; hardened file-attachment ingestion (v1.2); scorecard evaluation; escalation rules; tiered autonomy with human approval for high-impact actions; action confidence guardrail; two-tier self-learning memory; per-employee persona (SOUL) + tenant skills registry, both onboarding-configured; queue-managed run execution with automatic retry; Slack + email delivery; a command center portal where the client logs in for monitoring, scorecard/performance, token/cost visibility, approvals, and memory governance.
Out for v1 (see §7 for triggers): a second AI Employee (Project Manager is the planned next role), multiple client-side portal roles, a multi-agent supervisor, Redis, mid-graph human interrupts, and any portal for the client's own end-customers. A second RunStack client = a new single-tenant deployment, not a multi-tenant build.
4. Functional Requirements
4.1 Data Collection
- FR-1 (MUST) — The agent collects data from Productive.io, QuickBooks, Jira, Gmail, and Slack via a uniform adapter interface (BaseAdapter: connect/fetch/transform/health_check/get_metadata).
- FR-2 (MUST) — Adding a new data source is one adapter file + registry registration; no change to graph or schema.
- FR-2a (MUST) — An employee's adapters are defined during onboarding as configuration (EmployeeAdapterConfigs: which adapters, query scope, cadence) — not code changes per employee.
- FR-2b (MUST) — Adapter connections (credentials) are tenant-scoped and shareable across employees in the same deployment: one Connection (e.g., the tenant's Productive.io account) can serve multiple AI Employees, each with its own query scope.
- FR-3 (MUST) — Collection tolerates partial failure: if a source is down, the run continues and the report lists the missing source in a Data Quality section.
- FR-4 (MUST) — Adapters support both batch mode (cron, pre-collected) and real-time query mode (conversational).
- FR-5 (MUST) — (changed in v1.2 — was deferred under ADR-017) File attachments (PDF/CSV/XLSX/images) encountered on sources are ingested in v1 through a hardened pipeline that satisfies the full ING-1…12 contract (REVIEW-CRITERIA.md): extraction at the adapter boundary treated as untrusted, magic-byte type allowlist, size/zip-bomb caps, sandboxed parsing, SSRF controls on embedded URLs, spotlighting of extracted text, Presidio redaction over all extracted text, a PII-in-pixels policy (extract text and send redacted text only — native PDFs/images never reach the model), ZDR-eligible inline API paths only (never Files API/Batch), and provenance + audit before any memory write. Files that fail the pipeline are skipped, logged with provenance, and reported in the Data Quality section.
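As an illustration of FR-1 to FR-3, the following is a minimal sketch of an adapter interface, a registry, and a collection loop that tolerates a failing source. Only the five method names come from the requirement; the signatures, the `register` decorator, the `collect` helper, and the stub adapters are hypothetical and exist only for this example.

```python
from abc import ABC, abstractmethod
from typing import Any

Record = dict[str, Any]


class BaseAdapter(ABC):
    """Adapter surface named in FR-1; the signatures are assumptions."""

    name: str

    @abstractmethod
    def connect(self) -> None: ...

    @abstractmethod
    def fetch(self, scope: Record) -> list[Record]: ...

    @abstractmethod
    def transform(self, raw: list[Record]) -> list[Record]: ...

    @abstractmethod
    def health_check(self) -> bool: ...

    @abstractmethod
    def get_metadata(self) -> Record: ...


# Hypothetical registry: adding a source means one class plus one registration.
REGISTRY: dict[str, type[BaseAdapter]] = {}


def register(cls: type[BaseAdapter]) -> type[BaseAdapter]:
    REGISTRY[cls.name] = cls
    return cls


class StubAdapter(BaseAdapter):
    """In-memory stand-in so the sketch runs without network access."""

    name = "stub"

    def connect(self) -> None:
        pass

    def fetch(self, scope: Record) -> list[Record]:
        return [{"key": "OPS-1", "status": "Done"}]

    def transform(self, raw: list[Record]) -> list[Record]:
        return [{"issue": r["key"], "done": r["status"] == "Done"} for r in raw]

    def health_check(self) -> bool:
        return True

    def get_metadata(self) -> Record:
        return {"source": self.name}


@register
class StubJira(StubAdapter):
    name = "jira"


@register
class StubQuickBooks(StubAdapter):
    name = "quickbooks"

    def connect(self) -> None:
        # Simulates a source outage to exercise the partial-failure path.
        raise ConnectionError("source unavailable")


def collect(adapter_configs: list[Record]) -> tuple[dict[str, list[Record]], list[Record]]:
    """Collect from every configured adapter; a failing source is recorded, not fatal."""
    data: dict[str, list[Record]] = {}
    missing: list[Record] = []
    for cfg in adapter_configs:
        adapter = REGISTRY[cfg["adapter"]]()
        try:
            adapter.connect()
            data[adapter.name] = adapter.transform(adapter.fetch(cfg.get("scope", {})))
        except Exception as exc:
            missing.append({"source": adapter.name, "reason": str(exc)})
    return data, missing
```

Calling `collect([{"adapter": "jira"}, {"adapter": "quickbooks"}])` returns the Jira rows plus a `missing` list that a report could render in its Data Quality section.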
4.2 Daily Analysis (cron path)
- FR-6 (MUST) — On a daily schedule, the agent runs an 11-node analysis: load memory → collect → compute KPIs → evaluate scorecard → check escalations → route autonomy → generate report → grounding gate → consolidate insight → store results.
- FR-7 (MUST) — If source data is unchanged since the last run, the agent exits silently consuming zero LLM tokens (wakeAgent gate). (ADR-010)
- FR-8 (MUST) — KPI values (utilization, LER, cash runway, AR aging, project burn, DSO, etc.) are computed deterministically in code, never by the LLM.
- FR-9 (MUST) — Scorecard ratings are computed deterministically; the LLM writes only the justification text. A rating/justification mismatch is a hard failure.
- FR-10 (MUST) — Every numeric claim in the report is verified by a grounding gate against the computed KPIs. Fail → one bounded retry → fail again → block delivery and escalate to a human.
- FR-11 (MUST) — Each run is idempotent on (tenant_id, employee_id, run_date, trigger); re-running a day upserts, never duplicates.
- FR-11a (MUST) — (new in v1.2) Run execution is queue-managed: the cron trigger enqueues a run job in a Postgres-backed job queue (Payload Jobs Queue — no Redis, consistent with ADR-003). A failed or crashed run is retried automatically with bounded backoff (resuming from the LangGraph checkpoint where possible, safe because runs are idempotent per FR-11); after retries are exhausted, the job parks as failed, alerts a human, and exposes a one-click manual re-run in the portal. The staleness monitor (NFR-10) remains the independent backstop watching for missing output.
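The grounding gate in FR-10 can be pictured with a small sketch. It assumes that a numeric claim is any number appearing in the report text and that a claim is grounded when it matches some computed KPI value within a tolerance; the real gate may match claims to specific KPIs. The function names, the regular expression, and the tolerance are all hypothetical.

```python
import re
from typing import Callable

# Assumption: a "numeric claim" is any unsigned integer or decimal in the text.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def ungrounded_claims(report: str, kpis: dict[str, float], tol: float = 0.01) -> list[float]:
    """Return every number in the report that matches no computed KPI value."""
    claims = [float(m) for m in NUMBER.findall(report)]
    return [c for c in claims if not any(abs(c - v) <= tol for v in kpis.values())]


def grounding_gate(generate: Callable[[int], str], kpis: dict[str, float], max_retries: int = 1) -> dict:
    """Try the first draft plus a bounded number of retries, then block and escalate."""
    for attempt in range(max_retries + 1):
        report = generate(attempt)
        if not ungrounded_claims(report, kpis):
            return {"status": "deliver", "report": report, "attempts": attempt + 1}
    return {"status": "blocked_escalate", "report": None, "attempts": max_retries + 1}
```

With `max_retries=1` the gate matches the requirement's sequence: fail, one bounded retry, fail again, then block delivery and hand off to a human.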
4.3 Scorecard & Escalation
- FR-12 (MUST) — The scorecard (categories, KPIs, thresholds, rating scale, gate-zero, escalation triggers) is configuration stored in Payload, read at runtime — not hardcoded. A new role = new config, same graph.
- FR-13 (MUST) — Escalation is a rule engine evaluating thresholds against the scorecard config; triggers produce flags routed by priority.
- FR-14 (SHOULD) — KPI storage is generic (name, value, unit, category, status, threshold) so a new role's KPIs need no schema migration.
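To make FR-9 and FR-12 to FR-14 concrete, here is a minimal sketch of config-driven rating and escalation. The threshold values, the green/amber/red scale, and the config keys are invented for illustration; the actual scorecard schema lives in Payload and is not shown in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kpi:
    """Generic KPI row; status is derived from config rather than stored here."""

    name: str
    value: float
    unit: str
    category: str


# Hypothetical scorecard config; in the platform this would be read at runtime.
SCORECARD_CONFIG = {
    "utilization": {"green_at_least": 75.0, "amber_at_least": 65.0,
                    "escalate_on": "red", "priority": "HIGH"},
    "cash_runway": {"green_at_least": 6.0, "amber_at_least": 3.0,
                    "escalate_on": "red", "priority": "CRITICAL"},
}


def rate(kpi: Kpi, config: dict) -> str:
    """Deterministic rating: pure threshold comparison, no LLM involved."""
    rule = config[kpi.name]
    if kpi.value >= rule["green_at_least"]:
        return "green"
    if kpi.value >= rule["amber_at_least"]:
        return "amber"
    return "red"


def escalations(kpis: list[Kpi], config: dict) -> list[dict]:
    """Rule engine: emit a prioritized flag for each KPI whose rating hits its trigger."""
    flags = []
    for kpi in kpis:
        status = rate(kpi, config)
        if status == config[kpi.name]["escalate_on"]:
            flags.append({"kpi": kpi.name, "status": status,
                          "priority": config[kpi.name]["priority"]})
    return flags
```

Because the rules are data, a new role would supply a different `SCORECARD_CONFIG` while `rate` and `escalations` stay unchanged.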
4.4 Autonomy & Approvals
- FR-15 (MUST) — Actions are routed by configurable tier: AUTO (execute), NOTIFY (alert, no approval), APPROVAL (draft → human approve → send). Tiers are per-action-type role config. (ADR-006)
- FR-16 (MUST) — The agent graph runs to completion and produces (report, proposed_actions[]); it never pauses mid-reasoning for approval (outbound-only). (ADR-002)
- FR-17 (MUST) — APPROVAL-tier actions are stored as pending_approval; a human approves/rejects in the portal; only then are they delivered.
- FR-18 (MUST) — Approval decisions are mapped to an authenticated, authorized user and recorded in the audit log. (SEC-9)
- FR-19 (MUST) — An operator kill switch halts all runs and deliveries; a degraded mode forces all actions to APPROVAL tier when an autonomy error budget is breached. (ADR-018)
- FR-19a (SHOULD) — (new in v1.2) Every proposed action carries a structured confidence score and evidence references. Confidence is a one-way guardrail: below a configurable threshold it escalates the action's tier (e.g., AUTO → APPROVAL) or suppresses the proposal entirely — it can never loosen a tier or bypass the deterministic rules. Tier routing by configured rules (FR-15) remains primary; self-reported LLM confidence is a secondary signal, not the decision-maker.
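The interaction between configured tiers (FR-15), degraded mode (FR-19), and the one-way confidence guardrail (FR-19a) can be sketched as below. The two threshold values, the action-type names, and the choice to default unknown action types to APPROVAL are assumptions made for this example.

```python
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    """Ordered so that a higher value is always the stricter tier."""

    AUTO = 0
    NOTIFY = 1
    APPROVAL = 2


def route(action_type: str, confidence: float, tier_config: dict[str, Tier], *,
          threshold: float = 0.7, suppress_below: float = 0.3,
          degraded: bool = False) -> Optional[Tier]:
    """Return the effective tier, or None when the proposal is suppressed."""
    # Configured rules are primary; unknown action types get the strictest tier.
    tier = tier_config.get(action_type, Tier.APPROVAL)
    if confidence < suppress_below:
        return None
    if confidence < threshold:
        # One-way guardrail: max() can only tighten, never loosen.
        tier = max(tier, Tier.APPROVAL)
    if degraded:
        # Degraded mode forces every action to APPROVAL.
        tier = Tier.APPROVAL
    return tier
```

Note that no branch lowers the tier: high confidence on an APPROVAL-tier action still returns APPROVAL.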
4.5 Conversational Mode
- FR-20 (MUST) — Users ask the Ops Analyst questions in Slack and receive answers grounded in real-time data + memory.
- FR-21 (MUST) — Slack events are acknowledged within 3 seconds and the answer is delivered asynchronously; conversational threads never collide with cron checkpoints. (ADR-018)
- FR-22 (MUST) — Conversational answers pass through the same redaction chokepoint as the cron path.
- FR-22a (MUST) — (new in v1.2) Each AI Employee has a persona definition (SOUL) — human-authored, versioned config defining personality, tone, voice, response style, language, and boundaries ("never do X"). It is loaded into both conversational and report-generation prompts. Authored with the client during onboarding (artifact 06), changed only by humans, and audit-logged like Directives.
4.6 Memory (Self-Learning)
- FR-23 (MUST) — Directives — human-authored, pinned, semantic memory (suppress/override/context/target). The agent reads them and can never write, edit, or override them. (anti-poisoning, ADR-015)
- FR-24 (MUST) — Insights — agent-authored episodic memory written at the end of each cron run, each citing the KPI rows that justify it (evidence required). Written via ADD/UPDATE/DELETE/NOOP with a contradiction check, not blind append.
- FR-25 (MUST) — Memory is read into evaluation, report generation, and conversational answers ("vs. prior runs we noted…").
- FR-26 (MUST) — An Insight is promoted to a Directive only by a human in the portal — never autonomously.
- FR-27 (SHOULD) — Insights can be retired/rolled back from the portal, flagging runs that consumed a bad insight for re-review. (RPI-13)
- FR-28 (SHOULD) — Escalation-suppressing insights require human review before they influence future runs. (SEC-4)
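One way to read the ADD/UPDATE/DELETE/NOOP write path in FR-24 is sketched below. The requirement does not define the contradiction check, so this example assumes a very simple one: insights are keyed by topic, a differing claim on the same topic replaces the old one, and a claim of `None` retracts it. The `Insight` shape and both helper functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Insight:
    topic: str
    claim: Optional[str]               # None means "retract what was noted before"
    evidence_kpi_ids: tuple[int, ...]  # KPI rows that justify the insight


def decide_op(new: Insight, store: dict[str, Insight]) -> str:
    """Pick the write operation instead of blindly appending."""
    if not new.evidence_kpi_ids:
        raise ValueError("insight rejected: evidence is required")
    existing = store.get(new.topic)
    if existing is None:
        return "NOOP" if new.claim is None else "ADD"
    if new.claim is None:
        return "DELETE"
    if new.claim == existing.claim:
        return "NOOP"
    # Same topic, different claim: treated here as a contradiction to resolve.
    return "UPDATE"


def apply_insight(new: Insight, store: dict[str, Insight]) -> str:
    """Apply the chosen operation to the in-memory store and report it."""
    op = decide_op(new, store)
    if op in ("ADD", "UPDATE"):
        store[new.topic] = new
    elif op == "DELETE":
        del store[new.topic]
    return op
```

Directives are deliberately absent from this sketch: per FR-23 and FR-26 the agent has no write path to them at all.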
4.7 Delivery
- FR-29 (MUST) — All outbound delivery (Slack, email) is owned by the gateway; the graph only produces delivery intents. (coherence 2c)
- FR-30 (MUST) — Delivery is reliable: transactional outbox written atomically with the run, idempotency key per (run_id, channel, recipient), bounded retries with backoff+jitter, per-endpoint circuit breaker, DLQ with scoped replay. (§7.4)
- FR-31 (MUST) — Delivery channels are resolved from config allowlists, never free-typed by the model. (SEC-3)
- FR-32 (MUST) — Priority routing: CRITICAL → CEO DM + email + dashboard; HIGH → #ops + dashboard; MEDIUM → relevant lead + dashboard; LOW → dashboard. Weekly digest emailed to CEO every Friday.
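The delivery rules in FR-30 to FR-32 combine three small mechanisms, sketched here under assumptions: the channel identifiers are placeholders for whatever the config allowlist defines, the key is a SHA-256 over the three fields, and the retry schedule uses the common "full jitter" variant of exponential backoff. None of these specifics are stated in the requirements.

```python
import hashlib
import random
from typing import Optional

# Priority routing from FR-32; channel names are illustrative placeholders.
ROUTES = {
    "CRITICAL": ["ceo_dm", "email", "dashboard"],
    "HIGH": ["ops_channel", "dashboard"],
    "MEDIUM": ["relevant_lead", "dashboard"],
    "LOW": ["dashboard"],
}


def idempotency_key(run_id: str, channel: str, recipient: str) -> str:
    """Stable key per (run_id, channel, recipient) so a replay cannot double-send."""
    return hashlib.sha256(f"{run_id}|{channel}|{recipient}".encode()).hexdigest()


def backoff_delays(max_attempts: int, base: float = 1.0, cap: float = 60.0,
                   rng: Optional[random.Random] = None) -> list[float]:
    """Bounded retry schedule: exponential growth, capped, with full jitter."""
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap, base * 2 ** attempt)) for attempt in range(max_attempts)]


def outbox_rows(run_id: str, priority: str, allowlist: dict[str, str]) -> list[dict]:
    """Build outbox rows; recipients come only from the config allowlist."""
    rows = []
    for channel in ROUTES[priority]:
        recipient = allowlist[channel]  # KeyError if the channel is not allowlisted
        rows.append({"key": idempotency_key(run_id, channel, recipient),
                     "channel": channel, "recipient": recipient})
    return rows
```

In a real outbox the `key` column would carry a unique constraint, so re-enqueuing the same intent for the same run becomes a no-op at the database level.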
4.8 Command Center (Portal)
- FR-33 (MUST) — The portal shows the scorecard, KPI history, agent run activity, and token/cost visibility — the client's window into employee performance (monitoring only — it cannot author AI Employees; those are configured by RunStack during onboarding via code/seed).
- FR-34 (MUST) — The portal provides the approval queue (approve/reject with full context) and the memory console (review/promote/retire Directives + Insights).
- FR-35 (MUST) — RBAC: two roles in v1 — super_admin (RunStack platform operator) and tenant_admin (the single client-side role: view + approve + memory governance). Finer-grained client roles (operator, viewer) are deferred (§7). Tenant scope enforced on reads and writes. (SEC-7)
- FR-36 (MUST) — Every mutation across collections is recorded in an append-only, immutable audit log. (SEC-8)
4.9 Skills & Onboarding Configuration (new in v1.2)
- FR-37 (MUST) — A skills registry: a skill is a named, versioned capability bundle (prompt instructions + allowlisted tools + optional adapter dependencies). Skills are tenant-scoped and assignable to multiple AI Employees in the same deployment; an employee's skill set is configuration, not code.
- FR-38 (MUST) — Onboarding a new AI Employee is a configuration exercise captured in the client artifact pack (docs/client-artifacts/): role charter, scorecard/KPIs, workflows, escalation + approval rules, data access/adapters, persona (SOUL), skills/tools, acceptance criteria. Each artifact maps 1:1 to platform config (ScorecardConfigs, AutonomyConfigs, EmployeeAdapterConfigs, PersonaConfigs, SkillConfigs) so the completed pack is the employee's configuration source.
- FR-39 (SHOULD) — A new AI Employee starts in a training/probation mode: all actions forced to APPROVAL tier for a configurable ramp period; autonomy widens only by explicit human sign-off against the scorecard.
- FR-39a (SHOULD) — (new in v1.4) The autonomy ramp is staged, not binary. (1) A Shadow stage precedes training: the employee runs and produces output that is reviewed privately, delivering nothing externally. (2) Training mode (FR-39) follows: actions are real but all APPROVAL-tier. (3) Graduation widens autonomy per action type, system-enforced — an action type moves to AUTO/NOTIFY only on its own human sign-off, never through a wholesale flip of the entire tier matrix. Current implementation is binary training_mode (no Shadow stage; graduation is a wholesale flip with per-action narrowing done manually in config) — see backlog BL-09 (Shadow) and BL-10 (per-action graduation). Until built, customer-facing copy describes the binary reality.
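The target behaviour of the staged ramp in FR-39a (not the current binary implementation) could look roughly like this. The `Stage` enum, the `signed_off` set, and the returned pair are hypothetical names chosen for the sketch.

```python
from enum import Enum


class Stage(Enum):
    SHADOW = "shadow"
    TRAINING = "training"
    GRADUATED = "graduated"


def effective_tier(stage: Stage, action_type: str, configured: dict[str, str],
                   signed_off: set[str]) -> tuple[str, bool]:
    """Return (tier, external_delivery_allowed) for one proposed action."""
    if stage is Stage.SHADOW:
        # Output is reviewed privately; nothing leaves the deployment.
        return "APPROVAL", False
    if stage is Stage.TRAINING or action_type not in signed_off:
        # Real actions, but each one still needs a human decision.
        return "APPROVAL", True
    # Graduation is per action type: only signed-off types use their configured tier.
    return configured.get(action_type, "APPROVAL"), True
```

Signing off `send_digest` widens only that action type; every other type stays at APPROVAL until it receives its own sign-off.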
5. Non-Functional Requirements
5.1 Privacy & Data Governance
- NFR-1 (MUST) — The tenant's business and end-customer PII is redacted before any LLM call and restored on output, at a single chokepoint that fails closed if redaction is unavailable. (ADR-011, SEC-5)
- NFR-2 (MUST) — (revised v1.3 — was Anthropic-only pinning) LLM calls are provider-agnostic through LiteLLM, constrained by a per-deployment provider allowlist: each entry records its data-retention posture (ZDR / no-training / retention period) and the client signs the allowlist at onboarding. LiteLLM may route intelligently (cost, latency, fallback) only among allowlisted entries; OpenRouter only with ZDR-endpoints-only enabled. Non-retention-safe features (Files API, Batch, code execution) forbidden on all routes. CI enforces: no route or fallback outside the allowlist. (ADR-007 rev. 2026-06-11, ADR-011)
- NFR-2a (MUST) — (new in v1.3) With provider pinning relaxed, redaction is the primary privacy control: the chokepoint remains fail-closed (NFR-1), and redaction effectiveness is measured, not assumed — a per-tenant recall eval (seeded synthetic PII + the tenant's real entity lists: client names, people, projects) runs in CI and on RedactionConfig changes; recall below threshold blocks deploys of redaction-affecting changes.
- NFR-3 (MUST) — Observability receives only masked content; no raw PII leaves to any third-party SaaS. (ADR-016, SEC-6)
- NFR-4 (SHOULD) — Data minimisation and a documented retention/residency map; tenant data is locatable and deletable across all stores. (GOV-2/3/4)
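The "measured, not assumed" clause in NFR-2a implies a recall metric and a deploy gate. A minimal sketch follows, assuming recall is the share of seeded entities that no longer appear verbatim in the redacted output and assuming a threshold of 0.95; the actual eval design and threshold are not specified here. It does not call Presidio; it only scores text that some redactor has already produced.

```python
def redaction_recall(seeded_entities: list[str], redacted_text: str) -> tuple[float, list[str]]:
    """Return (recall, leaked): the share of seeded entities removed, and any that survived."""
    if not seeded_entities:
        raise ValueError("recall is undefined without seeded entities")
    haystack = redacted_text.lower()
    leaked = [e for e in seeded_entities if e.lower() in haystack]
    return 1 - len(leaked) / len(seeded_entities), leaked


def deploy_allowed(recall: float, threshold: float = 0.95) -> bool:
    """Gate for redaction-affecting changes; the default threshold is hypothetical."""
    return recall >= threshold
```

Returning the leaked entities alongside the score gives a CI job something actionable to print when the gate fails.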
5.2 Security
- NFR-5 (MUST) — Stored credentials are encrypted at rest, hidden from portal reads, decryptable only by the agent service account. (ADR-014, SEC-1)
- NFR-6 (MUST) — Every internal seam authenticates; only a TLS reverse proxy is internet-facing; internal services are not published. (ADR-013, SEC-2/10)
- NFR-7 (MUST) — Adapter-fetched content is treated as untrusted input (spotlighting, structured output, allowlisted routing). (ADR-015, SEC-3)
5.3 Reliability
- NFR-8 (MUST) — Per-service failure modes are defined with graceful degradation; recovery is tested. Agent runs are idempotent and re-runnable. (REL dimension)
- NFR-9 (MUST) — A single retry owner per call path (no amplification); durable, replay-safe checkpoints (durability="sync", side-effect-free nodes). (ADR-018, REL-4/7/8)
- NFR-10 (MUST) — A dead-man / staleness monitor fires if the expected daily output doesn't appear by deadline. (REL-13)
5.4 Performance
- NFR-11 (MUST) — Slack acknowledgment within 3 seconds, answer delivered async. (PERF-1)
- NFR-12 (SHOULD) — A documented conversational latency budget with p95 targets; Presidio overhead measured. (PERF-2/3/4)
- NFR-13 (MUST) — Per-run and per-day token/cost ceilings enforced at the gateway; explicit recursion limit. (PERF-5)
- NFR-13a (SHOULD) — (revised v1.3) Cost-aware model tiering + routing: each graph node declares a model tier in config (cheap/fast for extraction + classification, frontier for reasoning + report prose); LiteLLM resolves tiers via model aliases and may route dynamically (cost/latency/fallback) within the per-deployment provider allowlist (NFR-2). Changing a node's tier mapping is a config change gated by the eval regression suite (NFR-17) — the eval gate, not provider pinning, is what keeps routing changes safe.
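NFR-13 and NFR-13a together describe two checks that sit in front of every model call: resolve the node's tier to a model inside the allowlist, and refuse to exceed the run's token ceiling. The sketch below uses plain dictionaries and does not call LiteLLM; the model identifiers, node names, and the `RunBudget` class are invented for illustration.

```python
# Hypothetical per-deployment allowlist with each entry's retention posture.
ALLOWLIST = {
    "vendor-a/small": {"retention": "ZDR"},
    "vendor-a/frontier": {"retention": "ZDR"},
}

# Hypothetical tier aliases, in preference order.
TIER_ALIASES = {
    "cheap": ["vendor-b/small", "vendor-a/small"],
    "frontier": ["vendor-a/frontier"],
}

# Each graph node declares a tier in config, not a concrete model.
NODE_TIERS = {"classify_attachment": "cheap", "generate_report": "frontier"}


def resolve_model(node: str) -> str:
    """Pick the first candidate for the node's tier that is on the allowlist."""
    for candidate in TIER_ALIASES[NODE_TIERS[node]]:
        if candidate in ALLOWLIST:
            return candidate
    raise LookupError(f"no allowlisted model for node {node!r}")


class RunBudget:
    """Per-run token ceiling, checked before the spend is recorded."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int) -> None:
        if self.used + tokens > self.max_tokens:
            raise RuntimeError("token ceiling exceeded")
        self.used += tokens
```

Here `vendor-b/small` is preferred for the cheap tier but skipped because it is not allowlisted, which is the behaviour the CI rule in NFR-2 is meant to guarantee.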
5.5 Autonomy & Safety
- NFR-14 (MUST) — The Ops Analyst is a bounded workflow, not an open-ended agent: fixed graph, deterministic spine, stopping conditions. (AUTO-1/2/3)
- NFR-15 (MUST) — High-impact actions require human approval; tool permissions are least-privilege; escalation is a structured output. (AUTO-4/5)
- NFR-16 (MUST) — The agent cannot modify its own behavioral rules; only human promotion changes Directives/prompts. (AUTO-8)
- NFR-17 (MUST) — Prompt/model/code changes are gated by offline eval regression on a versioned dataset; production traces scored online. (AUTO-9, spec §11)
- NFR-17a (MUST) — (new in v1.3) Each AI Employee ships with its own eval pack living in its employee folder (NFR-24): golden dataset (input snapshots → expected scorecard ratings + report properties), grown from the shadow/training period and from production human feedback (every approve/reject/edit on an action or report is captured as a labeled example). The eval pack is part of onboarding output — a new employee is not graduated (FR-39) without a baseline eval suite.
5.6 Observability & Operability
- NFR-18 (MUST) — Structured logs correlated by run_id/trace_id across services; per-node traces with token cost and latency. (OPS-3)
- NFR-19 (MUST) — No silent failures: every error produces an activity-log entry, Slack alert, or trace. (§9.4)
- NFR-20 (SHOULD) — Full stack runs locally with one command; config from env; runbook for top failure modes. (OPS-1/2/5)
- NFR-21 (MUST) — CI enforces the architecture's invariants (no-LLM-import-outside-chokepoint, no-fallback-on-sensitive-route, access-function tests, eval gate). (OPS-7)
5.7 Single-Tenant Fitness
- NFR-22 (MUST) — No infrastructure whose failure the single tenant could never perceive (no Redis, no HA cluster) — but expensive-to-retrofit seams (BaseAdapter, tenant_id plumbing, structured contracts, subgraph seam) exist from day one. (FIT dimension, ADR-003/005/012)
- NFR-23 (MUST) — Deferred capabilities carry documented reintroduction triggers. (FIT-5)
- NFR-24 (MUST) — (new in v1.3) Employee-package repo layout: the engine (graph, nodes, adapters, LLM chokepoint, security, core) is shared platform code; each AI Employee is a self-contained package folder (employees/<role>/) containing only its onboarding artifacts, configs, persona, prompt fragments, eval pack, and employee-specific tests — no engine code. Adding employee #5 must not touch employee #1's folder or the engine; an employee folder containing business logic is an architecture violation (it recreates the fork-per-role problem FR-12 exists to prevent). (spec §15)
6. Success Criteria
The v1 platform is accepted when:
- The Ops Analyst runs daily, produces a grounded report with zero ungrounded numeric claims, and delivers it reliably.
- The CEO can answer the gate-zero questions (cash, utilization, delivery health, escalations) from the daily output without opening a spreadsheet.
- A user gets a correct, grounded answer to a Slack question within the latency budget, with Slack acknowledged in 3s.
- The agent writes at least one evidence-backed Insight per run and a human can govern memory from the portal.
- APPROVAL-tier actions cannot be delivered without a recorded human decision.
- The full security + resilience checklist (Dimensions 1–9 of REVIEW-CRITERIA.md) scores no open P0, with P1s tracked to Phase 10.
- No tenant business or end-customer PII reaches the LLM provider unredacted or any third-party SaaS, demonstrated by test.
7. Out of Scope (v1) — with reintroduction triggers
| Deferred | Trigger to build | Ref |
|---|---|---|
| File-attachment ingestion (no longer deferred; pulled into v1 in v1.2) | — | FR-5, ADR-017 (superseded) |
| Second AI Employee (e.g., Project Manager) in the same deployment | Ops Analyst stable in production and the client signs the next role. Must be role config + adapters, no graph rewrite. | FR-12 |
| Multiple client-side portal roles (operator, viewer) | The client needs more than one person in the portal with different permissions. | FR-35 |
| Multi-tenant build (shared deployment for multiple clients) | Deliberately not the model — a second RunStack client gets a new single-tenant deployment. Revisit only if per-client servers become operationally untenable. tenant_id plumbing exists regardless. | coherence 5c |
| Multi-agent supervisor | Genuinely breadth-first parallel work exceeding one context window. | ADR-012 |
| Redis | Multiple concurrent employees, sub-second events, or cross-service coordination. | ADR-003 |
| Mid-graph human interrupts | Trust established and a use case that outbound-only can't serve. | ADR-002 |
| Semantic memory search (pgvector) | Insight volume makes filtered queries insufficient. | spec §5.7 |
8. Traceability
- Functional requirements trace to the design spec sections and the ADRs cited inline.
- Non-functional requirements trace to REVIEW-CRITERIA.md dimensions (the checklist all reviews score against) and the ADRs.
- Every architecture review (per the Review Protocol in REVIEW-CRITERIA.md) confirms the design still satisfies these requirements; a review that would break a MUST requires either a requirement change here or a rejected revision.