Requirements — AI Employee Platform

Version: 1.4 (2026-06-16) · Status: source of truth for what the app must do.

v1.1 repositioned the platform as RunStack-owned, sold to clients (Makro = first client), one single-tenant deployment per client hosting multiple AI Employees, two portal roles (RunStack admin + single client role).
v1.2 pulled file-attachment ingestion into v1, made adapters onboarding-defined and shareable across employees, added run-recovery via a DB-backed job queue, persona (SOUL) config, a skills registry, action confidence, and the client onboarding artifact pack (client-artifacts/).
v1.3 relaxes provider pinning to a per-deployment provider allowlist with redaction as the measured primary privacy control (ADR-007 revised), adds per-employee eval packs, and mandates the employee-package repo layout (shared engine + self-contained employee folders).
v1.4 adds FR-39a (staged autonomy ramp: a distinct Shadow stage + per-action-type graduation) and records a post-12.8 implementation-status reconciliation (below).

Implementation-status reconciliation (2026-06-16). A post-Phase-12.8 audit (3 analysts + direct code verification) found several MUST requirements whose mechanism is built and unit-tested but not yet triggered in the deployed system; they are not met until wired (guardrail G-32). These are tracked in the prioritized backlog OPEN-ITEMS.md (P0):
FR-29/FR-32: email delivery; the outbox has no email transport (BL-04).
FR-19/FR-30: degraded-mode auto-engage and the breaker/DLQ exist, but the error-budget evaluator is unscheduled and the breaker/DLQ are unwired (BL-02/BL-06).
FR-11a: the Payload Jobs Queue has no runner (BL-01).
Approval-expiry/escalation path: zero callers (BL-03).
This doc states the requirement; the backlog tracks the wiring. No requirement is being weakened: the gap is liveness, not intent.

Companions: design in superpowers/specs/2026-06-06-ai-employee-platform-design.md · decisions in ARCHITECTURE-REVIEW.md (18 ADRs) · review checklist in REVIEW-CRITERIA.md · development invariants in ARCHITECTURE-GUARDRAILS.md.

This document states requirements, not implementation. Each is testable and carries a priority: MUST (v1 blocker), SHOULD (v1 if affordable), DEFER (explicitly out of v1, with a trigger). Requirement IDs are stable references for the plan, reviews, and tests.

1. Vision

A RunStack-owned platform that deploys autonomous AI Employees to client businesses — each employee a role with responsibilities, KPIs, a scorecard, and a command center — owning an operational outcome end-to-end rather than answering ad-hoc queries. RunStack sells this to small and medium businesses that want autonomous AI employees; each client gets their own single-tenant deployment on a separate server, and one deployment can host multiple AI Employees for that client (e.g., an Ops Analyst today, a Project Manager later).

The engagement model is like hiring an employee: during onboarding, RunStack works with the client to define the role's KPIs, scorecard parameters, workflows, and training (directives); the client then monitors the employee's performance through the command center the way they would review a human employee.

The first client is Makro Agency; the first employee is the Ops Analyst: every working day it reads the agency's operational and financial systems, scores the business against a defined scorecard, flags what needs attention, and reports — escalating to humans only when its rules say to. It also answers questions conversationally in Slack. It learns from its own grounded observations over time, within human-set guardrails.

The product is privacy-first (the client's business and end-customer PII), single-tenant by design (one deployment per client), and built so a second AI Employee in the same deployment — or a new client deployment — is a configuration/onboarding exercise, not a rewrite.

2. Users

Two portal roles only: RunStack operates the platform; the client gets a single role (no multi-role client RBAC in v1). Other people at the client (CEO, leads) are report recipients via Slack/email, not portal users.

Persona	Who	Needs from the platform
RunStack admin (super admin)	RunStack platform operator (Rahul)	Operate the deployment: system config, kill switch, onboarding setup (scorecard, KPIs, workflows, directives), cross-employee oversight.
Tenant admin (single client-side role)	The client's designated owner (for Makro: Mike or a designate)	Log in to the command center: see scorecards and employee performance, monitoring/observability, approve/reject actions, review and govern the agent's memory.
Report recipients	Client CEO, ops/delivery leads	A trustworthy daily read on financial + delivery health in Slack/email; targeted flags routed to them; ability to ask follow-ups in Slack. Not portal users.
The AI Employee	The agent itself	Read its memory + data, compute + reason, propose actions, write grounded insights, escalate.

3. Scope

In for v1: one client deployment (Makro), one AI Employee (Ops Analyst) — with the architecture supporting additional employees in the same deployment as config; daily cron analysis + Slack conversational mode; hardened file-attachment ingestion (v1.2); scorecard evaluation; escalation rules; tiered autonomy with human approval for high-impact actions; action confidence guardrail; two-tier self-learning memory; per-employee persona (SOUL) + tenant skills registry, both onboarding-configured; queue-managed run execution with automatic retry; Slack + email delivery; a command center portal where the client logs in for monitoring, scorecard/performance, token/cost visibility, approvals, and memory governance.

Out for v1 (see §7 for triggers): a second AI Employee (Project Manager is the planned next role), multiple client-side portal roles, a multi-agent supervisor, Redis, mid-graph human interrupts, and any portal for the client's own end-customers. A second RunStack client = a new single-tenant deployment, not a multi-tenant build.

4. Functional Requirements

4.1 Data Collection

FR-1 (MUST) — The agent collects data from Productive.io, QuickBooks, Jira, Gmail, and Slack via a uniform adapter interface (BaseAdapter: connect/fetch/transform/health_check/get_metadata).
FR-2 (MUST) — Adding a new data source is one adapter file + registry registration; no change to graph or schema.
FR-2a (MUST) — An employee's adapters are defined during onboarding as configuration (EmployeeAdapterConfigs: which adapters, query scope, cadence) — not code changes per employee.
FR-2b (MUST) — Adapter connections (credentials) are tenant-scoped and shareable across employees in the same deployment: one Connection (e.g., the tenant's Productive.io account) can serve multiple AI Employees, each with its own query scope.
FR-3 (MUST) — Collection tolerates partial failure: if a source is down, the run continues and the report lists the missing source in a Data Quality section.
FR-4 (MUST) — Adapters support both batch mode (cron, pre-collected) and real-time query mode (conversational).
FR-5 (MUST) — (changed in v1.2 — was deferred under ADR-017) File attachments (PDF/CSV/XLSX/images) encountered on sources are ingested in v1 through a hardened pipeline that satisfies the full ING-1…12 contract (REVIEW-CRITERIA.md): extraction at the adapter boundary treated as untrusted, magic-byte type allowlist, size/zip-bomb caps, sandboxed parsing, SSRF controls on embedded URLs, spotlighting of extracted text, Presidio redaction over all extracted text, a PII-in-pixels policy (extract text and send redacted text only — native PDFs/images never reach the model), ZDR-eligible inline API paths only (never Files API/Batch), and provenance + audit before any memory write. Files that fail the pipeline are skipped, logged with provenance, and reported in the Data Quality section.
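
To make FR-1 through FR-3 concrete, the following is a minimal sketch of the adapter seam, not the platform's actual code. Only the five method names on BaseAdapter come from this document; the method signatures, the `ADAPTER_REGISTRY` dict, the `register` decorator, and `collect_all` are hypothetical names chosen for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseAdapter(ABC):
    """Uniform adapter surface named in FR-1 (signatures are assumed)."""

    name: str = "base"

    @abstractmethod
    def connect(self) -> None: ...

    @abstractmethod
    def fetch(self) -> list[dict[str, Any]]: ...

    @abstractmethod
    def transform(self, raw: list[dict[str, Any]]) -> list[dict[str, Any]]: ...

    @abstractmethod
    def health_check(self) -> bool: ...

    @abstractmethod
    def get_metadata(self) -> dict[str, Any]: ...


# Hypothetical registry: adding a source is one adapter class plus one registration (FR-2).
ADAPTER_REGISTRY: dict[str, type[BaseAdapter]] = {}


def register(cls: type[BaseAdapter]) -> type[BaseAdapter]:
    ADAPTER_REGISTRY[cls.name] = cls
    return cls


def collect_all(adapter_names: list[str]) -> tuple[dict[str, list], list[str]]:
    """Collect from every configured adapter; a failing source is recorded, not fatal (FR-3)."""
    data: dict[str, list] = {}
    missing: list[str] = []
    for name in adapter_names:
        try:
            adapter = ADAPTER_REGISTRY[name]()
            adapter.connect()
            data[name] = adapter.transform(adapter.fetch())
        except Exception:
            # The run continues; the name is surfaced in the Data Quality section.
            missing.append(name)
    return data, missing
```

In this sketch the list of adapter names would come from the employee's onboarding configuration (FR-2a), so the graph itself never changes when a source is added.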

4.2 Daily Analysis (cron path)

FR-6 (MUST) — On a daily schedule, the agent runs an 11-node analysis: load memory → collect → compute KPIs → evaluate scorecard → check escalations → route autonomy → generate report → grounding gate → consolidate insight → store results.
FR-7 (MUST) — If source data is unchanged since the last run, the agent exits silently consuming zero LLM tokens (wakeAgent gate). (ADR-010)
FR-8 (MUST) — KPI values (utilization, LER, cash runway, AR aging, project burn, DSO, etc.) are computed deterministically in code, never by the LLM.
FR-9 (MUST) — Scorecard ratings are computed deterministically; the LLM writes only the justification text. A rating/justification mismatch is a hard failure.
FR-10 (MUST) — Every numeric claim in the report is verified by a grounding gate against the computed KPIs. Fail → one bounded retry → fail again → block delivery and escalate to a human.
FR-11 (MUST) — Each run is idempotent on (tenant_id, employee_id, run_date, trigger); re-running a day upserts, never duplicates.
FR-11a (MUST) — (new in v1.2) Run execution is queue-managed: the cron trigger enqueues a run job in a Postgres-backed job queue (Payload Jobs Queue — no Redis, consistent with ADR-003). A failed or crashed run is retried automatically with bounded backoff (resuming from the LangGraph checkpoint where possible, safe because runs are idempotent per FR-11); after retries are exhausted, the job parks as failed, alerts a human, and exposes a one-click manual re-run in the portal. The staleness monitor (NFR-10) remains the independent backstop watching for missing output.
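
The grounding gate in FR-10 can be illustrated with a small sketch. It assumes that numeric claims can be pulled out of the report with a simple regular expression and compared against the computed KPI values within a tolerance; a real gate would need to handle dates, percentages, and units. The function names, the tolerance, and the KPI keys are hypothetical.

```python
import re
from typing import Callable, Optional

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def ungrounded_claims(report: str, kpis: dict[str, float], tol: float = 0.005) -> list[float]:
    """Return the numbers in the report that match no computed KPI value (within tol)."""
    claims = [float(m) for m in NUMBER.findall(report)]
    return [c for c in claims if not any(abs(c - v) <= tol for v in kpis.values())]


def gated_report(generate: Callable[[], str], kpis: dict[str, float]) -> tuple[str, Optional[str]]:
    """Fail, retry once, fail again: block delivery and escalate (FR-10)."""
    for _attempt in range(2):  # first attempt plus exactly one bounded retry
        report = generate()
        if not ungrounded_claims(report, kpis):
            return "deliver", report
    return "blocked_escalate", None
```

Because the KPI values are computed deterministically in code (FR-8), the gate only has to compare text against a known set of numbers rather than re-derive anything.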

4.3 Scorecard & Escalation

FR-12 (MUST) — The scorecard (categories, KPIs, thresholds, rating scale, gate-zero, escalation triggers) is configuration stored in Payload, read at runtime — not hardcoded. A new role = new config, same graph.
FR-13 (MUST) — Escalation is a rule engine evaluating thresholds against the scorecard config; triggers produce flags routed by priority.
FR-14 (SHOULD) — KPI storage is generic (name, value, unit, category, status, threshold) so a new role's KPIs need no schema migration.
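
A rough sketch of the escalation rule engine in FR-13 follows. It assumes triggers are stored as simple (KPI, operator, threshold, priority) records; the `Trigger` shape, the operator names, and the example thresholds are illustrative and are not taken from the Payload schema.

```python
import operator
from dataclasses import dataclass

OPS = {"lt": operator.lt, "lte": operator.le, "gt": operator.gt, "gte": operator.ge}
PRIORITY_ORDER = ["CRITICAL", "HIGH", "MEDIUM", "LOW"]


@dataclass(frozen=True)
class Trigger:
    kpi: str
    op: str           # one of the keys in OPS
    threshold: float
    priority: str     # one of PRIORITY_ORDER


def evaluate_triggers(kpis: dict[str, float], triggers: list[Trigger]) -> list[dict]:
    """Evaluate configured thresholds against computed KPIs; return flags by priority."""
    flags = []
    for t in triggers:
        value = kpis.get(t.kpi)
        if value is None:
            continue  # a missing KPI is a data-quality matter, not an escalation
        if OPS[t.op](value, t.threshold):
            flags.append({"kpi": t.kpi, "value": value,
                          "threshold": t.threshold, "priority": t.priority})
    return sorted(flags, key=lambda f: PRIORITY_ORDER.index(f["priority"]))
```

Since the triggers are data, a new role can bring its own thresholds as configuration (FR-12) while this evaluation code stays unchanged.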

4.4 Autonomy & Approvals

FR-15 (MUST) — Actions are routed by configurable tier: AUTO (execute), NOTIFY (alert, no approval), APPROVAL (draft → human approve → send). Tiers are per-action-type role config. (ADR-006)
FR-16 (MUST) — The agent graph runs to completion and produces (report, proposed_actions[]); it never pauses mid-reasoning for approval (outbound-only). (ADR-002)
FR-17 (MUST) — APPROVAL-tier actions are stored as pending_approval; a human approves/rejects in the portal; only then are they delivered.
FR-18 (MUST) — Approval decisions are mapped to an authenticated, authorized user and recorded in the audit log. (SEC-9)
FR-19 (MUST) — An operator kill switch halts all runs and deliveries; a degraded mode forces all actions to APPROVAL tier when an autonomy error budget is breached. (ADR-018)
FR-19a (SHOULD) — (new in v1.2) Every proposed action carries a structured confidence score and evidence references. Confidence is a one-way guardrail: below a configurable threshold it escalates the action's tier (e.g., AUTO → APPROVAL) or suppresses the proposal entirely — it can never loosen a tier or bypass the deterministic rules. Tier routing by configured rules (FR-15) remains primary; self-reported LLM confidence is a secondary signal, not the decision-maker.
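
The interaction between configured tiers (FR-15), degraded mode (FR-19), and the one-way confidence guardrail (FR-19a) might look like the sketch below. The two separate thresholds, their default values, and the rule that unknown action types default to APPROVAL are assumptions made for the example.

```python
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    AUTO = 0
    NOTIFY = 1
    APPROVAL = 2


def route_action(action_type: str, confidence: float, tier_config: dict[str, Tier],
                 escalate_below: float = 0.7, suppress_below: float = 0.3,
                 degraded: bool = False) -> Optional[Tier]:
    """Return the effective tier, or None when the proposal is suppressed.

    Confidence can only tighten the outcome: the result is never looser
    than the configured tier.
    """
    if confidence < suppress_below:
        return None  # too weak to propose at all
    configured = tier_config.get(action_type, Tier.APPROVAL)  # assumed strict default
    if degraded or confidence < escalate_below:
        return Tier.APPROVAL  # degraded mode or low confidence forces human approval
    return configured
```

Note that a high confidence score never changes the configured tier in this sketch; an APPROVAL-tier action stays APPROVAL regardless of how confident the model reports itself to be.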

4.5 Conversational Mode

FR-20 (MUST) — Users ask the Ops Analyst questions in Slack and receive answers grounded in real-time data + memory.
FR-21 (MUST) — Slack events are acknowledged within 3 seconds and the answer is delivered asynchronously; conversational threads never collide with cron checkpoints. (ADR-018)
FR-22 (MUST) — Conversational answers pass through the same redaction chokepoint as the cron path.
FR-22a (MUST) — (new in v1.2) Each AI Employee has a persona definition (SOUL) — human-authored, versioned config defining personality, tone, voice, response style, language, and boundaries ("never do X"). It is loaded into both conversational and report-generation prompts. Authored with the client during onboarding (artifact 06), changed only by humans, and audit-logged like Directives.
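
The acknowledge-then-answer pattern in FR-21 can be sketched with asyncio. This is a framework-free illustration under assumed names: `handle_slack_event`, `answer_fn`, and `deliver_fn` are hypothetical, and a real handler would sit behind the gateway and the redaction chokepoint (FR-22).

```python
import asyncio
from typing import Awaitable, Callable

ACK_DEADLINE_S = 3.0  # acknowledgment budget from FR-21 / NFR-11


async def handle_slack_event(
    event: dict,
    answer_fn: Callable[[str], Awaitable[str]],
    deliver_fn: Callable[[str, str], Awaitable[None]],
) -> dict:
    """Acknowledge right away; compute and deliver the answer in a background task."""

    async def _work() -> None:
        answer = await answer_fn(event["text"])
        await deliver_fn(event["channel"], answer)

    # Schedule the slow work without awaiting it, then return the acknowledgment.
    task = asyncio.create_task(_work())
    return {"ok": True, "task": task}  # task is returned so the caller keeps a reference
```

The handler does no data access or model call before returning, which is what keeps the acknowledgment comfortably inside the 3-second budget.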

4.6 Memory (Self-Learning)

FR-23 (MUST) — Directives — human-authored, pinned, semantic memory (suppress/override/context/target). The agent reads them and can never write, edit, or override them. (anti-poisoning, ADR-015)
FR-24 (MUST) — Insights — agent-authored episodic memory written at the end of each cron run, each citing the KPI rows that justify it (evidence required). Written via ADD/UPDATE/DELETE/NOOP with a contradiction check, not blind append.
FR-25 (MUST) — Memory is read into evaluation, report generation, and conversational answers ("vs. prior runs we noted…").
FR-26 (MUST) — An Insight is promoted to a Directive only by a human in the portal — never autonomously.
FR-27 (SHOULD) — Insights can be retired/rolled back from the portal, flagging runs that consumed a bad insight for re-review. (RPI-13)
FR-28 (SHOULD) — Escalation-suppressing insights require human review before they influence future runs. (SEC-4)
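
One possible reading of the ADD/UPDATE/DELETE/NOOP write path in FR-24 is sketched below. The document names the four operations and the evidence requirement but does not define the decision policy, so the policy here is an assumption: a contradicted insight is deleted rather than overwritten. The `Insight` shape, the topic `key`, and the `contradicts` callback are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Insight:
    key: str                                            # hypothetical stable topic key
    text: str
    evidence: list[str] = field(default_factory=list)   # ids of the justifying KPI rows


def apply_memory_op(new: Insight, store: dict[str, Insight],
                    contradicts: Callable[[Insight, Insight], bool]) -> str:
    """Write an insight with a contradiction check instead of a blind append."""
    if not new.evidence:
        raise ValueError("insight rejected: evidence (KPI rows) is required")
    existing = store.get(new.key)
    if existing is None:
        store[new.key] = new
        return "ADD"
    if existing.text == new.text:
        return "NOOP"
    if contradicts(existing, new):
        del store[new.key]  # assumed policy: retire the contradicted insight
        return "DELETE"
    store[new.key] = new
    return "UPDATE"
```

Directives are deliberately absent from this write path: per FR-23 the agent can read them but never write, edit, or override them.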

4.7 Delivery

FR-29 (MUST) — All outbound delivery (Slack, email) is owned by the gateway; the graph only produces delivery intents. (coherence 2c)
FR-30 (MUST) — Delivery is reliable: transactional outbox written atomically with the run, idempotency key per (run_id, channel, recipient), bounded retries with backoff+jitter, per-endpoint circuit breaker, DLQ with scoped replay. (§7.4)
FR-31 (MUST) — Delivery channels are resolved from config allowlists, never free-typed by the model. (SEC-3)
FR-32 (MUST) — Priority routing: CRITICAL → CEO DM + email + dashboard; HIGH → #ops + dashboard; MEDIUM → relevant lead + dashboard; LOW → dashboard. Weekly digest emailed to CEO every Friday.
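
Three of the delivery mechanics above (the FR-32 routing table, the FR-30 idempotency key, and backoff with jitter) are small enough to sketch directly. The channel identifiers in `ROUTES`, the hashing scheme, and the backoff parameters are illustrative assumptions; the document only fixes the priority-to-destination mapping and the key's components.

```python
import hashlib
import random

# Priority routing from FR-32; channel identifiers are hypothetical config keys.
ROUTES = {
    "CRITICAL": ["ceo_dm", "email", "dashboard"],
    "HIGH": ["slack:#ops", "dashboard"],
    "MEDIUM": ["relevant_lead", "dashboard"],
    "LOW": ["dashboard"],
}


def idempotency_key(run_id: str, channel: str, recipient: str) -> str:
    """Stable key per (run_id, channel, recipient), so a replay cannot double-send."""
    raw = "\x1f".join((run_id, channel, recipient))  # unit separator avoids ambiguous joins
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0, rng=random) -> float:
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))
```

Looking up `ROUTES[priority]` rather than accepting a destination from the model is one way to satisfy FR-31: an unknown priority raises a `KeyError` instead of sending somewhere unexpected.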

4.8 Command Center (Portal)

FR-33 (MUST) — The portal shows the scorecard, KPI history, agent run activity, and token/cost visibility — the client's window into employee performance (monitoring only — it cannot author AI Employees; those are configured by RunStack during onboarding via code/seed).
FR-34 (MUST) — The portal provides the approval queue (approve/reject with full context) and the memory console (review/promote/retire Directives + Insights).
FR-35 (MUST) — RBAC: two roles in v1 — super_admin (RunStack platform operator) and tenant_admin (the single client-side role: view + approve + memory governance). Finer-grained client roles (operator, viewer) are deferred (§7). Tenant scope enforced on reads and writes. (SEC-7)
FR-36 (MUST) — Every mutation across collections is recorded in an append-only, immutable audit log. (SEC-8)
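
The two-role model in FR-35 could be expressed as a permission table plus a tenant-scope check, as in this sketch. The permission names and the `can` helper are hypothetical, and exempting super_admin from the tenant match is an assumption based on that role operating the whole deployment.

```python
# Hypothetical permission sets derived from the role descriptions in section 2 and FR-35.
PERMISSIONS = {
    "super_admin": {"view", "approve", "govern_memory", "system_config", "kill_switch"},
    "tenant_admin": {"view", "approve", "govern_memory"},
}


def can(user: dict, action: str, resource_tenant_id: str) -> bool:
    """Allow an action only if the role grants it and the tenant scope matches."""
    allowed = action in PERMISSIONS.get(user["role"], set())
    if user["role"] == "super_admin":
        return allowed  # assumed: the platform operator is not tenant-restricted
    return allowed and user["tenant_id"] == resource_tenant_id
```

Unknown roles resolve to an empty permission set, so the deferred client roles (operator, viewer) are denied by default until they are explicitly configured.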

4.9 Skills & Onboarding Configuration (new in v1.2)

FR-37 (MUST) — A skills registry: a skill is a named, versioned capability bundle (prompt instructions + allowlisted tools + optional adapter dependencies). Skills are tenant-scoped and assignable to multiple AI Employees in the same deployment; an employee's skill set is configuration, not code.
FR-38 (MUST) — Onboarding a new AI Employee is a configuration exercise captured in the client artifact pack (docs/client-artifacts/): role charter, scorecard/KPIs, workflows, escalation + approval rules, data access/adapters, persona (SOUL), skills/tools, acceptance criteria. Each artifact maps 1:1 to platform config (ScorecardConfigs, AutonomyConfigs, EmployeeAdapterConfigs, PersonaConfigs, SkillConfigs) so the completed pack is the employee's configuration source.
FR-39 (SHOULD) — A new AI Employee starts in a training/probation mode: all actions forced to APPROVAL tier for a configurable ramp period; autonomy widens only by explicit human sign-off against the scorecard.
FR-39a (SHOULD) — (new in v1.4) The autonomy ramp is staged, not binary. (1) A Shadow stage precedes training: the employee runs and produces output that is reviewed privately, delivering nothing externally. (2) Training mode (FR-39) follows: actions are real but all APPROVAL-tier. (3) Graduation widens autonomy per action type and is system-enforced: an action type moves to AUTO/NOTIFY only on its own human sign-off, never through a wholesale flip of the entire tier matrix. The current implementation is a binary training_mode (no Shadow stage; graduation is a wholesale flip, with per-action narrowing done manually in config); see backlog BL-09 (Shadow) and BL-10 (per-action graduation). Until these are built, customer-facing copy describes the binary reality.
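
The target behavior of the staged ramp in FR-39a (not the current binary implementation) can be sketched as a pure function. The `Stage` enum, the `NO_DELIVERY` marker for Shadow, and the `signed_off` set are hypothetical names for the example.

```python
from enum import Enum


class Stage(Enum):
    SHADOW = "shadow"
    TRAINING = "training"
    GRADUATED = "graduated"


def effective_tier(stage: Stage, action_type: str,
                   configured: dict[str, str], signed_off: set[str]) -> str:
    """Resolve the tier an action actually gets under the staged autonomy ramp."""
    if stage is Stage.SHADOW:
        return "NO_DELIVERY"  # output is reviewed privately; nothing goes out
    if stage is Stage.TRAINING or action_type not in signed_off:
        return "APPROVAL"     # real actions, but every one needs a human decision
    # Graduated and individually signed off: use the configured tier.
    return configured.get(action_type, "APPROVAL")
```

Because graduation is checked per action type, signing off one action type leaves every other type at APPROVAL, which is the opposite of a wholesale flip of the tier matrix.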

5. Non-Functional Requirements

5.1 Privacy & Data Governance

NFR-1 (MUST) — The tenant's business and end-customer PII is redacted before any LLM call and restored on output, at a single chokepoint that fails closed if redaction is unavailable. (ADR-011, SEC-5)
NFR-2 (MUST) — (revised v1.3 — was Anthropic-only pinning) LLM calls are provider-agnostic through LiteLLM, constrained by a per-deployment provider allowlist: each entry records its data-retention posture (ZDR / no-training / retention period) and the client signs the allowlist at onboarding. LiteLLM may route intelligently (cost, latency, fallback) only among allowlisted entries; OpenRouter only with ZDR-endpoints-only enabled. Non-retention-safe features (Files API, Batch, code execution) forbidden on all routes. CI enforces: no route or fallback outside the allowlist. (ADR-007 rev. 2026-06-11, ADR-011)
NFR-2a (MUST) — (new in v1.3) With provider pinning relaxed, redaction is the primary privacy control: the chokepoint remains fail-closed (NFR-1), and redaction effectiveness is measured, not assumed — a per-tenant recall eval (seeded synthetic PII + the tenant's real entity lists: client names, people, projects) runs in CI and on RedactionConfig changes; recall below threshold blocks deploys of redaction-affecting changes.
NFR-3 (MUST) — Observability receives only masked content; no raw PII leaves to any third-party SaaS. (ADR-016, SEC-6)
NFR-4 (SHOULD) — Data minimisation and a documented retention/residency map; tenant data is locatable and deletable across all stores. (GOV-2/3/4)
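
The fail-closed chokepoint (NFR-1) combined with the provider allowlist (NFR-2) can be illustrated as follows. This sketch deliberately uses a plain string-replacement stand-in instead of Presidio and a callable stand-in instead of LiteLLM, so none of it reflects those libraries' APIs; `guarded_call`, `simple_redactor`, and the placeholder format are hypothetical.

```python
from typing import Callable

Redactor = Callable[[str], tuple[str, dict[str, str]]]


class RedactionUnavailable(RuntimeError):
    """Raised instead of calling the model when redaction cannot run."""


def simple_redactor(entities: list[str]) -> Redactor:
    """Stand-in redactor: swaps known entity names for placeholders."""
    def _redact(text: str) -> tuple[str, dict[str, str]]:
        mapping: dict[str, str] = {}
        for i, entity in enumerate(entities):
            if entity in text:
                token = f"<ENTITY_{i}>"
                text = text.replace(entity, token)
                mapping[token] = entity
        return text, mapping
    return _redact


def guarded_call(prompt: str, provider: str, allowlist: set[str],
                 redactor: Redactor, llm: Callable[[str, str], str]) -> str:
    """Single chokepoint: allowlist check, redact, call, restore."""
    if provider not in allowlist:
        raise PermissionError(f"provider {provider!r} is not on the deployment allowlist")
    try:
        redacted, mapping = redactor(prompt)
    except Exception as exc:
        # Fail closed: no redaction means no model call.
        raise RedactionUnavailable("redaction failed; LLM call blocked") from exc
    response = llm(provider, redacted)
    for token, original in mapping.items():
        response = response.replace(token, original)  # restore on output
    return response
```

The same shape lends itself to the recall eval in NFR-2a: seed text with known entities, run it through the redactor, and measure how many of them no longer appear in the redacted output.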

5.2 Security

NFR-5 (MUST) — Stored credentials are encrypted at rest, hidden from portal reads, decryptable only by the agent service account. (ADR-014, SEC-1)
NFR-6 (MUST) — Every internal seam authenticates; only a TLS reverse proxy is internet-facing; internal services are not published. (ADR-013, SEC-2/10)
NFR-7 (MUST) — Adapter-fetched content is treated as untrusted input (spotlighting, structured output, allowlisted routing). (ADR-015, SEC-3)

5.3 Reliability

NFR-8 (MUST) — Per-service failure modes are defined with graceful degradation; recovery is tested. Agent runs are idempotent and re-runnable. (REL dimension)
NFR-9 (MUST) — A single retry owner per call path (no amplification); durable, replay-safe checkpoints (durability="sync", side-effect-free nodes). (ADR-018, REL-4/7/8)
NFR-10 (MUST) — A dead-man / staleness monitor fires if the expected daily output doesn't appear by deadline. (REL-13)
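
The dead-man check in NFR-10 reduces to a small predicate, sketched here. The `is_stale` name, the deadline parameter, and the rule that weekends are skipped (inferred from "every working day" in the vision) are assumptions; public holidays are not handled.

```python
from datetime import date, datetime, time
from typing import Optional


def is_stale(now: datetime, last_output_date: Optional[date], deadline: time) -> bool:
    """True when today's deadline has passed and no output exists for today."""
    if now.weekday() >= 5:
        return False  # assumed: Saturday and Sunday are not working days
    if now.time() < deadline:
        return False  # still inside the allowed window
    return last_output_date != now.date()
```

Running this check from a scheduler that is independent of the job queue keeps it useful as a backstop: it watches for the absence of output rather than for a reported failure.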

5.4 Performance

NFR-11 (MUST) — Slack acknowledgment within 3 seconds, answer delivered async. (PERF-1)
NFR-12 (SHOULD) — A documented conversational latency budget with p95 targets; Presidio overhead measured. (PERF-2/3/4)
NFR-13 (MUST) — Per-run and per-day token/cost ceilings enforced at the gateway; explicit recursion limit. (PERF-5)
NFR-13a (SHOULD) — (revised v1.3) Cost-aware model tiering + routing: each graph node declares a model tier in config (cheap/fast for extraction + classification, frontier for reasoning + report prose); LiteLLM resolves tiers via model aliases and may route dynamically (cost/latency/fallback) within the per-deployment provider allowlist (NFR-2). Changing a node's tier mapping is a config change gated by the eval regression suite (NFR-17) — the eval gate, not provider pinning, is what keeps routing changes safe.
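
The per-run and per-day ceilings in NFR-13 could be enforced with a small accounting object at the gateway, as in this sketch. The `TokenBudget` class, its field names, and the choice to reject a call before it is made (rather than after) are assumptions for illustration.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a call would push usage past a configured ceiling."""


@dataclass
class TokenBudget:
    per_run_limit: int
    per_day_limit: int
    run_used: int = 0
    day_used: int = 0

    def start_run(self) -> None:
        """Reset the per-run counter; the per-day counter keeps accumulating."""
        self.run_used = 0

    def charge(self, tokens: int) -> None:
        """Reserve tokens for a call, refusing it if either ceiling would be crossed."""
        if self.run_used + tokens > self.per_run_limit:
            raise BudgetExceeded("per-run token ceiling reached")
        if self.day_used + tokens > self.per_day_limit:
            raise BudgetExceeded("per-day token ceiling reached")
        self.run_used += tokens
        self.day_used += tokens
```

A refused charge leaves both counters unchanged, so a blocked call does not consume budget that later, smaller calls could still use.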

5.5 Autonomy & Safety

NFR-14 (MUST) — The Ops Analyst is a bounded workflow, not an open-ended agent: fixed graph, deterministic spine, stopping conditions. (AUTO-1/2/3)
NFR-15 (MUST) — High-impact actions require human approval; tool permissions are least-privilege; escalation is a structured output. (AUTO-4/5)
NFR-16 (MUST) — The agent cannot modify its own behavioral rules; only human promotion changes Directives/prompts. (AUTO-8)
NFR-17 (MUST) — Prompt/model/code changes are gated by offline eval regression on a versioned dataset; production traces scored online. (AUTO-9, spec §11)
NFR-17a (MUST) — (new in v1.3) Each AI Employee ships with its own eval pack living in its employee folder (NFR-24): golden dataset (input snapshots → expected scorecard ratings + report properties), grown from the shadow/training period and from production human feedback (every approve/reject/edit on an action or report is captured as a labeled example). The eval pack is part of onboarding output — a new employee is not graduated (FR-39) without a baseline eval suite.

5.6 Observability & Operability

NFR-18 (MUST) — Structured logs correlated by run_id/trace_id across services; per-node traces with token cost and latency. (OPS-3)
NFR-19 (MUST) — No silent failures: every error produces an activity-log entry, Slack alert, or trace. (§9.4)
NFR-20 (SHOULD) — Full stack runs locally with one command; config from env; runbook for top failure modes. (OPS-1/2/5)
NFR-21 (MUST) — CI enforces the architecture's invariants (no-LLM-import-outside-chokepoint, no-fallback-on-sensitive-route, access-function tests, eval gate). (OPS-7)

5.7 Single-Tenant Fitness

NFR-22 (MUST) — No infrastructure whose failure the single tenant could never perceive (no Redis, no HA cluster) — but expensive-to-retrofit seams (BaseAdapter, tenant_id plumbing, structured contracts, subgraph seam) exist from day one. (FIT dimension, ADR-003/005/012)
NFR-23 (MUST) — Deferred capabilities carry documented reintroduction triggers. (FIT-5)
NFR-24 (MUST) — (new in v1.3) Employee-package repo layout: the engine (graph, nodes, adapters, LLM chokepoint, security, core) is shared platform code; each AI Employee is a self-contained package folder (employees/<role>/) containing only its onboarding artifacts, configs, persona, prompt fragments, eval pack, and employee-specific tests — no engine code. Adding employee #5 must not touch employee #1's folder or the engine; an employee folder containing business logic is an architecture violation (it recreates the fork-per-role problem FR-12 exists to prevent). (spec §15)

6. Success Criteria

The v1 platform is accepted when:

The Ops Analyst runs daily, produces a grounded report with zero ungrounded numeric claims, and delivers it reliably.
The CEO can answer the gate-zero questions (cash, utilization, delivery health, escalations) from the daily output without opening a spreadsheet.
A user gets a correct, grounded answer to a Slack question within the latency budget, with the Slack event acknowledged within 3 seconds.
The agent writes at least one evidence-backed Insight per run and a human can govern memory from the portal.
APPROVAL-tier actions cannot be delivered without a recorded human decision.
The full security + resilience checklist (Dimensions 1–9 of REVIEW-CRITERIA.md) scores no open P0, with P1s tracked to Phase 10.
No tenant business or end-customer PII reaches the LLM provider unredacted or any third-party SaaS, demonstrated by test.

7. Out of Scope (v1) — with reintroduction triggers

Deferred	Trigger to build	Ref
~~File-attachment ingestion~~ — pulled into v1 (v1.2); must satisfy ING-1…12	—	FR-5, ADR-017 (superseded)
Second AI Employee (e.g., Project Manager) in the same deployment	Ops Analyst stable in production and the client signs the next role. Must be role config + adapters, no graph rewrite.	FR-12
Multiple client-side portal roles (operator, viewer)	The client needs more than one person in the portal with different permissions.	FR-35
Multi-tenant build (shared deployment for multiple clients)	Deliberately not the model — a second RunStack client gets a new single-tenant deployment. Revisit only if per-client servers become operationally untenable. `tenant_id` plumbing exists regardless.	coherence 5c
Multi-agent supervisor	Genuinely breadth-first parallel work exceeding one context window.	ADR-012
Redis	Multiple concurrent employees, sub-second events, or cross-service coordination.	ADR-003
Mid-graph human interrupts	Trust established and a use case that outbound-only can't serve.	ADR-002
Semantic memory search (pgvector)	Insight volume makes filtered queries insufficient.	spec §5.7

8. Traceability

Functional requirements trace to the design spec sections and the ADRs cited inline.
Non-functional requirements trace to REVIEW-CRITERIA.md dimensions (the checklist all reviews score against) and the ADRs.
Every architecture review (per the Review Protocol in REVIEW-CRITERIA.md) confirms the design still satisfies these requirements; a review that would break a MUST requires either a requirement change here or a rejected revision.