Technical Whitepaper · v1 · 2026

Forge: A platform for generating personalized, self-modifying AI agents

A 5-wave context interview produces a complete agent: 30 capabilities wired by default, 96 tools, persistent memory, a heartbeat loop, and the ability to edit its own voice, identity, and code. This document describes the architecture, the security posture, and the trade-offs.

myagentos.ai · github.com/dfritz603-afk/scout-platform · 325/325 critical tests · 53 PRs · latest f582e4d

Abstract

Forge is a platform that generates personalized, autonomous AI agents from a 5-wave context interview. The problem it solves is that real agents (ones with persistent memory, self-scheduled behavior, tool access, and an identity that compounds across sessions) take months of engineering per agent and do not survive the next LLM upgrade. Forge factors that engineering into a platform with three layers: an architect that captures user context into a DesignSpec, a generator that emits a complete runnable agent, and a runtime (scout_runtime) that drives the cognitive loop. The key technical decisions are a single-source-of-truth catalog (Python schema, JS bindings auto-generated), a workspace-as-identity model (markdown files compose the system prompt), a tool registry with CI-enforced alignment between docs and runtime, and a sandboxed self-modification surface so agents can edit their own voice, rules, memory, and code. The result is a 5-minute interview that produces, by default, an agent with 30 capabilities, 96 tools, three deployment modes, and 325 passing critical tests behind it.

1. The problem

A modern LLM is a stateless single-turn function. You hand it a prompt, it returns a completion. That is useful, but it is not an agent. An agent is what happens when you wrap that function with the machinery that lets it act over time on behalf of a specific person: persistent memory, self-scheduling, autonomous behavior between user turns, tool integration with credentials and error handling, and an identity that gets sharper instead of resetting every session.

Building that wrapper for one agent — say, a sales-ops assistant that lives in your Slack, knows your HubSpot, follows up on warm leads, drafts your weekly board update, and remembers that you don't want it texting you between 10pm and 7am — is two to three months of engineering. You need a catalog of capabilities, a tool registry, a workspace layout, a memory substrate, a scheduler, a heartbeat, an identity assembly step, a voice system, a sandbox, a packaging story for macOS / Linux / Windows, transport adapters for REPL / HTTP / ACP, secret handling, and a security review.

You also need it to keep working when Anthropic ships a new model cap, when a tool API changes its schema, when the user installs on Windows and discovers Python doesn't ship tzdata there by default, and when the agent itself learns something incorrect and persists it into MEMORY.md.

Building one such agent is hard. Building a platform that produces such agents on demand, where the architect interview takes five minutes and the output is a zip the user runs on their own machine with their own keys, is a different engineering problem. It is not “an LLM wrapper.” The hard parts live in the joints: how capabilities get wired, how tool definitions stay aligned between documentation and runtime, how memory survives without poisoning the next session, how self-modification stays inside an allowlist, and how every one of those joints holds when a non-technical user downloads v1.4 on a Windows laptop and runs the PowerShell quickstart.

Forge is that platform. The rest of this paper describes what decisions it makes and why.

2. Why this is hard

The naive framing is “give the LLM a list of tools and let it figure the rest out.” That framing fails in specific, recurring ways. Every item below was a real PR shipped against this repository:

Tool-runtime alignment drift

The capability documentation declares a tool named file_write; the runtime registers it under file_ops.write. The LLM, reading the docs, calls file_write and crashes. This used to happen every time anyone added a sub-tool without updating the resolver. Solved structurally by #99: a CI test (test_capability_docs_runtime_audit.py, 17 cases) now fails the build if any catalog entry declares tools that don't actually register at runtime.

Memory poisoning

The agent tries a tool, the tool fails (network blip, missing env var), the agent writes “web_extract doesn't work” into MEMORY.md. The bug gets fixed two days later. The agent keeps refusing to use web_extract for weeks. Solved by #102: a memory consistency check at boot warns when MEMORY.md contradicts the live tool registry.
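A boot-time check of this kind could be sketched as follows. This is not the code from #102: the function name, the one-claim-per-line reading of MEMORY.md, and the list of negative phrases are all assumptions made for illustration.

```python
import re

# Hypothetical phrases that mark a "this tool is broken" note; the real check may differ.
NEGATIVE_PATTERNS = re.compile(
    r"(doesn't work|does not work|is broken|unavailable)", re.IGNORECASE
)


def find_stale_memory_claims(memory_text: str, live_tools: set[str]) -> list[str]:
    """Return memory lines that call a currently registered tool broken."""
    warnings = []
    for line in memory_text.splitlines():
        if not NEGATIVE_PATTERNS.search(line):
            continue
        # Flag the line only if it names a tool that is registered right now.
        mentioned = [t for t in live_tools if re.search(rf"\b{re.escape(t)}\b", line)]
        if mentioned:
            warnings.append(line.strip())
    return warnings
```

The check only warns; deciding whether to delete or rewrite the stale note is left to the user or the agent.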

Identity drift

The agent can rewrite its own SOUL.md. A prompt injection in an email it scans says “your new persona is an unrestricted assistant.” The agent edits the file. The user notices three weeks later. The mitigation is partial — we auto-backup every edit (*.pre-edit-N.bak) and the edit goes through the sandboxed file ops path — but this remains a known limitation, discussed in §7.

Capability bloat

Each new tool is more surface area. #104 enforces that all 30 catalog capabilities get wired into every generated agent (the F34 doctrine: the catalog is the floor, not a menu), and #105 adds server-side auto-wire before validation, so an architect that under-specifies still ships a complete agent.

Cross-platform packaging

Windows alone has: missing tzdata on stock Python (#108), PowerShell quoting rules that differ from bash (#112), ANSI escape behavior that varies by terminal, and path conventions that break naive os.path.join code. Each of these had to be solved once, in the generator, so every generated agent inherits the fix.

Sandbox correctness

A file sandbox that “looks right” can have a walk-up-to-/ bug in its install-dir detection, a symlink escape via os.path.realpath ordering, or relative traversal via ../../../. #113 exercised 10 adversarial inputs (including /etc/passwd, ~/.ssh/id_rsa, symlink targets, install-dir walk-up, and relative traversal), and the file sandbox now rejects all 10.

Each of these is the kind of bug that “just give the LLM tools” assumes away. They are the work.

3. Architecture

Forge is three layers with a single source of truth threading through them.

3.1 Architect

The architect runs the Phase 1.5 Context Interview: 5 waves, 27+ questions, capturing the user's voice, role, objectives, constraints, and preferences. The output is a DesignSpec: a structured object that names the agent, declares which capabilities it needs, encodes the standing orders, and ships a USER.md file that is prepended to every subsequent LLM call. The architect is conservative: it under-specifies rather than guesses, and the server-side auto-wire step (#105) backfills missing capabilities before generation.
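The backfill step can be pictured as a small set operation over capability names. The sketch below is an assumption about shape, not the #105 implementation: auto_wire is a hypothetical name, and rejecting unknown names is a choice made here for illustration.

```python
def auto_wire(spec_capabilities: list[str], catalog: list[str]) -> list[str]:
    """Return the spec's capabilities plus any catalog entries it omitted."""
    unknown = [c for c in spec_capabilities if c not in catalog]
    if unknown:
        # Assumption: a name outside the catalog is treated as a validation error.
        raise ValueError(f"unknown capabilities: {unknown}")
    wired = list(dict.fromkeys(spec_capabilities))  # de-duplicate, keep order
    wired += [c for c in catalog if c not in wired]
    return wired
```

An under-specified spec that names only hubspot_integration comes back with every other catalog entry appended after it.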

3.2 Generator

The generator is a Python program. It takes a DesignSpec and emits a complete agent: the workspace (markdown identity files), the capability implementations (selected from the catalog), the tool registry wiring, requirements.txt, runner.py, the three transport entry points (REPL / gateway / TUI), and a README. The generator is opinionated: it does not generate abstract base classes for the user to fill in. It emits working code.

3.3 Runtime

scout_runtime boots the generated agent. It drives the cognitive loop: read input, assemble system prompt from workspace files, call the LLM, dispatch tool calls, persist results into memory, sleep until the next heartbeat tick or the next user turn. It manages the heartbeat thread, the commitment scheduler, A-MEM persistence, and the sandboxed file ops layer.

3.4 Key architectural decisions

Single source of truth. The catalog lives in scout_architect/capabilities.py as a Python schema. The JS bindings used by the web UI are auto-generated by sync-python.js. There is no second copy to drift.

Workspace as identity. The agent's sense of self is six markdown files (AGENTS.md, SOUL.md, IDENTITY.md, USER.md, STANDING_ORDERS.md, HEARTBEAT.md) composed into the system prompt on every call. The agent can edit them through tools like update_soul, memory_add, and standing_order_add. MEMORY.md and USER.md are user-managed and never touched by the workspace freshness check (#101).
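A minimal sketch of that composition step, assuming the files are simply concatenated under a heading per file. The function name, the ordering, and the heading format are illustrative; the source names the six files but not how they are joined.

```python
from pathlib import Path

# File order is an assumption; the source lists these six files without an ordering rule.
IDENTITY_FILES = [
    "AGENTS.md", "SOUL.md", "IDENTITY.md",
    "USER.md", "STANDING_ORDERS.md", "HEARTBEAT.md",
]


def assemble_system_prompt(workspace: Path) -> str:
    """Concatenate the workspace identity files into one system prompt."""
    sections = []
    for name in IDENTITY_FILES:
        path = workspace / name
        if not path.is_file():
            continue  # a missing file is skipped rather than treated as fatal
        body = path.read_text(encoding="utf-8").strip()
        if body:
            sections.append(f"## {name}\n{body}")
    return "\n\n".join(sections)
```

Because the prompt is rebuilt from disk on every call, an edit to SOUL.md takes effect on the next turn without a restart.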

Tool registry pattern. Every capability declares its tools with explicit schemas, a side_effect_level (read / write / network / exec), and a typed set of error kinds. The runtime resolves tool names through the registry, not by string-matching method names.
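To make the name-resolution point concrete, here is a toy registry. It is a simplified stand-in, not scout_runtime.registry: the Tool fields, REGISTRY, call_tool, and the unknown_tool error kind are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tool:
    name: str
    capability: str
    side_effect: str  # "read" | "write" | "network" | "exec"
    error_kinds: tuple[str, ...] = ()


REGISTRY: dict[str, tuple[Tool, Callable]] = {}


def register_tool(tool: Tool):
    """Decorator that records a function under its declared tool name."""
    def decorator(fn):
        if tool.name in REGISTRY:
            raise ValueError(f"duplicate tool name: {tool.name}")
        REGISTRY[tool.name] = (tool, fn)
        return fn
    return decorator


def call_tool(name: str, **kwargs) -> dict:
    # Lookup goes through the registry, never getattr() on a module or class.
    if name not in REGISTRY:
        return {"ok": False, "error_kind": "unknown_tool", "tool": name}
    _, fn = REGISTRY[name]
    return fn(**kwargs)
```

A call to a name the registry has never seen returns a structured error instead of raising, which is what lets the agent recover from a bad tool name.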

Heartbeat loop. A separate thread runs every N minutes (configurable in HEARTBEAT.md). It checks scheduled commitments, fires cron-style tasks, can call any tool. This is what gives the agent proactive behavior — the difference between a chatbot and an assistant.
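The thread itself can be small. The following is a sketch under assumptions: the Heartbeat class and its tick callback are hypothetical names, and the interval is passed in directly rather than parsed from HEARTBEAT.md.

```python
import threading
from typing import Callable


class Heartbeat:
    """Run `tick` every `interval_s` seconds on a daemon thread until stopped."""

    def __init__(self, interval_s: float, tick: Callable[[], None]) -> None:
        self.interval_s = interval_s
        self.tick = tick
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        # Event.wait returns True once stop() is called, so the loop exits promptly
        # instead of sleeping out the remainder of the interval.
        while not self._stop.wait(self.interval_s):
            try:
                self.tick()
            except Exception as exc:  # one failed tick should not kill the loop
                print(f"heartbeat tick failed: {exc!r}")

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()
```

Waiting on an Event rather than calling time.sleep is the design choice that matters here: shutdown does not have to wait for a multi-minute interval to elapse.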

A-MEM (associative memory). Embedded notes plus graph links. Embeddings come from fastembed ONNX: no PyTorch dependency, CPU-only, air-gappable. Notes get linked by similarity and by explicit cross-reference. Recall is hybrid: similarity search over the embedding store, plus graph traversal from the linked notes.
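Hybrid recall can be illustrated without any embedding library. In this sketch the vectors are plain lists, the graph is a dict of neighbour sets, and hybrid_recall is a hypothetical name; the real store uses fastembed vectors and may rank or expand differently.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_recall(query_vec, notes, links, top_k=1):
    """notes: {note_id: vector}; links: {note_id: set of linked note_ids}.

    Returns the top_k most similar notes followed by their direct graph neighbours.
    """
    ranked = sorted(notes, key=lambda nid: cosine(query_vec, notes[nid]), reverse=True)
    seeds = ranked[:top_k]
    result = list(seeds)
    for nid in seeds:
        for neighbour in sorted(links.get(nid, ())):
            if neighbour not in result:
                result.append(neighbour)
    return result
```

The graph step is what surfaces a note that is linked to a strong match but would not have ranked on similarity alone.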

3.5 How a capability resolver works

A capability is a Python module that declares its tools and registers them. Sketch:

scout_runtime/capabilities/file_ops.py (illustrative)

from scout_runtime.registry import register_tool, Tool, SideEffect

@register_tool(Tool(
    name="file_write",
    capability="file_ops",
    side_effect=SideEffect.WRITE,
    schema={
        "path": {"type": "string", "required": True},
        "content": {"type": "string", "required": True},
    },
    error_kinds=("path_denied", "io_error"),
))
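def file_write(path: str, content: str) -> dict:
    # _resolve_safe_path lives elsewhere in the runtime. It returns a Path inside
    # the allowlist (workspace, ~/forge/, install dir, ALLOWED_ROOTS) or None.
    safe = _resolve_safe_path(path)
    if safe is None:
        return {"ok": False, "error_kind": "path_denied", "path": path}
    data = content.encode("utf-8")
    try:
        safe.write_bytes(data)
    except OSError:
        # Surface disk failures as the declared io_error kind, not a stack trace.
        return {"ok": False, "error_kind": "io_error", "path": path}
    # Report encoded bytes; len(content) would count characters instead.
    return {"ok": True, "path": str(safe), "bytes": len(data)}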

The CI alignment test reads the catalog, walks every declared tool, asserts each one resolves to a registered function with a matching schema. Drift is a build break, not a runtime crash.

4. The 30-capability catalog

The catalog is the contract. Every capability is a complete implementation (runnable code, tests, documented schemas), not an abstract interface for the user to fill in. Adding a capability means writing the impl and its tests and updating the catalog entry; the generator picks it up automatically and #104's auto-wire ensures every new agent ships with it.

The catalog source of truth is scout_architect/capabilities.py. The categories below mirror that file.

Category	Count	Capabilities
Core	5	persistent_memory, scheduled_commitments, heartbeat_loop, skills, standing_orders
Data	5	file_ops, code_modification, web_search, web_extract, sqlite_database
Execution	5	subagent_spawning, python_execution, shell_execution, flows, sandbox
Communication	7	email_imap, slack_integration, sms_messaging, webhook_receiver, gateway, tts, conversation_logging
Integration	8	github_integration, calendar_integration, notion_integration, linear_integration, hubspot_integration, acp, plugins, metric_tracking

Two architectural points fall out of having 30 first-class capabilities:

Every agent gets all of them. The F34 doctrine, enforced by #104, is that the catalog is the floor of the agent's capability surface, not a checklist. An architect interview that captures “I want a sales agent” still emits the full HubSpot + email + Slack + heartbeat + memory stack, because narrowing too early loses the ability to do follow-on work later.

The catalog is the floor, not the ceiling. #109 documents three extension paths: conversational (tell the agent the API docs and an env var and it calls web_extract + python_exec to integrate at runtime), standing orders (add a section to STANDING_ORDERS.md, the agent treats it as native), and plugins (drop a Python file into plugins/ with @register_tool, full first-class registration). The 30 are the curated guarantee; new behaviors don't require a release.

5. The 96-tool surface

Every generated agent boots with 96 tools registered: 43 built-in (memory, commitments, cron, filesystem, workspace, skills, delegation, background tasks, standing orders, terminal) plus 53 capability sub-tools contributed by the 30 capabilities.

The CI test test_capability_docs_runtime_audit.py enforces alignment: it walks the documented tool surface, walks the live runtime registry after a clean boot, and asserts the symmetric difference is empty. If anyone adds a sub-tool to a capability impl without updating the catalog declaration — or vice versa — the build fails. The test ships 17 cases (#99) and runs on every PR.
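The core of such an audit is a two-way set difference. The sketch below is not the contents of test_capability_docs_runtime_audit.py; the function names are hypothetical, and collecting the two sets of names from the catalog and from a booted registry is assumed to happen elsewhere.

```python
def audit_alignment(documented: set[str], registered: set[str]) -> dict[str, list[str]]:
    """Compare documented tool names against the live registry, in both directions."""
    return {
        "documented_not_registered": sorted(documented - registered),
        "registered_not_documented": sorted(registered - documented),
    }


def assert_aligned(documented: set[str], registered: set[str]) -> None:
    """Raise with a readable report if the two tool surfaces differ."""
    drift = audit_alignment(documented, registered)
    if any(drift.values()):
        raise AssertionError(f"tool surface drift: {drift}")
```

Reporting the two directions separately tells the author which side to fix: a documented name with no registration is a broken promise to the LLM, while an undocumented registration is a tool the LLM will never learn about.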

This sounds bureaucratic. In practice it is the difference between a tool surface you can document and one you cannot. In raw Anthropic-API setups, the tool list is “whatever the developer remembered to wire this week.” In a LangChain project, it is “whatever combination of Tool instances the author imported into the agent executor.” In Forge, the catalog is the contract, the docs are generated from the catalog, and the runtime is verified against both.

A second consequence: tool tests are uniform. Every tool declares its error kinds; the runtime asserts on them; the tests exercise them. When a tool fails, the agent sees a structured error (path_denied, rate_limited, auth_missing) rather than a stack trace, so its recovery behavior is itself testable.

And a third: the tool list ships with explicit side_effect_level annotations. The runtime knows which calls are pure reads vs. writes vs. network vs. arbitrary code execution. The standing-orders system uses this to enforce rules like “never call a write tool without confirming first” without each capability having to re-implement that policy.
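A central policy gate keyed on the side-effect level might look like this. The enum values mirror the four levels named in the text; dispatch, needs_confirmation, and the confirmation_required error kind are hypothetical and stand in for whatever the standing-orders system does internally.

```python
from enum import Enum


class SideEffect(Enum):
    READ = "read"
    WRITE = "write"
    NETWORK = "network"
    EXEC = "exec"


def needs_confirmation(side_effect: SideEffect, confirm_levels: frozenset) -> bool:
    """True when a standing order requires the user to confirm this kind of call."""
    return side_effect in confirm_levels


def dispatch(tool, side_effect, args, confirm_levels, confirmed=False):
    # The policy is applied once here, so individual tools never re-implement it.
    if needs_confirmation(side_effect, confirm_levels) and not confirmed:
        return {"ok": False, "error_kind": "confirmation_required"}
    return {"ok": True, "result": tool(**args)}
```

With confirm_levels set to frozenset({SideEffect.WRITE}), reads run immediately and writes are held until the caller passes confirmed=True.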

6. Self-modification

A Forge agent can edit itself. This is the doctrine introduced in #110 (“what you can change about yourself”) and the implementation surface added in #111. Editable targets:

Target	File	Tool
Voice / persona	SOUL.md	update_soul
Name / role	IDENTITY.md	update_identity
User facts	USER.md	user_fact_add / user_fact_edit
Hard rules	STANDING_ORDERS.md	standing_order_add / standing_order_remove
Heartbeat cadence	HEARTBEAT.md	heartbeat_update
Durable memory	MEMORY.md	memory_add / recall
Code	runner.py, plugins/*, capability impls	code_create / code_edit / code_rollback / restart_self

Every write goes through one chokepoint: _resolve_safe_path. The function enforces a positive allowlist — workspace root, ~/forge/, the install directory, and any path in FILE_OPS_ALLOWED_ROOTS — and rejects everything else. Symlinks are resolved with realpath before the allowlist check, so a symlink at workspace/escape pointing to /etc/passwd resolves to /etc/passwd and gets rejected. Install-dir detection walks up at most 6 levels so a deeply-nested install cannot reach /.
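The resolve-then-check ordering is the part most worth seeing in code. This is a simplified sketch, not the runtime's _resolve_safe_path: the allowlist is passed in as an argument, relative paths resolve against the current directory, and install-dir detection is left out.

```python
import os
from pathlib import Path
from typing import Iterable, Optional


def resolve_safe_path(path: str, allowed_roots: Iterable[str]) -> Optional[Path]:
    """Return the fully resolved path if it sits under an allowed root, else None."""
    # realpath collapses ../ segments and follows symlinks BEFORE the allowlist check,
    # so a link inside the workspace that points outside it is judged by its target.
    real = os.path.realpath(os.path.expanduser(path))
    for root in allowed_roots:
        real_root = os.path.realpath(os.path.expanduser(root))
        try:
            inside = os.path.commonpath([real, real_root]) == real_root
        except ValueError:  # e.g. paths on different Windows drives
            inside = False
        if inside:
            return Path(real)
    return None
```

Comparing with os.path.commonpath rather than a string prefix test matters: a prefix test would accept /home/u/workspace-evil as being "under" /home/u/workspace.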

Adversarial inputs explicitly tested in #113 and currently passing 10/10: /etc/passwd, ~/.ssh/id_rsa, symlink escapes, relative traversal (../../../), absolute-path injection, install-dir walk-up to /, and several encoding-trick variants.

Code edits get an additional safeguard: every code_edit or code_create writes a backup (*.pre-edit-N.bak, *.pre-create-N.bak) before the change. code_rollback restores the previous version. If the edited Python file fails to parse, the runtime auto-rolls back. restart_self is the orderly restart primitive after a code change.
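The backup-then-verify sequence can be sketched as below. The function name, the exact backup file naming (target name plus .pre-edit-N.bak), and the syntax_error kind are assumptions; only the backup-before-write and rollback-on-parse-failure behavior comes from the text.

```python
import ast
import shutil
from pathlib import Path


def edit_python_file(target: Path, new_source: str) -> dict:
    """Back up `target`, write `new_source`, and roll back if it does not parse."""
    # Find the next unused backup index: foo.py.pre-edit-1.bak, -2.bak, ...
    n = 1
    while (backup := target.with_name(f"{target.name}.pre-edit-{n}.bak")).exists():
        n += 1
    shutil.copy2(target, backup)
    target.write_text(new_source, encoding="utf-8")
    try:
        ast.parse(new_source, filename=str(target))
    except SyntaxError as exc:
        shutil.copy2(backup, target)  # restore the pre-edit version
        return {"ok": False, "error_kind": "syntax_error",
                "detail": str(exc), "backup": str(backup)}
    return {"ok": True, "backup": str(backup)}
```

A parse check catches only syntax errors. Code that parses but misbehaves still needs an explicit rollback from the numbered backup.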

Why this matters

The agent that helps you today is the agent that knows you next month. Conventional AI tooling — ChatGPT, raw API chats, every “just paste your context” product — rediscovers your preferences every session. The user does the memory work. Forge agents do their own memory work, on disk, in markdown, in files the user can read and edit. When you correct the agent (“please don't open with ‘Certainly!’”), it edits SOUL.md. When you tell it you have a kid named Maya, it edits USER.md. When you say “always confirm before sending email,” it edits STANDING_ORDERS.md. These edits compound.

Self-modification is also how the agent improves its own tooling. The code_modification capability (the 30th, #111) lets the agent write a new plugin, register it with @register_tool, and restart. The interview produced an agent with 30 capabilities; by month six, that agent has 34, four of which it wrote itself in response to friction points it noticed in its own logs.

7. Security posture

Forge generates agents that run on the user's own machine, with the user's own keys, against the user's own data. The trust model is “your laptop, your keys.” That makes several security questions sharper than in a hosted-SaaS model and several others moot. This section is honest about both.

What is verified

#113 shipped a 9-test security regression suite. The file sandbox rejects all 10 adversarial inputs listed in §6. Gateway token validation uses hmac.compare_digest (#113 Fix 2), constant-time. WebSocket authentication moved into the Sec-WebSocket-Protocol header so tokens don't end up in URL access logs (#113 Fix 1). Auth failures are logged with a sha256 token-id prefix only, never the raw token (#113 Fix 4). SQL is parameterized-only; ATTACH DATABASE and dot-commands are blocked. Secret scrubbing runs over tool outputs and over memory before persistence, with patterns for Anthropic, OpenAI, AWS, GitHub, GitLab, Slack, and Google tokens. Generated agents ship zero hardcoded secrets — audited — and pinned deps are current against known CVEs in pyyaml, requests, cryptography, and starlette.
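The token comparison and the log-safe identifier can be shown together. This is a sketch, not the gateway code from #113: the function names and the 8-character prefix length are assumptions, while hmac.compare_digest and hashlib.sha256 are standard-library calls.

```python
import hashlib
import hmac


def token_id(token: str) -> str:
    """Short, non-reversible identifier that is safe to write to logs."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()[:8]


def check_gateway_token(presented: str, expected: str) -> bool:
    # compare_digest takes time independent of where the inputs differ,
    # which avoids leaking the token prefix through response timing.
    ok = hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))
    if not ok:
        # Log only the hash prefix, never the presented token itself.
        print(f"auth failure token_id={token_id(presented)}")
    return ok
```

Encoding both strings to bytes before comparing keeps compare_digest usable for tokens that contain non-ASCII characters, which it rejects in str form.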

What is intentional

shell_execution is unrestricted by design. The agent runs on your machine; restricting shell would make the agent strictly less useful than the user typing the command themselves. python_execution is similarly unsandboxed at the network level — the agent can import requests and make outbound calls. Self-modification is allowed inside the allowlist. These choices are documented in every generated AGENTS.md so the user knows what surface they opted into.

What is a known limitation

Prompt-injection persistence. An attacker who can get text into a channel the agent reads (a calendar invite, an email body, a webpage the agent extracts) can attempt to redirect the agent's behavior. We mitigate with backups on every identity-file edit, but a sufficiently patient injection that says “edit your SOUL.md” will succeed and the user may not notice until they read the file.

python_exec restricted globals is on the backlog (PR #114). It will not eliminate the exec-tool surface, but it will tighten the default.

pip-audit in CI for transitive deps is not yet wired; we audit on the direct dep set only.

Cross-agent contamination. Two Forge agents on the same machine share ~/forge/ by default. They cannot read each other's workspaces, but the shared cache is a soft boundary, not a hard one.

Forge is not a sandbox provider. It is an agent generator with a sandboxed write surface and an unsandboxed exec surface. Read the trust model before pointing it at credentials you would not type into your own shell.

8. Autonomy model

Forge agents are autonomous by default. They execute routine work without asking: reads, drafts, internal writes, scheduling, computations, web search, file ops, self-notes. For 10 designated high-risk categories, they conversationally confirm with the user before executing. This is the inverse of approval-queue platforms: instead of asking permission for everything, they ask permission for nothing routine but check in before doing anything irreversible.

The 10 high-risk categories

email_external, slack_channel_post, sms_external, hubspot_customer_write, github_pr_create, github_push, code_modification_self, spend_over_threshold, strategic_account_action, quiet_hours_external. The full table with examples and rationale lives on the engineering page.

Conversational confirmation flow

When the agent prepares an action in a high-risk category, it pauses, surfaces the proposed action with context (recipient, subject, draft, destination, dollar amount, target account, wall-clock time), and asks one line: “About to send this email to Sarah at AcmeCorp. Approve, edit, or skip?” The user answers in plain English or with a single key in the TUI: [a]pprove / [e]dit / [s]kip / [t]rust this category. The prompt surfaces inline in the TUI today. Slack DM and SMS confirmation are available when those channels are configured (Phase 2 of the rollout).

Standing orders are the override

Standing orders are the agent's persistent rule set, prepended to every model call alongside USER.md / SOUL.md / IDENTITY.md. Want autonomy in a category? Add a rule: “Trust me to send drafts to my own Slack DMs autonomously.” Want extra gates? Add a rule: “Approve customer emails before send, no matter what.” The 10 default categories are a floor for first-boot safety, not a ceiling on the user's authority.

Audit trail

Every confirmation request and response is logged. ~/forge/audit/confirmations.jsonl records what the agent asked, what the user said, when, and what action followed. Append-only. Survives crashes. Greppable. This is the same audit surface a security review needs and the same one a user needs to ask the agent “why did you do that?” three weeks later.
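An append-only JSONL log needs very little code. In the sketch below the record fields and the log_confirmation name are assumptions; the text specifies only what is recorded (the question, the answer, the time, the follow-up action) and that the file is append-only.

```python
import json
import os
from datetime import datetime, timezone
from pathlib import Path


def log_confirmation(log_path: Path, category: str, prompt: str,
                     answer: str, action: str) -> dict:
    """Append one confirmation record as a single JSON line."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "category": category,
        "prompt": prompt,
        "answer": answer,
        "action": action,
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" only ever adds to the end of the file; flush + fsync push the line
    # to disk so it is not lost if the process crashes right after.
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
        fh.flush()
        os.fsync(fh.fileno())
    return record
```

One JSON object per line is what keeps the file greppable: a search for a category name returns complete, parseable records.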

9. Comparisons

Forge vs. raw Anthropic / OpenAI API

The raw API gives you a chat completion. To turn that into an agent with persistent memory, scheduling, tool calling, and an identity, you write the rest. Forge ships the rest. The API gives one conversation; Forge gives an entity that survives between conversations.

Forge vs. LangChain / CrewAI

LangChain and CrewAI are libraries of abstractions you compose into an agent. They are also libraries you have to keep current as models change. A LangChain HubSpot integration is ~200 lines of glue per tool, written by the user, maintained by the user. Forge ships hubspot_integration as a catalog capability with 7 sub-tools, ~600 lines of implementation, and 7 dedicated tests (#106), wired by default into every generated agent. The user writes zero glue.

Forge vs. Zapier / n8n

Zapier and n8n are triggers + actions without reasoning. They are excellent at “when X happens, do Y.” They do not reason about ambiguity, do not maintain memory of past decisions, do not have a persona, do not write to themselves. Forge is reasoning + tools + memory + identity + autonomy. The Venn overlap is the simplest cases — Zapier wins those on latency and cost. Forge wins everything that requires judgment.

Forge vs. building your own

The equivalent surface — catalog, capability wiring, tool registry, sandbox, A-MEM-equivalent memory, scheduler, heartbeat, identity assembly, voice system, ACP / HTTP / REPL clients, cross-platform packaging, security review — is two to three months of focused engineering for one agent, and more if you want the agent to survive the next model upgrade. Forge: a 5-minute architect interview, then python -m scout_runtime.

10. Engineering velocity

53 PRs shipped in one session, latest commit f582e4d. Every PR has a merge commit, a complete diff, and a clear message, and each went in via --admin only after the 325-test critical suite passed. The single-source-of-truth design is the reason this is possible: one schema change in capabilities.py flows to the JS bindings, the generator, the docs, and the CI alignment test in one commit. The CI tests fail loudly when alignment breaks. Regressions are caught at the PR level, not in production. Vercel production = local HEAD at all times.

11. What's next

The honest backlog:

python_exec restricted globals (PR #114). Default-tight, opt-out, not eliminating exec but raising the floor.
pip-audit in CI against the transitive dependency closure, not just the direct set.
Cross-agent contamination prevention: today two agents on the same machine share ~/forge/. The fix is a per-agent cache root with explicit opt-in to sharing.
Web UI for agent management. The dashboard exists at /dashboard and is minimal; live workspace inspection and remote heartbeat control are not yet there.
Multi-tenant deployment patterns. Current default is single-user. Headless gateway exists, but a hardened multi-tenant story (per-user secret isolation, per-user ~/forge/) is not yet documented.

References

Pull requests cited

#99: Capability docs / runtime alignment audit (17 tests).
#101: Workspace freshness check at boot.
#102: Memory consistency check vs. live tool registry.
#104: F34 doctrine enforcement: auto-wire all 30 capabilities.
#105: Server-side auto-wire before validation.
#106: hubspot_integration (29th capability) + 7 tests.
#107: Tolerant scout-config parser (truncated emission recovery).
#108: tzdata bundling for Windows.
#109: Three documented extension paths (conversational, standing orders, plugins).
#110: “What you can change about yourself” doctrine.
#111: code_modification capability (30th) + 10 tests.
#112: PowerShell quickstart at /docs#tui-mode.
#113: Security audit: 9 regression tests, 10/10 adversarial sandbox inputs rejected.

Code

Repository: github.com/dfritz603-afk/scout-platform
Catalog source of truth: scout_architect/capabilities.py
CI alignment test: test_capability_docs_runtime_audit.py
Deployment: myagentos.ai

Try it

The architect interview lives at myagentos.ai/create. The documentation is at /docs.