Template

Incident Commander

Ingests alerts from PagerDuty/Datadog/Sentry, opens triage threads in Telegram, runs diagnostic queries via sandboxed SSH, and learns from postmortems.

Continuous · Multi-channel · Sandbox

For: SRE/devops teams and on-call rotations running production systems.

PagerDuty fires at 3am. Most SREs spend the first 5 minutes opening a thread, posting the triage checklist they have memorized, and waiting for someone else on the team to acknowledge. Sentinel, the agent this template sets up, does those 5 minutes of work in 5 seconds.

What it does

When an alert lands (email from PagerDuty/Datadog/Sentry, or a webhook from your forwarder), Sentinel:

  1. Opens a triage Telegram thread. Posts the alert, sev classification, and your team's triage checklist.
  2. Identifies the on-call. Pings them by name; mentions the secondary if no acknowledgment in 3 minutes.
  3. Runs diagnostic queries. Connects over SSH to the affected host and runs the read-only commands you've pre-approved (disk space, last 100 log lines, process list, recent deploys), then posts the results in the thread.
  4. Surfaces prior incidents. "This signature looks like incident #428 from Feb 14. Root cause was broken log rotation on auth-svc after a deploy. Want the postmortem?"
  5. Learns. When the human resolves the incident and notes what fixed it, Sentinel writes it to MEMORY.md. Next time this signature fires, it tells you what worked last time.
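Steps 4 and 5 come down to comparing a new alert signature against signatures remembered from earlier incidents. How Sentinel actually does this isn't described here, so treat the following as a minimal sketch under stated assumptions: the `PastIncident` record, the stored signature text for #428, the second history entry, and the 0.8 similarity threshold are all hypothetical, and fuzzy string matching via Python's `difflib` is a stand-in technique.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass
class PastIncident:
    number: int
    signature: str   # alert text recorded when the incident fired (assumed format)
    root_cause: str  # what the human wrote down at resolution time


def normalize(signature: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return " ".join(signature.lower().split())


def best_match(signature: str, history: List[PastIncident],
               threshold: float = 0.8) -> Optional[PastIncident]:
    """Return the most similar past incident, or None if nothing clears the threshold."""
    best, best_score = None, 0.0
    for incident in history:
        score = SequenceMatcher(
            None, normalize(signature), normalize(incident.signature)
        ).ratio()
        if score > best_score:
            best, best_score = incident, score
    return best if best_score >= threshold else None


# Hypothetical memory contents; only the incident number and root cause
# for #428 come from the example on this page.
history = [
    PastIncident(428, "auth-svc /healthz returning 503 for 10min",
                 "log rotation broken after a deploy added verbose logging"),
    PastIncident(101, "billing-db replication lag above 30s",
                 "hypothetical entry for illustration"),
]

match = best_match("auth-svc /healthz returning 503 for 5min", history)
if match is not None:
    print(f"This signature looks like #{match.number}: {match.root_cause}")
```

A threshold keeps unrelated alerts from being reported as "matches"; where to set it is a tuning decision that depends on how noisy your alert text is.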

Why this agent shape

  • Continuous operation — alerts can fire at any time, so Sentinel runs around the clock instead of on a schedule.
  • Multi-channel — alert sources (email, webhook) feed in; triage threads (Telegram) go out. The agent translates between channel formats.
  • Sandbox — SSH diagnostic queries run inside scout_runtime's sandbox with explicit approval gates. The agent can't rm -rf your prod; it can run df -h and tail -100 /var/log/syslog.
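To make the sandbox point concrete, here is a rough sketch of an allowlist check of the kind that lets `df -h` through and refuses `rm -rf`. This is not scout_runtime's implementation, which isn't shown here; the `APPROVED` set, the `is_pre_approved` function, and the exact-match policy are assumptions for illustration.

```python
import shlex

# Hypothetical allowlist: each entry is the exact token sequence that may run
# without per-incident approval.
APPROVED = {
    ("df", "-h"),
    ("tail", "-100", "/var/log/syslog"),
}

# Characters that would let one command smuggle in another via the shell.
SHELL_METACHARACTERS = set(";|&$`><\n")


def is_pre_approved(command: str) -> bool:
    """Allow a command only if it exactly matches an approved token sequence."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False
    try:
        tokens = tuple(shlex.split(command))
    except ValueError:  # e.g. unbalanced quotes
        return False
    return tokens in APPROVED


for cmd in ["df -h", "tail -100 /var/log/syslog", "rm -rf /", "df -h; rm -rf /"]:
    print(f"{cmd!r}: {'run' if is_pre_approved(cmd) else 'needs approval'}")
```

Matching whole token sequences is stricter than matching only the executable name: approving `tail` by name alone would also approve `tail` on any file the SSH user can read.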

What you'll configure

After picking this template, the architect will ask you to:

  1. Set up the alert pipeline. Forward PagerDuty/Datadog/Sentry alerts to a dedicated email address or webhook endpoint that Sentinel watches.
  2. Define your on-call rotation. Sentinel needs to know who's primary right now. It can read the rotation from a Google Calendar or a JSON file.
  3. Pre-approve diagnostic queries. List the read-only commands the agent may run without per-incident approval. The default set covers disk, memory, recent logs, and last-deploy timestamps.
  4. Configure your triage checklist. Plain English: "always check recent deploys, then disk, then upstream dependencies, then customer reports." Sentinel renders this as a thread template per incident.
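For the JSON-file option in step 2, a rotation file and lookup might look roughly like the sketch below. The schema (`shifts`, `primary`, `secondary`, `ack_timeout_minutes`), the shift boundaries, and the `who_to_page` helper are assumptions made for illustration; the architect may ask for a different shape. The 3-minute escalation window mirrors the behavior described under "What it does".

```python
import json
from datetime import datetime, timedelta

# Hypothetical rotation file contents; field names are illustrative only.
ROTATION_JSON = """
{
  "ack_timeout_minutes": 3,
  "shifts": [
    {"start": "2024-03-11T18:00", "end": "2024-03-12T18:00",
     "primary": "dan", "secondary": "sarah"}
  ]
}
"""


def who_to_page(rotation, alert_time, now, acknowledged):
    """Return the handles to ping: primary first, plus secondary after the ack timeout."""
    timeout = timedelta(minutes=rotation["ack_timeout_minutes"])
    for shift in rotation["shifts"]:
        start = datetime.fromisoformat(shift["start"])
        end = datetime.fromisoformat(shift["end"])
        if start <= alert_time < end:
            targets = [shift["primary"]]
            if not acknowledged and now - alert_time >= timeout:
                targets.append(shift["secondary"])
            return targets
    return []  # nobody scheduled; a real setup would want a fallback here


rotation = json.loads(ROTATION_JSON)
alert = datetime(2024, 3, 12, 3, 14)

print(who_to_page(rotation, alert, alert + timedelta(minutes=1), acknowledged=False))
print(who_to_page(rotation, alert, alert + timedelta(minutes=3), acknowledged=False))
```

The first call returns only the primary; the second, three minutes in with no acknowledgment, adds the secondary.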

What it looks like running

[Telegram thread, 3:14am]
🔥 Sev-1 — auth-svc DOWN

PagerDuty alert at 3:14am EST.
Signature: "auth-svc /healthz returning 503 for 5min"
On-call primary: @dan (you)
Secondary: @sarah (paging at 3:17 if no ack)

— Triage checklist —
[ ] Recent deploy? Last deploy to auth-svc: 2:47am EST (27 min ago)
[ ] Disk on auth-svc-prod-01: 94% used (df -h posted below)
[ ] Memory: 87% used, no OOM kills in last 1h
[ ] Upstream: db.prod healthy, redis.prod healthy
[ ] Customer reports: none yet (1 internal Slack ping)

— Diagnostics auto-run —
$ df -h /var
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p1  100G   94G  6.0G  94% /var
$ tail -20 /var/log/auth-svc/error.log
[3:11:42] disk write failed: ENOSPC
[3:11:43] disk write failed: ENOSPC
... (18 more)

— Prior incident match —
This signature looks like #428 (Feb 14). Root cause was log rotation
broken after a deploy added verbose logging. Postmortem fix: `logrotate`
config update + alert on `/var` > 90%.

Want me to run the logrotate fix? (read-only by default; this requires
escalated approval)
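The postmortem fix in the thread above includes an alert on `/var` above 90%. As a sketch of how the posted `df -h` output could feed such a check, the snippet below parses the Use% column and lists mounts over a threshold. The function names are hypothetical, and a production check would more likely live in your monitoring system than in a text parser.

```python
DF_OUTPUT = """\
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p1  100G   94G  6.0G  94% /var
"""


def parse_df_use_percent(df_output: str) -> dict:
    """Map each mount point to its Use% as an integer, skipping the header row."""
    usage = {}
    for line in df_output.strip().splitlines()[1:]:
        fields = line.split()
        if len(fields) < 6:
            continue  # wrapped or malformed row; ignore it in this sketch
        mount = " ".join(fields[5:])  # tolerate spaces in the mount path
        usage[mount] = int(fields[4].rstrip("%"))
    return usage


def mounts_over(usage: dict, threshold_percent: int) -> list:
    """Return mount points whose usage is strictly above the threshold."""
    return sorted(m for m, pct in usage.items() if pct > threshold_percent)


usage = parse_df_use_percent(DF_OUTPUT)
print(mounts_over(usage, 90))
```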

When this template isn't right

  • You don't want an agent in your incident loop — fair. Sentinel is for teams who'd rather have a checklist-runner than a quiet alert.
  • Your alerts come from a custom source that doesn't go through email or a webhook — the architect can add custom monitor types, but expect a setup session.
  • You don't trust agents anywhere near production — Sentinel runs read-only by default and asks before any state change. But if "agent has SSH" is a non-starter for your security posture, skip this one.

What you can ask the architect to change

  • "Slack instead of Telegram for triage threads" — channel swap
  • "Run on local model — we don't want incident data going to Anthropic" — air-gap mode (with caveats: alert ingestion still touches your email/webhook surface)
  • "Wire to our incident management system (Incident.io / FireHydrant)" — add output channels for those tools
  • "Auto-create a Linear ticket on every sev-2+" — add an action plugin

See it in action

Sample interactions illustrating how Sentinel actually behaves.

3:14am, PagerDuty alert fires

You

(no input — email channel triggered)

Agent

[Telegram] 🔥 Sev-1 alert: auth-svc DOWN. Opening triage thread.

On-call primary: @dan. Secondary @sarah pages at 3:17 if no ack.
Last deploy to auth-svc: 27min ago. Disk on auth-svc-prod-01: 94% used.
This signature matches incident #428 (Feb 14) — log rotation broken after deploy. Postmortem fix: logrotate config + alert on /var > 90%.

Running standard diagnostics now; results in 30s.

Mid-incident, the human asks for context

You

what changed in the deploy 27 min ago

Agent

Pulled the git log + deploy manifest:

Deploy d2c4f9a, 2:47am EST by @kev (CI auto-merge from main).
Commits in deploy (3):
- ec1d8a: enable verbose logging on auth-svc /healthz
- 8b9f2e: bump go-jwt dependency 4.5.0 → 4.5.1
- 1a3c0d: typo fix in error message

The verbose logging change matches incident #428's root cause exactly. Strong hypothesis: log volume + disk full at 94%.

Post-resolution, human marks the incident closed

You

/resolved — yeah it was the logging change, reverted ec1d8a and disk dropped to 41%

Agent

Logged to MEMORY.md:

"auth-svc disk-fill pattern: when /var hits 94% with ENOSPC errors in auth-svc/error.log, check the last 24h of deploys for verbose logging additions. The fix is reverting the verbose-logging commit. Saw this Feb 14 (incident #428) and Mar 12 (this one). If we see a 3rd, the lesson is to alert on /var > 75% post-deploy."

Incident #547 closed at 4:08am. MTTR: 54 minutes. Postmortem draft saved to memory/2024-03-12-auth-svc.md.
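The MTTR figure follows directly from the timestamps in this example: alert at 3:14am, closed at 4:08am. A small sketch of that arithmetic, with the date taken from the postmortem filename; the helper names and the filename pattern are assumptions inferred from this single example.

```python
from datetime import datetime


def minutes_to_resolve(opened: datetime, closed: datetime) -> int:
    """Whole minutes between the alert firing and the incident being closed."""
    if closed < opened:
        raise ValueError("incident closed before it opened")
    return int((closed - opened).total_seconds() // 60)


def postmortem_path(opened: datetime, service: str) -> str:
    """Build a draft path in the memory/YYYY-MM-DD-service.md style (assumed pattern)."""
    return f"memory/{opened:%Y-%m-%d}-{service}.md"


opened = datetime(2024, 3, 12, 3, 14)
closed = datetime(2024, 3, 12, 4, 8)

print(f"MTTR: {minutes_to_resolve(opened, closed)} minutes")  # MTTR: 54 minutes
print(postmortem_path(opened, "auth-svc"))  # memory/2024-03-12-auth-svc.md
```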

Ready to deploy this agent?

Picking this template loads the configuration above into the Forge Architect. You can refine anything via plain English before deploying.
