Incident Commander

PagerDuty fires at 3am. Most SREs spend the first 5 minutes opening a thread, posting the triage checklist they have memorized, and waiting for someone else on the team to acknowledge. Sentinel does those 5 minutes of work in 5 seconds.

What it does

When an alert lands (email from PagerDuty/Datadog/Sentry, or a webhook from your forwarder), Sentinel:

Opens a triage Telegram thread. Posts the alert, sev classification, and your team's triage checklist.
Identifies the on-call. Pings them by name; mentions the secondary if no acknowledgment in 3 minutes.
Runs diagnostic queries. SSHes into the affected host and runs the read-only commands you've pre-approved (disk space, last 100 log lines, process list, recent deploys), then posts the results in the thread.
Surfaces prior incidents. "This signature looks like incident #428 from Feb 14. Root cause was broken log rotation on auth-svc. Want the postmortem?"
Learns. When the human resolves the incident and notes what fixed it, Sentinel writes it to MEMORY.md. Next time this signature fires, it tells you what worked last time.
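
The last two steps, matching a new alert against past incidents and recalling what fixed them, can be sketched as follows. This is a minimal illustration, not Sentinel's actual implementation: the `Incident` record, the `normalize` and `find_prior` helpers, and the in-memory list stand in for whatever Sentinel keeps in MEMORY.md, and the matching rule (exact match after masking numbers) is an assumption. The stored signature below is made-up sample data.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    """Hypothetical record of a resolved incident (not Sentinel's real schema)."""
    number: int
    signature: str
    root_cause: str
    fix: str


def normalize(signature: str) -> str:
    """Lowercase and mask digit runs so '5min' and '7min' compare equal.

    Note this also makes '503' and '500' compare equal, which may be too loose.
    """
    return re.sub(r"\d+", "N", signature.lower()).strip()


def find_prior(signature: str, history: list) -> Optional[Incident]:
    """Return the most recent past incident whose normalized signature matches."""
    key = normalize(signature)
    for incident in reversed(history):  # history is assumed to be oldest-first
        if normalize(incident.signature) == key:
            return incident
    return None


history = [
    Incident(
        number=428,
        signature="auth-svc /healthz returning 503 for 7min",  # sample data
        root_cause="log rotation broken after a deploy added verbose logging",
        fix="logrotate config update + alert on /var > 90%",
    ),
]

match = find_prior("auth-svc /healthz returning 503 for 5min", history)
if match:
    print(f"This signature looks like #{match.number}. Last fix: {match.fix}")
```

Exact matching on a normalized string is the simplest rule that works for this example; a real matcher would probably need something fuzzier, and would need to decide how much history to trust.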

Why this agent shape

Continuous operation — alerts can fire any time; Sentinel never sleeps and never misses one.
Multi-channel — alert sources (email, webhook) feed in; triage threads (Telegram) go out. The agent translates between channel formats.
Sandbox — SSH diagnostic queries run inside scout_runtime's sandbox with explicit approval gates. The agent can't rm -rf your prod; it can run df -h and tail -100 /var/log/syslog.
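
To make the approval gate concrete, here is a minimal sketch of an allowlist check. It is not scout_runtime's actual mechanism, which this page does not describe; the `APPROVED` table and the `is_preapproved` helper are hypothetical. The idea is that a command runs only if its tokenized form exactly matches a pre-approved entry, so shell tricks such as chaining with `;` fall through to "needs approval".

```python
import shlex

# Hypothetical allowlist: program name -> exact argument tuples that are allowed.
APPROVED = {
    "df": {("-h",), ("-h", "/var")},
    "tail": {("-n", "100", "/var/log/syslog")},
    "ps": {("aux",)},
}


def is_preapproved(command: str) -> bool:
    """Return True only if the command tokenizes to an exact allowlisted argv."""
    try:
        argv = shlex.split(command)
    except ValueError:  # unbalanced quotes and similar parse errors
        return False
    if not argv:
        return False
    program, args = argv[0], tuple(argv[1:])
    return args in APPROVED.get(program, set())


for cmd in ["df -h", "tail -n 100 /var/log/syslog", "rm -rf /", "df -h; rm -rf /"]:
    verdict = "run" if is_preapproved(cmd) else "needs escalated approval"
    print(f"{cmd!r}: {verdict}")
```

Exact-match entries are deliberately strict. A real deployment would likely want per-host paths or patterns, and would execute the approved argv directly rather than through a shell.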

What you'll configure

After picking this template, the architect will ask you to:

Set up the alert pipeline. Forward PagerDuty/Datadog/Sentry to a dedicated email or webhook endpoint. Sentinel watches that.
Define your on-call rotation. Sentinel needs to know who's primary right now. It can read this from a Google Calendar or a JSON file.
Pre-approve diagnostic queries. List the read-only commands the agent may run without per-incident approval. The default set covers disk, memory, recent logs, and last deploy timestamps.
Configure your triage checklist. Plain English: "always check recent deploys, then disk, then upstream dependencies, then customer reports." Sentinel renders this as a thread template per incident.
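
As an illustration of the on-call step, the sketch below reads a rotation from JSON and works out who to ping first and when to page the secondary. The file layout (`primary`, `secondary`, `start`, `end` keys), the dates, and the function names are assumptions made for this example, not a format Sentinel requires. The 3-minute window is the one described under "What it does".

```python
import json
from datetime import datetime, timedelta, timezone

ACK_WINDOW = timedelta(minutes=3)  # page the secondary after this long without an ack

# Hypothetical rotation file contents; the schema and dates are assumptions.
ROTATION_JSON = """
[
  {"primary": "dan", "secondary": "sarah",
   "start": "2024-03-04T00:00:00+00:00", "end": "2024-03-11T00:00:00+00:00"},
  {"primary": "sarah", "secondary": "dan",
   "start": "2024-03-11T00:00:00+00:00", "end": "2024-03-18T00:00:00+00:00"}
]
"""


def current_shift(rotation: list, now: datetime) -> dict:
    """Return the shift whose [start, end) interval contains `now`."""
    for shift in rotation:
        start = datetime.fromisoformat(shift["start"])
        end = datetime.fromisoformat(shift["end"])
        if start <= now < end:
            return shift
    raise LookupError("no on-call shift covers this time")


def escalation_plan(rotation: list, alert_time: datetime) -> dict:
    """Who to ping first, and when to page the secondary if nobody acks."""
    shift = current_shift(rotation, alert_time)
    return {
        "primary": shift["primary"],
        "secondary": shift["secondary"],
        "page_secondary_at": alert_time + ACK_WINDOW,
    }


rotation = json.loads(ROTATION_JSON)
alert_time = datetime(2024, 3, 5, 8, 14, tzinfo=timezone.utc)
plan = escalation_plan(rotation, alert_time)
print(plan["primary"], plan["secondary"], plan["page_secondary_at"].isoformat())
```

Raising when no shift covers the alert time is a deliberate choice here: a gap in the rotation is something a human should hear about, not something to paper over with a default.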

What it looks like running

[Telegram thread, 3:14am]
🔥 Sev-1 — auth-svc DOWN

PagerDuty alert at 3:14am EST.
Signature: "auth-svc /healthz returning 503 for 5min"
On-call primary: @dan (you)
Secondary: @sarah (paging at 3:17 if no ack)

— Triage checklist —
[ ] Recent deploy? Last deploy to auth-svc: 2:47am EST (27 min ago)
[ ] Disk on auth-svc-prod-01: 94% used (df -h posted below)
[ ] Memory: 87% used, no OOM kills in last 1h
[ ] Upstream: db.prod healthy, redis.prod healthy
[ ] Customer reports: none yet (1 internal Slack ping)

— Diagnostics auto-run —
$ df -h /var
Filesystem  Size  Used Avail Use% Mounted on
/dev/nvme0n1p1  100G   94G  6.0G  94%  /var
$ tail -20 /var/log/auth-svc/error.log
[3:11:42] disk write failed: ENOSPC
[3:11:43] disk write failed: ENOSPC
... (18 more)

— Prior incident match —
This signature looks like #428 (Feb 14). Root cause was log rotation
broken after a deploy added verbose logging. Postmortem fix: `logrotate`
config update + alert on `/var` > 90%.

Want me to run the logrotate fix? (read-only by default; this requires
escalated approval)
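
The postmortem follow-up in this example, an alert when `/var` passes 90%, comes down to a simple threshold check. Below is a minimal sketch using Python's standard library. The `percent_used` and `over_threshold` helpers are hypothetical, and the percentage is computed as used / (used + available), which is approximately how `df` reports Use%; small rounding differences are possible.

```python
import shutil


def percent_used(used: float, available: float) -> float:
    """Usage as a percentage of the space that is used or still available."""
    return 100.0 * used / (used + available)


def over_threshold(path: str, limit_percent: float = 90.0) -> bool:
    """True if the filesystem holding `path` is above the usage limit."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return percent_used(usage.used, usage.free) > limit_percent


# Figures from the df output in the thread above: 94G used, 6.0G available.
print(percent_used(94, 6.0))

# Check a real filesystem on the machine running this script.
print(over_threshold("/"))
```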

When this template isn't right

You don't want an agent in your incident loop — fair. Sentinel is for teams who'd rather have a checklist-runner than a quiet alert.
Your alerts come from a custom source not via email/webhook — the architect can add custom monitor types; expect a setup session.
You don't trust agents anywhere near production — Sentinel runs read-only by default and asks before any state change. But if "agent has SSH" is a non-starter for your security posture, skip this one.

What you can ask the architect to change

"Slack instead of Telegram for triage threads" — channel swap
"Run on local model — we don't want incident data going to Anthropic" — air-gap mode (with caveats: alert ingestion still touches your email/webhook surface)
"Wire to our incident management system (Incident.io / FireHydrant)" — add output channels for those tools
"Auto-create a Linear ticket on every sev-2+" — add an action plugin
