Incident Commander
PagerDuty fires at 3am. Most SREs spend the first 5 minutes opening a thread, posting the triage checklist they have memorized, and waiting for someone else on the team to acknowledge. Sentinel does those 5 minutes of work in 5 seconds.
What it does
When an alert lands (email from PagerDuty/Datadog/Sentry, or a webhook from your forwarder), Sentinel:
- Opens a triage Telegram thread. Posts the alert, sev classification, and your team's triage checklist.
- Identifies the on-call. Pings them by name; mentions the secondary if no acknowledgment in 3 minutes.
- Runs diagnostic queries. Connects to the affected host over SSH and runs the read-only commands you've pre-approved (disk space, last 100 log lines, process list, recent deploys), then posts the results in the thread.
- Surfaces prior incidents. "This signature looks like incident #428 from Feb 14. Root cause was config drift on `auth-svc`. Want the postmortem?"
- Learns. When the human resolves the incident and notes what fixed it, Sentinel writes it to `MEMORY.md`. Next time this signature fires, it tells you what worked last time.
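The "surfaces prior incidents" and "learns" steps can be pictured as a lookup keyed on a normalized alert signature. The sketch below is only an illustration under stated assumptions: the `PastIncident` record, the `normalize` rule (masking durations so "for 5min" and "for 12min" compare equal), and the in-memory `history` list are all hypothetical, and nothing here reflects how Sentinel actually stores or matches entries in `MEMORY.md`.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PastIncident:
    # Hypothetical record shape; the real MEMORY.md layout is not specified.
    incident_id: str
    signature: str
    fix_note: str


def normalize(signature: str) -> str:
    """Lowercase the signature and mask durations such as '5min' or '30s'."""
    return re.sub(r"\d+(?=\s*(?:s|sec|min|h)\b)", "N", signature.lower()).strip()


def find_prior(signature: str, history: List[PastIncident]) -> Optional[PastIncident]:
    """Return the most recently remembered incident with a matching signature."""
    key = normalize(signature)
    for incident in reversed(history):
        if normalize(incident.signature) == key:
            return incident
    return None


def remember(history: List[PastIncident], incident: PastIncident) -> None:
    """Record what fixed an incident so the next match can surface it."""
    history.append(incident)
```

With one remembered incident whose signature is "auth-svc /healthz returning 503 for 5min", a new alert reading "... 503 for 12min" would match it, while "... 500 for 5min" would not, because only the duration is masked and the status code is compared as-is.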
Why this agent shape
- Continuous operation: alerts can fire at any time, and Sentinel is always running to receive them.
- Multi-channel: alert sources (email, webhook) feed in; triage threads (Telegram) go out. The agent translates between channel formats.
- Sandbox: SSH diagnostic queries run inside scout_runtime's sandbox with explicit approval gates. The agent can't `rm -rf` your prod; it can run `df -h` and `tail -100 /var/log/syslog`.
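One way to think about the approval gate is as an exact-match allowlist checked before anything reaches SSH. The following is a minimal sketch, not scout_runtime's actual mechanism: `PRE_APPROVED` and `is_pre_approved` are hypothetical names, and the two entries are simply the commands named in the Sandbox bullet.

```python
import shlex

# Hypothetical allowlist: each entry is the exact argument vector
# the agent may run without per-incident approval.
PRE_APPROVED = [
    ["df", "-h"],
    ["tail", "-100", "/var/log/syslog"],
]


def is_pre_approved(command: str) -> bool:
    """Return True only if the command tokenizes to an allowlisted argv."""
    try:
        argv = shlex.split(command)
    except ValueError:
        # Unbalanced quotes and similar parse errors are rejected outright.
        return False
    return argv in PRE_APPROVED
```

Matching the whole tokenized command, instead of just its first word, means a string like `df -h; rm -rf /` is rejected: it does not tokenize to any allowlisted entry. Anything that fails the check would fall through to the explicit approval path.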
What you'll configure
After picking this template, the architect will ask you to:
- Set up the alert pipeline. Forward PagerDuty/Datadog/Sentry to a dedicated email or webhook endpoint. Sentinel watches that.
- Define your on-call rotation. Sentinel needs to know who's primary right now. It can read the rotation from a Google Calendar or a JSON file.
- Pre-approve diagnostic queries. List the read-only commands the agent may run without per-incident approval. The default set covers disk, memory, recent logs, and last deploy timestamps.
- Configure your triage checklist. Plain English: "always check recent deploys, then disk, then upstream dependencies, then customer reports." Sentinel renders this as a thread template per incident.
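To make the on-call rotation item concrete, here is a rough sketch of resolving the primary and secondary from a JSON rotation file and applying the 3-minute escalation rule. The JSON shape (a list of shifts with `start`, `end`, `primary`, `secondary`) and the function names are assumptions made for illustration; the template's real rotation format may differ, and time zone handling is left out.

```python
import json
from datetime import datetime, timedelta

# The secondary is mentioned if the primary has not acknowledged within 3 minutes.
ACK_TIMEOUT = timedelta(minutes=3)


def load_rotation(text):
    """Parse a hypothetical rotation file: a JSON list of shifts with ISO-8601 times."""
    shifts = json.loads(text)
    for shift in shifts:
        shift["start"] = datetime.fromisoformat(shift["start"])
        shift["end"] = datetime.fromisoformat(shift["end"])
    return shifts


def on_call(shifts, when):
    """Return (primary, secondary) for the shift covering `when`, or None."""
    for shift in shifts:
        if shift["start"] <= when < shift["end"]:
            return shift["primary"], shift["secondary"]
    return None


def who_to_ping(shifts, alert_time, now, acked):
    """Handles to mention at `now`: the primary first, plus the secondary after the timeout."""
    if acked:
        return []
    pair = on_call(shifts, alert_time)
    if pair is None:
        return []
    primary, secondary = pair
    if now - alert_time >= ACK_TIMEOUT:
        return [primary, secondary]
    return [primary]
```

For a shift listing `@dan` as primary and `@sarah` as secondary, an alert at 3:14 would mention only `@dan` at first and add `@sarah` from 3:17 onward if nobody has acknowledged.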
What it looks like running
[Telegram thread, 3:14am]
🔥 Sev-1 — auth-svc DOWN
PagerDuty alert at 3:14am EST.
Signature: "auth-svc /healthz returning 503 for 5min"
On-call primary: @dan (you)
Secondary: @sarah (paging at 3:17 if no ack)
— Triage checklist —
[ ] Recent deploy? Last deploy to auth-svc: 2:47am EST (27 min ago)
[ ] Disk on auth-svc-prod-01: 94% used (df -h posted below)
[ ] Memory: 87% used, no OOM kills in last 1h
[ ] Upstream: db.prod healthy, redis.prod healthy
[ ] Customer reports: none yet (1 internal slack ping)
— Diagnostics auto-run —
$ df -h /var
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p1 100G 94G 6.0G 94% /var
$ tail -20 /var/log/auth-svc/error.log
[3:11:42] disk write failed: ENOSPC
[3:11:43] disk write failed: ENOSPC
... (18 more)
— Prior incident match —
This signature looks like #428 (Feb 14). Root cause was log rotation
broken after a deploy added verbose logging. Postmortem fix: `logrotate`
config update + alert on `/var` > 90%.
Want me to run the logrotate fix? (read-only by default; this requires
escalated approval)
When this template isn't right
- You don't want an agent in your incident loop — fair. Sentinel is for teams who'd rather have a checklist-runner than a quiet alert.
- Your alerts come from a custom source not via email/webhook — the architect can add custom monitor types; expect a setup session.
- You don't trust agents anywhere near production — Sentinel runs read-only by default and asks before any state change. But if "agent has SSH" is a non-starter for your security posture, skip this one.
What you can ask the architect to change
- "Slack instead of Telegram for triage threads" — channel swap
- "Run on local model — we don't want incident data going to Anthropic" — air-gap mode (with caveats: alert ingestion still touches your email/webhook surface)
- "Wire to our incident management system (Incident.io / FireHydrant)" — add output channels for those tools
- "Auto-create a Linear ticket on every sev-2+" — add an action plugin