SRE Runbook Generator
Internal tooling that generates incident runbooks from alert definitions and historical postmortems — drops time-to-first-action when an SRE gets paged into a system they don't know cold.
Note from Alex: This is a scaffold. Sections marked TODO are where the real technical specifics and outcomes go. The shape is right; fill in the substance when there's time.
The problem this solves
TODO: 2–3 paragraphs setting up the pain. Beats to hit:
- Multi-cloud GenAI platforms at Merck — the surface area no single SRE knows cold.
- The pager-at-3am moment: alert fires, on-call has to find the runbook, the runbook is stale or missing, the related postmortems are in a different system, time-to-first-action stretches.
- "What if the runbook just generated itself, freshly, from the alert and the relevant historical context?"
Architecture
TODO: cover the major pieces:
- Inputs — alert definitions (Prometheus / Grafana Alerting / Nobl9 alert specs), historical postmortems (ServiceNow KB? Confluence? Notion? markdown in git?), service catalog metadata
- Context assembly — how related postmortems get retrieved (RAG? tag-based lookup? both?), how the alert spec gets parsed
- LLM — which model (and why — this is a "low frequency, high stakes" use case where quality matters more than cost)
- Output format — markdown? structured JSON the PagerDuty UI can consume? Both?
- Where it runs — on-demand at alert time, or pre-generated and refreshed? Latency budget?
Architecture diagram here would be high-leverage.
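Until the real retrieval design is written up, here is a minimal sketch of the tag-based lookup option from the context assembly bullet. Everything in it is an assumption for illustration: the `Postmortem` shape, the `related_postmortems` helper, and the idea of matching alert label values against postmortem tags are hypothetical, not the shipped design.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Postmortem:
    """Hypothetical postmortem record; the real corpus schema is still TODO."""
    title: str
    tags: set[str] = field(default_factory=set)


def related_postmortems(
    alert_labels: dict[str, str],
    corpus: list[Postmortem],
    limit: int = 3,
) -> list[Postmortem]:
    """Rank postmortems by how many tags they share with the alert's label values."""
    wanted = {value.lower() for value in alert_labels.values()}
    scored = []
    for pm in corpus:
        overlap = len(wanted & {tag.lower() for tag in pm.tags})
        if overlap:  # skip postmortems with no shared tags
            scored.append((overlap, pm))
    # Stable sort: ties keep their original corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [pm for _, pm in scored[:limit]]
```

A RAG-based retriever could sit behind the same function signature, which would make it easy to compare the two approaches or combine them.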
What's working
TODO: bullets on shipped behavior:
- Generation time: X seconds end-to-end
- Coverage: X% of active alerts have generated runbooks
- Quality bar: SRE on-call rates them "useful" Y% of the time (if you have eval data)
- Adoption: who's using it, how often
What's still hard
TODO: honest section on open problems:
- Stale postmortems poisoning context — how to filter?
- Hallucinated commands (the SRE worst case: the runbook says "run kubectl delete pod -n prod ..." and it's wrong). Mitigation?
- New services with no postmortem history: what does the system do when there's nothing to retrieve?
- Trust calibration — how does the SRE know when to trust the runbook vs. fall back to first principles?
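For the hallucinated-commands problem in the list above, one possible mitigation is to scan generated runbook text for destructive commands and flag those lines for human confirmation before anyone runs them. The sketch below is an assumption, not the current behavior: `DESTRUCTIVE_PATTERNS` and `flag_destructive_lines` are hypothetical names, and the pattern list is illustrative rather than complete.

```python
import re

# Hypothetical deny-list; a real one would be owned and reviewed by the SRE team.
DESTRUCTIVE_PATTERNS = [
    re.compile(r"\bkubectl\s+delete\b"),
    re.compile(r"\brm\s+-rf\b"),
    re.compile(r"\bdrop\s+(table|database)\b", re.IGNORECASE),
]


def flag_destructive_lines(runbook_text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs for lines matching a destructive pattern."""
    flagged = []
    for number, line in enumerate(runbook_text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in DESTRUCTIVE_PATTERNS):
            flagged.append((number, line.strip()))
    return flagged
```

Pattern matching like this only catches known-dangerous shapes. It does not verify that a command is correct, so it complements trust calibration and does not replace it.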
Why this matters beyond Merck
TODO: the generalization paragraph. What's the broader insight? Examples:
- LLM-generated runbooks as a category — what's the right shape across orgs?
- Postmortem corpus as ML training data — implications for incident review hygiene
- The cost calculus of LLM-at-incident-time vs. pre-generated cached runbooks
- How this informs the "AI for SREs" conversation generally
Tech stack
TODO: replace with actual stack:
- Language: Python (probably)
- LLM:
- RAG store:
- Trigger: cron? on-alert? both?
- Integration points: PagerDuty? ServiceNow? Slack?
- Hosting: where it runs at Merck
Adoption + impact
TODO: if there are real numbers, drop them here. Examples:
- Time-to-first-action: down from X min to Y min
- Postmortems referenced per incident: up from N to M
- Engineer-hours saved per quarter: estimate
If no numbers yet, say so — "rolled out to N teams, eval in progress" is honest and fine.
Status
Internal Merck tooling — not open source. Patterns are generalizable and worth writing about; specifics are not for public sharing.