SRE Runbook Generator

Note from Alex: This is a scaffold. Sections marked TODO are where the real technical specifics and outcomes go. The shape is right — fill in the substance when there's time.

The problem this solves

TODO: 2–3 paragraphs setting up the pain. Beats to hit:

Multi-cloud GenAI platforms at Merck — the surface area no single SRE knows cold.
The pager-at-3am moment: alert fires, on-call has to find the runbook, the runbook is stale or missing, the related postmortems are in a different system, time-to-first-action stretches.
"What if the runbook just generated itself, freshly, from the alert and the relevant historical context?"

Architecture

TODO: cover the major pieces:

Inputs — alert definitions (Prometheus / Grafana Alerting / Nobl9 alert specs), historical postmortems (ServiceNow KB? Confluence? Notion? markdown in git?), service catalog metadata
Context assembly — how related postmortems get retrieved (RAG? tag-based lookup? both?), how the alert spec gets parsed
LLM — which model (and why — this is a "low frequency, high stakes" use case where quality matters more than cost)
Output format — markdown? structured JSON the PagerDuty UI can consume? Both?
Where it runs — on-demand at alert time, or pre-generated and refreshed? Latency budget?
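
One way to picture the "tag-based lookup" option for context assembly is the sketch below. It is a minimal illustration under stated assumptions, not the shipped design: the alert is assumed to be a Prometheus-style rule already parsed into a dict (with `alert` and `labels` keys), and the `Postmortem` record, `alert_tags`, `rank_postmortems`, `max_age_days`, and `top_k` are hypothetical names invented for this example. The linear freshness discount is one possible answer to the stale-postmortem concern, chosen here only for simplicity.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Postmortem:
    # Hypothetical record shape; the real corpus format is still an open TODO.
    title: str
    tags: frozenset[str]
    written: date


def alert_tags(alert: dict) -> frozenset[str]:
    """Flatten a parsed Prometheus-style alert rule into a set of tags."""
    labels = alert.get("labels", {})
    return frozenset({alert["alert"], *labels.values()})


def rank_postmortems(
    alert: dict,
    corpus: list[Postmortem],
    today: date,
    max_age_days: int = 365,  # assumption: anything older is treated as stale
    top_k: int = 3,
) -> list[Postmortem]:
    """Rank postmortems by tag overlap, discounted linearly by age."""
    wanted = alert_tags(alert)
    scored = []
    for pm in corpus:
        age = max(0, (today - pm.written).days)
        overlap = len(wanted & pm.tags)
        if overlap == 0 or age >= max_age_days:
            continue  # unrelated or stale: keep it out of the LLM context
        freshness = 1.0 - age / max_age_days
        scored.append((overlap * freshness, pm))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [pm for _, pm in scored[:top_k]]
```

A RAG retriever could replace or complement the overlap score; the hard age cutoff and the `top_k` cap would still apply as a filter on whatever comes back.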

An architecture diagram here would be high-leverage.

What's working

TODO: bullets on shipped behavior:

Generation time: X seconds end-to-end
Coverage: X% of active alerts have generated runbooks
Quality bar: on-call SREs rate them "useful" Y% of the time (if you have eval data)
Adoption: who's using it, how often

What's still hard

TODO: honest section on open problems:

Stale postmortems poisoning context — how to filter?
Hallucinated commands (the SRE worst case — runbook says "run kubectl delete pod -n prod ..." and it's wrong). Mitigation?
New services with no postmortem history — what does the system do when there's nothing to retrieve?
Trust calibration — how does the SRE know when to trust the runbook vs. fall back to first principles?
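
For the hallucinated-command problem, one possible mitigation is a post-generation guard that pulls every command out of the generated runbook and flags the destructive ones for human confirmation. The sketch below assumes the runbook is markdown with commands in fenced code blocks; `DESTRUCTIVE_PATTERNS`, `extract_commands`, and `risky_commands` are hypothetical names, and the deny-list is illustrative rather than a vetted policy.

```python
import re

# Hypothetical deny-list; a real one would need review by the SRE team.
DESTRUCTIVE_PATTERNS = [
    re.compile(r"\bkubectl\s+(delete|drain|cordon)\b"),
    re.compile(r"\bterraform\s+(destroy|apply)\b"),
    re.compile(r"\brm\s+-\w*r"),  # recursive rm in any flag order
    re.compile(r"\bdrop\s+(table|database)\b", re.IGNORECASE),
]

FENCE = "`" * 3  # markdown code-fence marker, built indirectly for readability


def extract_commands(runbook_md: str) -> list[str]:
    """Return the non-empty lines found inside fenced code blocks."""
    commands, in_block = [], False
    for line in runbook_md.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            in_block = not in_block
            continue
        if in_block and stripped:
            commands.append(stripped)
    return commands


def risky_commands(runbook_md: str) -> list[str]:
    """Return the extracted commands that match any destructive pattern."""
    return [
        cmd
        for cmd in extract_commands(runbook_md)
        if any(p.search(cmd) for p in DESTRUCTIVE_PATTERNS)
    ]
```

A guard like this only catches commands that look dangerous. It cannot tell whether a read-only command is the right one to run, so it narrows the worst case without answering the trust-calibration question.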

Why this matters beyond Merck

TODO: the generalization paragraph. What's the broader insight? Examples:

LLM-generated runbooks as a category — what's the right shape across orgs?
Postmortem corpus as ML training data — implications for incident review hygiene
The cost calculus of LLM-at-incident-time vs. pre-generated cached runbooks
How this informs the "AI for SREs" conversation generally
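
The cost-calculus bullet can be made concrete with a small model. This is a back-of-the-envelope sketch, assuming a single generation costs the same whether it runs at alert time or in a scheduled refresh; every function and parameter name is hypothetical, and no real figures are implied.

```python
def on_demand_monthly_cost(firings_per_month: int, cost_per_generation: float) -> float:
    """Generate a runbook each time an alert fires."""
    return firings_per_month * cost_per_generation


def pregenerated_monthly_cost(
    active_alerts: int, refreshes_per_month: int, cost_per_generation: float
) -> float:
    """Generate a runbook for every active alert on a fixed refresh schedule."""
    return active_alerts * refreshes_per_month * cost_per_generation


def break_even_firings(active_alerts: int, refreshes_per_month: int) -> int:
    """Firings per month at which both strategies cost the same.

    Under the equal-cost assumption the per-generation price cancels out,
    so the break-even depends only on alert count and refresh cadence.
    """
    return active_alerts * refreshes_per_month
```

Below the break-even, on-demand generation spends less; above it, pre-generation does. The model deliberately ignores latency, which may matter more than spend when the on-call engineer is waiting on the output.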

Tech stack

TODO: replace with actual stack:

Language: Python (probably)
LLM:
RAG store:
Trigger: cron? on-alert? both?
Integration points: PagerDuty? ServiceNow? Slack?
Hosting: where it runs at Merck

Adoption + impact

TODO: if there are real numbers, drop them here. Examples:

Time-to-first-action: down from X min to Y min
Postmortems referenced per incident: up from N to M
Engineer-hours saved per quarter: estimate

If no numbers yet, say so — "rolled out to N teams, eval in progress" is honest and fine.

Status

Internal Merck tooling — not open source. Patterns are generalizable and worth writing about; specifics are not for public sharing.