Internal · 2026 · Designer + engineer

SRE Runbook Generator

Internal tooling that generates incident runbooks from alert definitions and historical postmortems, to cut time-to-first-action when an SRE is paged into a system they don't know cold.

Python · LLM · SRE Tooling · Automation

Note from Alex: This is a scaffold. Sections marked TODO are where the real technical specifics and outcomes go. The shape is right — fill in the substance when there's time.

The problem this solves

TODO: 2–3 paragraphs setting up the pain. Beats to hit:

  • Multi-cloud GenAI platforms at Merck — the surface area no single SRE knows cold.
  • The pager-at-3am moment: alert fires, on-call has to find the runbook, the runbook is stale or missing, the related postmortems are in a different system, time-to-first-action stretches.
  • "What if the runbook just generated itself, freshly, from the alert and the relevant historical context?"

Architecture

TODO: cover the major pieces:

  • Inputs — alert definitions (Prometheus / Grafana Alerting / Nobl9 alert specs), historical postmortems (ServiceNow KB? Confluence? Notion? markdown in git?), service catalog metadata
  • Context assembly — how related postmortems get retrieved (RAG? tag-based lookup? both?), how the alert spec gets parsed
  • LLM — which model (and why — this is a "low frequency, high stakes" use case where quality matters more than cost)
  • Output format — markdown? structured JSON the PagerDuty UI can consume? Both?
  • Where it runs — on-demand at alert time, or pre-generated and refreshed? Latency budget?

Architecture diagram here would be high-leverage.
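
Until the real design is written up, the context-assembly step can be illustrated with a minimal sketch of the tag-based lookup option. Everything here is a placeholder assumption, not the actual implementation: the `Alert` and `Postmortem` shapes, the `related_postmortems` and `build_context` helpers, and the overlap-then-recency ranking are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Postmortem:
    # Hypothetical shape; the real postmortem store is still a TODO.
    title: str
    service: str
    tags: set[str]
    written: date


@dataclass
class Alert:
    # Hypothetical shape; stands in for a parsed alert definition.
    name: str
    service: str
    labels: dict[str, str] = field(default_factory=dict)


def related_postmortems(alert, corpus, limit=3):
    """Rank postmortems by tag overlap with the alert; newer wins ties."""
    alert_tags = set(alert.labels.values()) | {alert.service}
    scored = []
    for pm in corpus:
        overlap = len(alert_tags & (pm.tags | {pm.service}))
        if overlap:  # skip postmortems that share nothing with the alert
            scored.append((overlap, pm.written, pm))
    scored.sort(key=lambda item: (item[0], item[1]), reverse=True)
    return [pm for _, _, pm in scored[:limit]]


def build_context(alert, corpus):
    """Assemble the plain-text context that would be handed to the LLM."""
    lines = [f"Alert: {alert.name} (service: {alert.service})"]
    for pm in related_postmortems(alert, corpus):
        lines.append(f"- {pm.written.isoformat()} {pm.title}")
    return "\n".join(lines)
```

A RAG-based retriever could replace `related_postmortems` without changing `build_context`, which is one reason to keep retrieval and prompt assembly as separate steps.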

What's working

TODO: bullets on shipped behavior:

  • Generation time: X seconds end-to-end
  • Coverage: X% of active alerts have generated runbooks
  • Quality bar: on-call SREs rate them "useful" Y% of the time (if you have eval data)
  • Adoption: who's using it, how often

What's still hard

TODO: honest section on open problems:

  • Stale postmortems poisoning context — how to filter?
  • Hallucinated commands (the SRE worst case — runbook says "run kubectl delete pod -n prod ..." and it's wrong). Mitigation?
  • New services with no postmortem history — what does the system do when there's nothing to retrieve?
  • Trust calibration — how does the SRE know when to trust the runbook vs. fall back to first principles?
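
Two of the problems above have an obvious cheap first mitigation, sketched here as a starting point rather than a settled answer: drop postmortems older than a cutoff before retrieval, and flag any generated line that looks destructive so it is shown with a warning instead of as a plain step. The function names, the 365-day default, and the pattern list are assumptions for illustration; a regex denylist will miss cases and is not a substitute for human review.

```python
import re
from datetime import date

# Illustrative denylist only; a real list would be reviewed by the SRE team.
DESTRUCTIVE_PATTERNS = [
    r"\bkubectl\s+delete\b",
    r"\brm\s+-\w*[rf]\w*",
    r"\bdrop\s+(table|database)\b",
    r"\bterraform\s+destroy\b",
]
_DESTRUCTIVE = re.compile("|".join(DESTRUCTIVE_PATTERNS), re.IGNORECASE)


def drop_stale(postmortems, today, max_age_days=365):
    """Keep (title, written_date) pairs no older than max_age_days."""
    return [
        (title, written)
        for title, written in postmortems
        if (today - written).days <= max_age_days
    ]


def flag_destructive_lines(runbook_text):
    """Return (line_number, line) pairs that contain a destructive-looking command."""
    return [
        (number, line.strip())
        for number, line in enumerate(runbook_text.splitlines(), start=1)
        if _DESTRUCTIVE.search(line)
    ]


runbook = (
    "1. Check pod status: kubectl get pods -n prod\n"
    "2. kubectl delete pod api-0 -n prod\n"
    "3. Restart the deployment"
)
flagged = flag_destructive_lines(runbook)
```

Flagging rather than deleting keeps the step visible, so the on-call can still decide to run it after checking it against the live system.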

Why this matters beyond Merck

TODO: the generalization paragraph. What's the broader insight? Examples:

  • LLM-generated runbooks as a category — what's the right shape across orgs?
  • Postmortem corpus as ML training data — implications for incident review hygiene
  • The cost calculus of LLM-at-incident-time vs. pre-generated cached runbooks
  • How this informs the "AI for SREs" conversation generally

Tech stack

TODO: replace with actual stack:

  • Language: Python (probably)
  • LLM:
  • RAG store:
  • Trigger: cron? on-alert? both?
  • Integration points: PagerDuty? ServiceNow? Slack?
  • Hosting: where it runs at Merck

Adoption + impact

TODO: if there are real numbers, drop them here. Examples:

  • Time-to-first-action: down from X min to Y min
  • Postmortems referenced per incident: up from N to M
  • Engineer-hours saved per quarter: estimate

If no numbers yet, say so — "rolled out to N teams, eval in progress" is honest and fine.

Status

Internal Merck tooling — not open source. Patterns are generalizable and worth writing about; specifics are not for public sharing.