Loading...

Claude Code for On-Call Ops

When a service breaks in the middle of the night, engineers often wake up to a vague page that simply says “Service is broken.” To address this, Shrivu Shankar developed Claude Code for On-Call Ops, a tool that automatically connects to CloudWatch and Prometheus, executes the correct queries, and generates a clear diagnostic report. The result: engineers can respond quickly without having to juggle multiple dashboards or query syntaxes.

Shrivu Shankar

September 30, 2025

Shrivu x On Call thumbnail v6 alt

NOTE: Demo visuals use either blurred real data or synthetic placeholders to protect customer privacy.

On-Call Pain Points at Night

Shrivu started from a real frustration: alerts are usually too generic to be actionable. That forces engineers into a scavenger hunt through monitoring tools, logs, and runbooks, often under time pressure.

  • Alerts are vague, leaving engineers unsure where to start.

  • Logs and metrics live in multiple systems, each with unique query formats.

  • Dashboards and runbooks are hard to locate when every minute matters.

These gaps slow recovery and add stress, especially for newer engineers carrying the pager.

Direct Path from Issue to Cause

Claude Code for On-Call Ops shortens that workflow. Instead of pulling up multiple tools, engineers can simply point it to a service name. The system automatically identifies the correct logs, executes queries in their native syntax, and presents the results, as Shrivu explained, “I don’t even tell it where to look. That’s actually a huge thing of just knowing that this name corresponds to something specific pieces of infrastructure.”

Key capabilities include:

  • Fetching and parsing CloudWatch logs by service identity

  • Querying Prometheus counters in their native syntax

  • Detecting recurring errors and tallying frequency

  • Summarizing the impact and root cause for engineers

  • Suggesting fixes or even drafting PRs to resolve issues

Prometheus Queries

Terminal showing Claude running MCP Prometheus queries to fetch agent runner metrics, tasks, and error rates.

Faster Diagnosis, Calmer On-Call

Shrivu has been dogfooding the solution in his own on-call shifts. The results: he often doesn’t need dashboards or runbooks at all. By cutting straight to diagnosis, he spends less time interpreting vague alerts and more time fixing issues.

Benefits observed so far:

  • Significant reduction in manual querying across tools

  • Easier onboarding for junior engineers to handle incidents

  • Consolidated view of errors, their frequency, and impact

  • Ability to move directly into fixes without setup overhead

The logical next step is to weave this into standard on-call playbooks, so every engineer starts with a prebuilt diagnostic summary instead of a blank screen.

Early Users See Real Leverage

Peers testing Claude Code for On-Call Ops noted how much cognitive load it removes. One early user shared that they could bypass dashboards entirely and jump into resolving the issue.

This reflection signals cultural fit: engineers embrace AI that takes work off their plate. The next step is scaling adoption across teams, aligning with Abnormal’s mission to give engineers leverage during their most stressful moments.

Problem

On-call triage is slow and mentally draining when customer issues strike.

Solution

Claude Code auto-aggregates logs and metrics into a clear summary.

Why It’s Cool

Engineers cut triage time from 30 minutes to minutes, with less stress.

Technologies used:

  • Claude
Loading...