AIOps for Modern IT: Anomaly Detection, Root-Cause, and GenAI Runbooks—What Works in 2025
By CyberDudeBivash • September 21, 2025 (IST)
TL;DR
- Outcomes, not magic: Good AIOps reduces noisy alerts by 60–90%, cuts MTTR, and automates the boring but critical fixes (cache flush, pod recycle, feature flag rollback).
- Three pillars that actually work in 2025:
  - Anomaly detection that understands seasonality & SLOs (multi-signal, not single-metric).
  - Root-cause analysis (RCA) driven by topology + change events (deploys, configs, feature flags).
  - GenAI runbooks that generate step-by-step remediation and execute safely via guardrails + human-in-the-loop (HITL).
- Reference stack: OpenTelemetry → Data Lake/TSDB → Correlation/RCA → GenAI Runbooks → ChatOps & SOAR.
- Start small: Ship “auto-remediate with rollback” for the top 5 failure modes; measure noise compression and toil hours saved weekly.
What AIOps means (in practice) in 2025
AIOps isn't a product; it's a workflow:
- Ingest everything: metrics, logs, traces, events, tickets, feature flags, deploys, configs, cloud bills.
- Detect anomalies in context (service maps, SLOs, recent changes).
- Correlate signals across layers (user impact → service → dependency → infra).
- Explain cause: point to the most suspicious change/hop.
- Generate a fix path: GenAI runbooks produce ordered steps with safety checks, then request approval (or auto-apply within guardrails).
- Learn: capture outcome & feedback; update playbooks and detectors.
Reference architecture
- Collection: OpenTelemetry (metrics/logs/traces), change feeds (Git/CI/CD), config & feature flags, incident/ticket data.
- Storage/Processing: TSDB for time series; searchable log store; graph of services/dependencies; feature/config history.
- Anomaly Engine: seasonal & robust detectors, cardinality-aware; correlates across signals and services.
- RCA Engine: combines service topology + recent changes + blast radius to rank suspected causes.
- GenAI Runbooks: RAG over your wiki/CMDB/playbooks; outputs structured steps; gated execution via SOAR/ChatOps.
- Safety & Governance: guardrails (allowlists, rate limits, approval policies), audit trail, rollback.
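To make the hand-offs concrete, here is a minimal Python sketch of how these stages could be wired together. The stage names and event shapes (`Signal`, `Anomaly`, `Suspect`, `RunbookPlan`) are illustrative assumptions, not a specific product's API:

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative event shapes handed between the stages listed above.
@dataclass
class Signal:                  # anything OpenTelemetry / change feeds emit
    source: str                # "metric" | "log" | "trace" | "change"
    service: str
    name: str
    value: float
    timestamp: float

@dataclass
class Anomaly:                 # output of the anomaly engine
    service: str
    signals: list[Signal]      # convergent evidence, not a single metric
    slo_burn_rate: float

@dataclass
class Suspect:                 # output of the RCA engine
    service: str
    recent_change: str | None  # deploy / flag toggle near T0, if any
    score: float

@dataclass
class RunbookPlan:             # output of the GenAI runbook generator
    incident: str
    steps: list[dict]          # structured steps (see the schema example later)
    requires_approval: bool = True

def run_pipeline(
    detect: Callable[[list[Signal]], list[Anomaly]],
    rank_causes: Callable[[list[Anomaly]], list[Suspect]],
    generate_runbook: Callable[[list[Suspect]], RunbookPlan],
    execute: Callable[[RunbookPlan], None],
    signals: list[Signal],
) -> None:
    """Each architecture box is a swappable function; only the hand-offs matter."""
    anomalies = detect(signals)
    if not anomalies:
        return                         # nothing convergent enough to act on
    suspects = rank_causes(anomalies)
    plan = generate_runbook(suspects)
    execute(plan)                      # gated by ChatOps/SOAR approval in practice
```

The point of the sketch is that each box can be a vendor product or a DIY component, as long as the hand-offs between them stay structured.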
Pillar 1 — Anomaly detection that respects reality
What works
- Seasonality & baselines: weekly cycles, end-of-month spikes, release days. Use seasonal decomposition or robust forecasting to avoid “everything is red on Mondays.”
- Multi-signal correlation: a single p95 latency blip is noise; latency + error rate + saturation + user complaints = signal.
- SLO-aware alerts: detect only when error budget burn is abnormal, not when a noisy metric crosses a static threshold.
- Cardinality control: group related labels, summarize per service/region to avoid detector overload.
Fast wins
- Replace static CPU/latency thresholds with SLO burn alerts.
- Add change-aware detection: anomalies shortly after deploys/config changes get higher weight.
- Promote only convergent anomalies (≥2 signals) to incidents.
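A minimal sketch of those fast wins, assuming a simple error-count window as input; the burn-rate thresholds (a fast/slow multi-window pair), the 1.5× boost, and the 30-minute change window are illustrative, not tuned values:

```python
from dataclasses import dataclass

@dataclass
class Window:
    errors: int      # failed requests in the window
    total: int       # all requests in the window

def burn_rate(window: Window, slo_target: float = 0.999) -> float:
    """Error-budget burn rate: observed error ratio / allowed error ratio."""
    if window.total == 0:
        return 0.0
    return (window.errors / window.total) / (1.0 - slo_target)

def slo_burn_alert(short: Window, long: Window,
                   short_threshold: float = 14.4,
                   long_threshold: float = 6.0) -> bool:
    """Multi-window burn alert: both a short and a long window must be burning
    fast, which filters out the brief blips a static threshold would page on."""
    return (burn_rate(short) >= short_threshold and
            burn_rate(long) >= long_threshold)

def anomaly_weight(anomalous_signals: set[str],
                   minutes_since_last_change: float) -> float:
    """Change-aware weighting: convergent signals score higher, and an anomaly
    shortly after a deploy/config change gets an extra boost."""
    weight = float(len(anomalous_signals))
    if minutes_since_last_change <= 30:
        weight *= 1.5
    return weight

def promote_to_incident(anomalous_signals: set[str],
                        minutes_since_last_change: float) -> bool:
    """Only convergent anomalies (>=2 signals) are promoted at all."""
    return (len(anomalous_signals) >= 2 and
            anomaly_weight(anomalous_signals, minutes_since_last_change) >= 2.0)

# Example: latency + error rate anomalous 5 minutes after a deploy.
assert promote_to_incident({"p95_latency", "error_rate"}, minutes_since_last_change=5)
```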
Pillar 2 — Root-Cause Analysis: topology + recent change
Why teams get RCA wrong: staring at graphs without context.
What works in 2025: a lightweight causal ranking:
- Build/stream a service graph (traces + configs).
- Watch changes (deploys, config toggles, infra mutations) with precise timestamps.
- During an incident, compute blast-radius correlation (which upstream/downstream nodes share anomalies) and check “what changed” near T0.
- Rank suspects: nodes with both anomalies and recent changes, especially if they sit at cut points in the graph (gateways, caches, DBs).
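A minimal sketch of that ranking, using a plain in-memory service graph; the scoring weights and the 30-minute change window are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    anomalous: bool = False
    minutes_since_change: float | None = None   # None = no recent change
    downstream: list[str] = field(default_factory=list)

def rank_suspects(graph: dict[str, Node],
                  t0_window_min: float = 30.0) -> list[tuple[str, float]]:
    """Rank nodes that are both anomalous and recently changed, boosted by
    how much of the anomalous blast radius sits downstream of them."""
    scores: dict[str, float] = {}
    for name, node in graph.items():
        if not node.anomalous:
            continue
        score = 1.0
        # A change near T0 is the strongest single clue.
        if node.minutes_since_change is not None and node.minutes_since_change <= t0_window_min:
            score += 2.0
        # Blast-radius correlation: anomalous downstream nodes point back here.
        score += 0.5 * sum(1 for d in node.downstream if graph[d].anomalous)
        scores[name] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Example: payment-api deployed 6 minutes ago; downstream orders-svc and
# checkout-ui are also anomalous, so payment-api ranks first.
graph = {
    "payment-api": Node("payment-api", anomalous=True, minutes_since_change=6,
                        downstream=["orders-svc", "checkout-ui"]),
    "orders-svc": Node("orders-svc", anomalous=True),
    "checkout-ui": Node("checkout-ui", anomalous=True),
}
print(rank_suspects(graph))   # [('payment-api', 4.0), ...]
```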
Outputs you want
- “Probable root: payment-api v2025.09.21; deployed 6m ago; downstream orders-svc & checkout-ui anomalous; 84% confidence.”
- “Top 3 suspects” + links to diff, logs, traces.
Pillar 3 — GenAI runbooks that actually execute
Great GenAI runbooks are boringly reliable. They:
- Ground themselves in your docs (RAG over wiki/CMDB) and telemetry.
- Emit structured steps (JSON/YAML) with pre-checks and post-checks.
- Call tools (Kubernetes, cloud CLI, feature-flag API) through allowlists and HITL gates.
- Fail safe: timeouts, idempotency, and one-click rollback.
Example schema (trimmed)
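The schema itself was trimmed here; below is a hedged reconstruction of what such a structured plan might look like, shown as a Python dict that would serialize directly to the JSON/YAML the runbook engine emits. All field names and values are illustrative assumptions, not a specific product's schema:

```python
# Illustrative runbook plan: structured steps with pre/post-checks, an
# allowlisted action per step, and an explicit rollback.
runbook_plan = {
    "incident": "inc-checkout-latency",
    "suspected_root": "payment-api v2025.09.21",
    "risk_tier": "low",                       # drives the approval policy
    "requires_approval": True,                # HITL gate before any write
    "steps": [
        {
            "id": 1,
            "action": "k8s.rollout_restart",  # must be on the action allowlist
            "target": {"namespace": "payments", "deployment": "payment-api"},
            "pre_check": "error_rate(payment-api) > 2 * baseline",
            "post_check": "p95_latency(checkout) < slo_target within 10m",
            "timeout_seconds": 600,
            "dry_run": True,                  # emit the plan first, apply on approval
        }
    ],
    "rollback": {
        "action": "deploy.rollback",
        "target": {"service": "payment-api", "to_version": "previous"},
    },
    "audit": {"generated_by": "genai-runbooks", "grounded_in": ["wiki/payments-oncall"]},
}
```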
Safety gates: only approved actions; explicit regions/services; rate limits; dry-run output; audit every step.
Incident flow
- Detector opens #inc-checkout-latency with suspected root + impact.
- GenAI posts the runbook plan (structured) + risk notes.
- On-call clicks Approve or Edit & Approve (HITL).
- Bot executes via SOAR/CLI; posts telemetry before/after; auto-closes the ticket with a summary.
- Post-incident: the plan + evidence are saved as a new pattern; detectors get feedback.
30 / 60 / 90-day rollout
Days 1–30 — Stabilize & prove value
- Inventory the top 5 recurring incidents; document known good fixes.
- Wire OpenTelemetry + change feed (deploys/configs/flags) into one timeline.
- Turn static alerts into SLO burn detectors; enable change-aware weighting.
- Pilot GenAI runbooks for read-only diagnosis (no writes yet).
- Ship one safe auto-remediation (e.g., restart flapping pods with a post-check).
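A hedged sketch of that first safe auto-remediation (restart flapping pods, then post-check against the user-facing SLO), using the official Kubernetes Python client. The namespace, label selector, thresholds, and the `checkout_p95_latency_ms()` helper are placeholders you would replace with your own:

```python
import time
from kubernetes import client, config

NAMESPACE = "payments"
LABEL_SELECTOR = "app=payment-api"
RESTART_THRESHOLD = 5        # restarts before we consider a pod "flapping"
POST_CHECK_SLO_MS = 400      # illustrative p95 target for the post-check

def checkout_p95_latency_ms() -> float:
    """Placeholder: query your TSDB (Prometheus, etc.) for checkout p95."""
    raise NotImplementedError

def flapping_pods(v1: client.CoreV1Api) -> list[str]:
    """Pre-check: find pods whose container restart counts exceed the threshold."""
    pods = v1.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR)
    names = []
    for pod in pods.items:
        restarts = sum(cs.restart_count for cs in (pod.status.container_statuses or []))
        if restarts >= RESTART_THRESHOLD:
            names.append(pod.metadata.name)
    return names

def remediate() -> bool:
    config.load_kube_config()          # in-cluster: config.load_incluster_config()
    v1 = client.CoreV1Api()
    targets = flapping_pods(v1)
    if not targets:
        return True                    # pre-check found nothing to do
    for name in targets:
        # Deleting the pod lets its Deployment recreate it (the "recycle").
        v1.delete_namespaced_pod(name=name, namespace=NAMESPACE)
    time.sleep(120)                    # give replacements time to warm up
    # Post-check: the fix must be verified against the user-facing SLO.
    return checkout_p95_latency_ms() < POST_CHECK_SLO_MS
```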
Days 31–60 — Harden & automate
- Add service graph + blast-radius RCA; make “what changed?” mandatory in every incident.
- Expand runbooks to two-step actions (scale → verify, toggle feature → verify) with rollback.
- Start a noise review each week; kill low-value alerts; track the noise compression ratio.
Days 61–90 — Operate & measure
- Enforce HITL policies per risk tier; allow auto-approve for low-risk, well-tested actions.
- Publish a KPI dashboard (below) to execs/SRE; iterate monthly.
- Document guardrails (allowlists, budgets, blackout windows); drill failure scenarios.
KPIs that matter (and how to compute them)
- Noise compression (%) = 1 − (alerts reaching humans / total raw alerts). Target >70%.
- MTTA / MTTR p50/p90. Trend down monthly.
- Anomaly precision (%) = true incidents / anomalies promoted. Target >60% after tuning.
- Auto-remediation rate (%) = incidents resolved without human commands / total incidents. Start >15%, grow to >40%.
- Toil hours saved = (tickets auto-handled × avg minutes per ticket) / 60.
- Change-linked incidents (%): should be high; it means you can see the cause.
- Error budget burn prevented: minutes/hours of avoided SLO violations after remediation.
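The arithmetic is simple enough to script; a small sketch with illustrative sample numbers:

```python
# KPI math from the list above; the sample figures are made up for illustration.
def noise_compression(raw_alerts: int, alerts_to_humans: int) -> float:
    return 1 - alerts_to_humans / raw_alerts

def anomaly_precision(true_incidents: int, promoted_anomalies: int) -> float:
    return true_incidents / promoted_anomalies

def auto_remediation_rate(auto_resolved: int, total_incidents: int) -> float:
    return auto_resolved / total_incidents

def toil_hours_saved(tickets_auto_handled: int, avg_minutes_per_ticket: float) -> float:
    return tickets_auto_handled * avg_minutes_per_ticket / 60

print(f"{noise_compression(12_000, 1_800):.0%}")        # 85% -> beats the >70% target
print(f"{anomaly_precision(42, 60):.0%}")               # 70%
print(f"{auto_remediation_rate(19, 90):.0%}")           # 21%
print(f"{toil_hours_saved(140, 18):.0f} hours")         # 42 hours
```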
Buyer’s briefing (platform vs DIY)
- Platform-first (observability + AIOps suite): fastest to value, tight integrations, opinionated RCA; risk of lock-in.
- DIY/composable (OTel + TSDB + rule engine + LLM + SOAR): control & cost leverage; more engineering.
Minimum requirements regardless of vendor
- Native OpenTelemetry support; SLO-aware detection; change-aware correlation.
- Topology/RCA that ingests traces + config/feature events.
- GenAI runbooks with RAG over your docs, structured actions, guardrails, HITL, and a full audit trail.
- Cost & cardinality controls (high-cardinality metrics, log sampling, storage lifecycle).
- Clear export paths (webhooks, SOAR, chat, ITSM).
Common pitfalls
- Metric monomania: single-signal detectors create noise. Always correlate ≥2 signals + SLO context.
- No change feed: RCA without deploy/config/flag events is guesswork.
- Unbounded GenAI: free-form shell commands are a breach waiting to happen. Use allowlists and structured outputs (see the sketch after this list).
- Skipping post-checks: every “fix” must verify impact on user SLOs.
- Forgetting people: announce policies, clarify HITL rules, and train on-call engineers in the new flow.
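A minimal sketch of the allowlist + structured-output guardrail from the “Unbounded GenAI” pitfall above: the generated plan is treated as data, and every step must match an allowlisted action and scope before anything runs. The action names and scopes are illustrative:

```python
# Only these actions, in these scopes, can ever execute; everything else is
# rejected as data before it reaches any tool.
ALLOWED_ACTIONS = {
    "k8s.rollout_restart": {"namespaces": {"payments", "checkout"}},
    "flag.disable":        {"services":   {"checkout-ui"}},
    "deploy.rollback":     {"services":   {"payment-api"}},
}

class GuardrailViolation(Exception):
    pass

def validate_step(step: dict) -> None:
    action = step.get("action")
    if action not in ALLOWED_ACTIONS:
        raise GuardrailViolation(f"action not allowlisted: {action!r}")
    scope = ALLOWED_ACTIONS[action]
    target = step.get("target", {})
    if "namespaces" in scope and target.get("namespace") not in scope["namespaces"]:
        raise GuardrailViolation(f"namespace out of scope: {target.get('namespace')!r}")
    if "services" in scope and target.get("service") not in scope["services"]:
        raise GuardrailViolation(f"service out of scope: {target.get('service')!r}")
    if not step.get("post_check"):
        raise GuardrailViolation("every step needs a post_check")

# A free-form shell command never gets this far: there is no "shell.run"
# entry in ALLOWED_ACTIONS, so such a step is rejected outright.
validate_step({
    "action": "k8s.rollout_restart",
    "target": {"namespace": "payments", "deployment": "payment-api"},
    "post_check": "p95_latency(checkout) < slo_target within 10m",
})
```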
Operating runbooks
- Cache saturation: detect hit-ratio drop + 5xx → flush and warm up the cache → verify latency & miss rate.
- Hot shard / noisy neighbor: detect skewed partition latency → shift traffic / scale the shard → verify.
- Bad deploy: detect post-deploy error spike → feature-flag rollback or version rollback → verify SLO.
- Pod crash loop: detect restart storms → cordon/drain the node or recycle the deployment → verify.
- External dependency slowness: detect upstream p95 blowout → circuit breaker → degrade gracefully → verify.
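One way to keep these patterns actionable is to encode them as a small detect → act → verify catalog that the runbook generator can ground itself in; the detector expressions and action names below are illustrative:

```python
# The five patterns above as data; each entry pairs a detection condition with
# allowlisted actions and the verification that must pass afterwards.
OPERATING_RUNBOOKS = {
    "cache-saturation": {
        "detect": "cache_hit_ratio drop AND 5xx rate up",
        "act": ["cache.flush", "cache.warmup"],
        "verify": "p95_latency and miss_rate back within SLO",
    },
    "hot-shard": {
        "detect": "one partition's latency >> sibling partitions",
        "act": ["traffic.shift", "shard.scale"],
        "verify": "per-partition latency skew below threshold",
    },
    "bad-deploy": {
        "detect": "error spike shortly after a deploy",
        "act": ["flag.disable", "deploy.rollback"],
        "verify": "error-budget burn back to baseline",
    },
    "pod-crash-loop": {
        "detect": "restart storm on one deployment or node",
        "act": ["node.cordon_drain", "deployment.recycle"],
        "verify": "restart count flat for 10m",
    },
    "upstream-slowness": {
        "detect": "dependency p95 blowout",
        "act": ["circuit_breaker.open", "degrade.gracefully"],
        "verify": "user-facing SLO holds in degraded mode",
    },
}
```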
Security & governance for AIOps
- Least privilege: remediation bots use scoped service accounts; no wildcard permissions.
- Change windows & blast-radius caps: deny risky actions during blackout; limit concurrent remediations per cluster.
- Approvals matrix: auto-approve low-risk; HITL for writes to prod data; two-person rule for high impact.
- Full audit: capture prompts, plans, commands, and telemetry before & after.
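A minimal sketch of a risk-tiered approvals matrix with blackout-window handling; the tiers and rules are illustrative policy knobs, not a standard:

```python
from enum import Enum

class Risk(Enum):
    LOW = "low"        # read-only or well-tested, reversible actions
    MEDIUM = "medium"  # writes to prod infra (restart, scale, flag toggle)
    HIGH = "high"      # writes to prod data, schema changes, multi-region

APPROVAL_POLICY = {
    Risk.LOW:    {"auto_approve": True,  "approvers_required": 0},
    Risk.MEDIUM: {"auto_approve": False, "approvers_required": 1},  # HITL
    Risk.HIGH:   {"auto_approve": False, "approvers_required": 2},  # two-person rule
}

def may_execute(risk: Risk, approvals: int, in_blackout_window: bool) -> bool:
    if in_blackout_window and risk is not Risk.LOW:
        return False                           # change windows cap the blast radius
    policy = APPROVAL_POLICY[risk]
    return policy["auto_approve"] or approvals >= policy["approvers_required"]

assert may_execute(Risk.LOW, approvals=0, in_blackout_window=False)
assert not may_execute(Risk.HIGH, approvals=1, in_blackout_window=False)
```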
“Show me it works” — a tiny, practical pilot
- Pick one service with clear SLOs and noisy alerts.
- Add a change feed from CI/CD + feature flags.
- Build a GenAI runbook that reads logs/traces, proposes one safe action with verify + rollback, and requires HITL.
- Run for two weeks; publish noise compression, MTTR delta, auto-handled count, and hours saved. Use those numbers to scale.