Skip to main content

Runbook Automation Overview

Harness Incident Response (IR) allows teams to automate incident resolution by leveraging Runbook automation. Runbooks provide predefined workflows that execute automated actions based on specific triggers.

Key Features

Automated Response Actions

  • Instant incident communication
  • Automated remediation steps
  • Multi-channel notifications
  • Integrated team collaboration

Integration Ecosystem

  • Communication Tools

  • Ticketing Systems

Getting Started

  1. Create Your First Runbook

    • Design workflows
    • Configure triggers
    • Test and deploy
  2. Configure Authentication

    • Set up integration access
    • Manage permissions
    • Secure your runbooks
  3. Configure Incident Fields

    • Define custom fields
    • Set up field mapping
    • Configure templates

Example Runbook Templates

1. High CPU Usage Response

Purpose: Automated response to CPU spikes

Trigger Configuration:

  • Alert Type: Datadog
  • Metric: CPU Usage
  • Threshold: > 90%
  • Duration: 5 minutes

Action Steps:

  1. Initial Alert

    • Action Type: Slack
    • Channel: #sre-alerts
    • Message: "🚨 High CPU Alert: [service] CPU usage > 90% for 5 minutes"
  2. Create Incident Bridge

    • Action Type: Zoom
    • Operation: Create Meeting
    • Name: "CPU Incident - [service]"
    • Participants: SRE Team
  3. Automated Remediation

    • Action Type: Harness Pipeline
    • Pipeline: Scale Service Pipeline
    • Input Variables:
      • Service: [service]
      • Replicas: +2
  4. On-Call Notification

    • Action Type: PagerDuty
    • Priority: High
    • Assignee: SRE On-Call
    • Details: "High CPU incident - Scaling pipeline initiated"

2. Database Connection Alert

Purpose: Multi-channel incident response coordination

Trigger Configuration:

  • Alert Type: Grafana Incident
  • Service: Database
  • Condition: Connection Timeout
  • Priority: High

Action Steps:

  1. Create Teams Channel

    • Action Type: Microsoft Teams
    • Operation: Create Channel
    • Name: "db-incident-[timestamp]"
    • Add Team: Database Support
  2. Execute Recovery

    • Action Type: Jenkins
    • Job: DB-Recovery-Job
    • Parameters:
      • Service: [database_service]
      • Action: restart_connections
  3. Status Update

    • Action Type: OpsGenie
    • Operation: Create Alert
    • Team: Database
    • Message: "🔴 DB Connection Issues - Recovery Job Status: [jenkins.status]"
  4. Incident Management

    • Action Type: Grafana Incident
    • Operation: Update
    • Status: Investigating
    • Note: "Recovery procedures initiated via Jenkins"

3. API Error Rate Response

Purpose: Feature management and incident coordination

Trigger Configuration:

  • Alert Type: Datadog
  • Metric: Error Rate
  • Threshold: > 5%
  • Time Window: 5 minutes

Action Steps:

  1. Feature Control

    • Action Type: Split
    • Operation: Disable Feature
    • Feature Name: new_api_version
    • Environment: production
  2. Pipeline Execution

    • Action Type: GitHub Actions
    • Workflow: api-recovery
    • Inputs:
      • service: [service]
      • action: rollback
  3. Team Communication

    • Action Type: Slack
    • Channel: #api-incidents
    • Message: "⚠️ API Error Rate Incident\n- Feature flag disabled\n- Recovery workflow status: [github.status]"
  4. Stakeholder Bridge

    • Action Type: Zoom
    • Operation: Create Meeting
    • Name: "API Incident Bridge"
    • Participants: ["@api-team", "@product"]

Next Steps

Documentation

Integration Guides