Runbook Automation Overview

Harness Incident Response (IR) allows teams to automate incident resolution by leveraging Runbook automation. Runbooks provide predefined workflows that execute automated actions based on specific triggers.

Key Features

Automated Response Actions

Instant incident communication
Automated remediation steps
Multi-channel notifications
Integrated team collaboration

Integration Ecosystem

Communication Tools
- Slack - Channel management and notifications
- Microsoft Teams - Team collaboration
- Zoom - Incident bridges
Ticketing Systems
- Jira - Issue tracking and updates
- ServiceNow - Incident management

Getting Started

Create Your First Runbook
- Design workflows
- Configure triggers
- Test and deploy
Configure Authentication
- Set up integration access
- Manage permissions
- Secure your runbooks
Configure Incident Fields
- Define custom fields
- Set up field mapping
- Configure templates

Example Runbook Templates

1. High CPU Usage Response

Purpose: Automated response to CPU spikes

Trigger Configuration:

Alert Type: Datadog
Metric: CPU Usage
Threshold: > 90%
Duration: 5 minutes

Action Steps:

Initial Alert
- Action Type: Slack
- Channel: #sre-alerts
- Message: "🚨 High CPU Alert: [service] CPU usage > 90% for 5 minutes"
Create Incident Bridge
- Action Type: Zoom
- Operation: Create Meeting
- Name: "CPU Incident - [service]"
- Participants: SRE Team
Automated Remediation
- Action Type: Harness Pipeline
- Pipeline: Scale Service Pipeline
- Input Variables:
  - Service: [service]
  - Replicas: +2
On-Call Notification
- Action Type: PagerDuty
- Priority: High
- Assignee: SRE On-Call
- Details: "High CPU incident - Scaling pipeline initiated"

2. Database Connection Alert

Purpose: Multi-channel incident response coordination

Trigger Configuration:

Alert Type: Grafana Incident
Service: Database
Condition: Connection Timeout
Priority: High

Action Steps:

Create Teams Channel
- Action Type: Microsoft Teams
- Operation: Create Channel
- Name: "db-incident-[timestamp]"
- Add Team: Database Support
Execute Recovery
- Action Type: Jenkins
- Job: DB-Recovery-Job
- Parameters:
  - Service: [database_service]
  - Action: restart_connections
Status Update
- Action Type: OpsGenie
- Operation: Create Alert
- Team: Database
- Message: "🔴 DB Connection Issues - Recovery Job Status: [jenkins.status]"
Incident Management
- Action Type: Grafana Incident
- Operation: Update
- Status: Investigating
- Note: "Recovery procedures initiated via Jenkins"

3. API Error Rate Response

Purpose: Feature management and incident coordination

Trigger Configuration:

Alert Type: Datadog
Metric: Error Rate
Threshold: > 5%
Time Window: 5 minutes

Action Steps:

Feature Control
- Action Type: Split
- Operation: Disable Feature
- Feature Name: new_api_version
- Environment: production
Pipeline Execution
- Action Type: GitHub Actions
- Workflow: api-recovery
- Inputs:
  - service: [service]
  - action: rollback
Team Communication
- Action Type: Slack
- Channel: #api-incidents
- Message: "⚠️ API Error Rate Incident\n- Feature flag disabled\n- Recovery workflow status: [github.status]"
Stakeholder Bridge
- Action Type: Zoom
- Operation: Create Meeting
- Name: "API Incident Bridge"
- Participants: ["@api-team", "@product"]

Next Steps

Documentation

Integration Guides

Communication Tools
Ticketing Systems
- Jira Integration
- ServiceNow Integration

Key Features​

Automated Response Actions​

Integration Ecosystem​

Getting Started​

Example Runbook Templates​

1. High CPU Usage Response​

2. Database Connection Alert​

3. API Error Rate Response​

Next Steps​

Documentation​

Integration Guides​

Key Features

Automated Response Actions

Integration Ecosystem

Getting Started

Example Runbook Templates

1. High CPU Usage Response

2. Database Connection Alert

3. API Error Rate Response

Next Steps

Documentation

Integration Guides