Runbook Automation Overview
Harness Incident Response (IR) allows teams to automate incident resolution by leveraging Runbook automation. Runbooks provide predefined workflows that execute automated actions based on specific triggers.
Key Features
Automated Response Actions
- Instant incident communication
- Automated remediation steps
- Multi-channel notifications
- Integrated team collaboration
Integration Ecosystem
-
Communication Tools
- Slack - Channel management and notifications
- Microsoft Teams - Team collaboration
- Zoom - Incident bridges
-
Ticketing Systems
- Jira - Issue tracking and updates
- ServiceNow - Incident management
Getting Started
-
- Design workflows
- Configure triggers
- Test and deploy
-
- Set up integration access
- Manage permissions
- Secure your runbooks
-
- Define custom fields
- Set up field mapping
- Configure templates
Example Runbook Templates
1. High CPU Usage Response
Purpose: Automated response to CPU spikes
Trigger Configuration:
- Alert Type: Datadog
- Metric: CPU Usage
- Threshold: > 90%
- Duration: 5 minutes
Action Steps:
-
Initial Alert
- Action Type: Slack
- Channel: #sre-alerts
- Message: "🚨 High CPU Alert: [service] CPU usage > 90% for 5 minutes"
-
Create Incident Bridge
- Action Type: Zoom
- Operation: Create Meeting
- Name: "CPU Incident - [service]"
- Participants: SRE Team
-
Automated Remediation
- Action Type: Harness Pipeline
- Pipeline: Scale Service Pipeline
- Input Variables:
- Service: [service]
- Replicas: +2
-
On-Call Notification
- Action Type: PagerDuty
- Priority: High
- Assignee: SRE On-Call
- Details: "High CPU incident - Scaling pipeline initiated"
2. Database Connection Alert
Purpose: Multi-channel incident response coordination
Trigger Configuration:
- Alert Type: Grafana Incident
- Service: Database
- Condition: Connection Timeout
- Priority: High
Action Steps:
-
Create Teams Channel
- Action Type: Microsoft Teams
- Operation: Create Channel
- Name: "db-incident-[timestamp]"
- Add Team: Database Support
-
Execute Recovery
- Action Type: Jenkins
- Job: DB-Recovery-Job
- Parameters:
- Service: [database_service]
- Action: restart_connections
-
Status Update
- Action Type: OpsGenie
- Operation: Create Alert
- Team: Database
- Message: "🔴 DB Connection Issues - Recovery Job Status: [jenkins.status]"
-
Incident Management
- Action Type: Grafana Incident
- Operation: Update
- Status: Investigating
- Note: "Recovery procedures initiated via Jenkins"
3. API Error Rate Response
Purpose: Feature management and incident coordination
Trigger Configuration:
- Alert Type: Datadog
- Metric: Error Rate
- Threshold: > 5%
- Time Window: 5 minutes
Action Steps:
-
Feature Control
- Action Type: Split
- Operation: Disable Feature
- Feature Name: new_api_version
- Environment: production
-
Pipeline Execution
- Action Type: GitHub Actions
- Workflow: api-recovery
- Inputs:
- service: [service]
- action: rollback
-
Team Communication
- Action Type: Slack
- Channel: #api-incidents
- Message: "⚠️ API Error Rate Incident\n- Feature flag disabled\n- Recovery workflow status: [github.status]"
-
Stakeholder Bridge
- Action Type: Zoom
- Operation: Create Meeting
- Name: "API Incident Bridge"
- Participants: ["@api-team", "@product"]
Next Steps
Documentation
Integration Guides
- Communication Tools
- Ticketing Systems