Runbooks
Create, manage, and share automated runbooks that codify your team's database operations expertise and enable consistent, reliable execution of complex procedures.
Creating Custom Runbooks
Runbooks are automated workflows that combine multiple database operations into a single executable procedure. They help standardize operations, reduce errors, and enable junior team members to execute complex tasks safely.
Basic Runbook Structure
{
"name": "Handle Slow Query Incident",
"description": "Standard procedure for investigating and resolving slow query issues",
"version": "1.2.0",
"author": "DBA Team",
"tags": ["performance", "incident-response"],
"parameters": {
"database": {
"type": "string",
"required": true,
"description": "Target database identifier"
},
"threshold_ms": {
"type": "number",
"default": 1000,
"description": "Query duration threshold in milliseconds"
}
},
"steps": [
{
"name": "Identify Slow Queries",
"action": "query_analysis",
"parameters": {
"duration_threshold": "{{threshold_ms}}",
"limit": 10
},
"output": "slow_queries"
},
{
"name": "Check System Resources",
"action": "health_check",
"parameters": {
"metrics": ["cpu", "memory", "disk_io"]
},
"output": "system_status"
},
{
"name": "Generate Recommendations",
"action": "optimize_queries",
"parameters": {
"queries": "{{slow_queries}}",
"system_status": "{{system_status}}"
},
"output": "recommendations"
}
],
"notifications": {
"on_start": {
"channels": ["slack"],
"message": "Runbook started: {{name}}"
},
"on_complete": {
"channels": ["slack", "email"],
"include_report": true
}
}
}Step Types
- • Query Execution: Run SQL queries and scripts
- • Analysis: Perform performance analysis
- • Health Checks: Verify system health
- • Optimization: Apply optimizations
- • Maintenance: Run maintenance tasks
- • Notification: Send alerts and reports
Control Flow
- • Sequential: Steps run in order
- • Conditional: Branch based on conditions
- • Parallel: Run steps concurrently
- • Loops: Iterate over data sets
- • Error Handling: Retry and fallback logic
- • Approval Gates: Require human approval
Runbook Templates
Start quickly with pre-built templates for common database operations. Templates can be customized to fit your specific needs.
Database Failover
Automated failover procedure with health checks and verification.
Steps:
- 1. Verify replica health and lag
- 2. Stop writes to primary
- 3. Promote replica to primary
- 4. Update application config
- 5. Resume writes
- 6. Verify application health
Performance Investigation
Comprehensive performance analysis and optimization workflow.
Steps:
- 1. Collect slow query logs
- 2. Analyze query execution plans
- 3. Check index usage
- 4. Identify missing indexes
- 5. Generate optimization report
- 6. Apply recommended indexes
Schema Migration
Safe schema change deployment with rollback capability.
Steps:
- 1. Backup current schema
- 2. Validate migration script
- 3. Apply changes in transaction
- 4. Run smoke tests
- 5. Monitor for errors
- 6. Rollback if issues detected
Capacity Planning
Automated capacity analysis and growth forecasting.
Steps:
- 1. Collect usage metrics
- 2. Analyze growth trends
- 3. Forecast future needs
- 4. Identify bottlenecks
- 5. Generate scaling recommendations
- 6. Create budget estimates
Triggering Runbooks Automatically
Runbooks can be triggered automatically based on alerts, schedules, or custom conditions, enabling proactive database management.
Alert-Triggered Runbooks
Automatically execute runbooks when specific alerts fire.
{
"alert_triggers": [
{
"alert_name": "High CPU Usage",
"conditions": {
"severity": ["warning", "critical"],
"duration_seconds": 300
},
"runbook": "performance-investigation",
"parameters": {
"database": "{{alert.database}}",
"threshold_ms": 500
},
"require_approval": false
},
{
"alert_name": "Replication Lag",
"conditions": {
"severity": "critical"
},
"runbook": "restart-replication",
"parameters": {
"database": "{{alert.database}}"
},
"require_approval": true,
"approvers": ["@dba-team"]
}
]
}Schedule-Triggered Runbooks
Run runbooks on a regular schedule for maintenance and reporting.
{
"scheduled_runbooks": [
{
"runbook": "weekly-health-check",
"schedule": {
"type": "weekly",
"day": "monday",
"time": "09:00",
"timezone": "America/New_York"
},
"parameters": {
"databases": ["prod-db-01", "prod-db-02"],
"report_recipients": ["[email protected]"]
}
},
{
"runbook": "monthly-capacity-planning",
"schedule": {
"type": "monthly",
"day_of_month": 1,
"time": "10:00",
"timezone": "UTC"
},
"parameters": {
"forecast_months": 6
}
}
]
}Webhook-Triggered Runbooks
Trigger runbooks via API calls from external systems or CI/CD pipelines.
POST /api/v1/runbooks/execute
Authorization: Bearer <api_token>
Content-Type: application/json
{
"runbook": "schema-migration",
"parameters": {
"database": "prod-db-01",
"migration_script": "migrations/20260207_add_user_indexes.sql",
"dry_run": false
},
"metadata": {
"triggered_by": "github-actions",
"commit_sha": "abc123...",
"pr_number": 456
}
}Variables and Conditions
Create dynamic runbooks that adapt to different scenarios using variables and conditional logic.
Variable Types and Usage
{
"variables": {
// Input parameters
"database": "{{input.database}}",
"threshold": "{{input.threshold}}",
// Step outputs
"slow_queries": "{{steps.identify_slow_queries.output}}",
"cpu_usage": "{{steps.check_resources.output.cpu}}",
// System variables
"current_time": "{{system.timestamp}}",
"executor": "{{system.user}}",
// Environment variables
"environment": "{{env.DEPLOYMENT_ENV}}",
"region": "{{env.AWS_REGION}}",
// Computed values
"is_production": "{{environment == 'production'}}",
"is_critical": "{{cpu_usage > 90}}"
}
}Conditional Execution
{
"steps": [
{
"name": "Check CPU Usage",
"action": "get_metric",
"parameters": {
"metric": "cpu_usage"
},
"output": "cpu_usage"
},
{
"name": "Kill Expensive Queries",
"action": "kill_queries",
"condition": "{{cpu_usage > 85}}",
"parameters": {
"min_duration_seconds": 300
}
},
{
"name": "Scale Up Instance",
"action": "scale_resources",
"condition": "{{cpu_usage > 90 && is_production}}",
"require_approval": true,
"parameters": {
"instance_type": "next_tier"
}
},
{
"name": "Send Alert",
"action": "notify",
"condition": "{{cpu_usage > 95}}",
"parameters": {
"severity": "critical",
"channels": ["pagerduty", "slack"]
}
}
]
}Loops and Iteration
Execute steps for each item in a collection.
{
"steps": [
{
"name": "Get All Databases",
"action": "list_databases",
"parameters": {
"environment": "production"
},
"output": "databases"
},
{
"name": "Health Check Each Database",
"action": "health_check",
"for_each": "{{databases}}",
"parameters": {
"database": "{{item}}",
"checks": ["connections", "replication", "disk_space"]
},
"output": "health_results"
},
{
"name": "Generate Report",
"action": "create_report",
"parameters": {
"results": "{{health_results}}",
"format": "html"
}
}
]
}Sharing Runbooks with Team
Build a shared knowledge base by creating, versioning, and sharing runbooks across your organization.
Version Control
Track changes to runbooks with built-in versioning.
- • Semantic versioning (1.2.3)
- • Change history and diffs
- • Rollback to previous versions
- • Version tags and labels
Access Control
Control who can view, edit, and execute runbooks.
- • Role-based permissions
- • Execution approval workflows
- • Audit logs for all actions
- • Team and user assignments
Runbook Library
Organize runbooks in a searchable library.
- • Categories and tags
- • Search and filter
- • Favorites and bookmarks
- • Usage statistics
Execution History
Track all runbook executions for compliance.
- • Complete execution logs
- • Input/output parameters
- • Duration and success rate
- • Error details and debugging
Sharing Configuration
{
"sharing": {
"visibility": "team",
"team_id": "dba-team",
"permissions": {
"view": ["@everyone"],
"execute": ["@dba-team", "@ops-team"],
"edit": ["@senior-dba"],
"delete": ["@dba-lead"]
}
},
"approval_workflow": {
"required_for": ["production"],
"approvers": ["@dba-lead", "@ops-lead"],
"require_all": false,
"timeout": "30m"
},
"documentation": {
"description": "Standard procedure for handling database incidents",
"prerequisites": [
"Access to production databases",
"Familiarity with query analysis tools"
],
"estimated_duration": "15-30 minutes",
"risk_level": "medium",
"runbook_url": "https://wiki.company.com/runbooks/db-incident"
}
}Runbook Best Practices
Start Simple, Iterate
Begin with manual procedures documented as simple runbooks. Gradually add automation, error handling, and advanced features as you gain confidence.
Include Rollback Steps
Every runbook that makes changes should include clear rollback procedures. Test rollback paths as thoroughly as the main workflow.
Add Comprehensive Logging
Log all significant actions, decisions, and state changes. Detailed logs are invaluable for debugging and auditing.
Test in Non-Production First
Always test new runbooks in staging or development environments before using them in production. Validate both success and failure scenarios.
Document Prerequisites and Context
Include clear documentation about when to use the runbook, what permissions are needed, and what side effects to expect. Link to related documentation and runbooks.
Version Control Everything
Treat runbooks like code. Use semantic versioning, maintain change logs, and consider storing runbook definitions in Git for additional version control.
Complete Runbook Example
Production Incident Response Runbook
{
"name": "Production Database Incident Response",
"version": "2.1.0",
"description": "Comprehensive incident response procedure for production database issues",
"author": "DBA Team",
"last_updated": "2026-02-07",
"tags": ["incident", "production", "emergency"],
"parameters": {
"database": {
"type": "string",
"required": true,
"description": "Affected database identifier"
},
"incident_type": {
"type": "enum",
"values": ["performance", "connectivity", "data_corruption", "other"],
"required": true
},
"severity": {
"type": "enum",
"values": ["low", "medium", "high", "critical"],
"default": "high"
}
},
"steps": [
{
"name": "Create Incident Ticket",
"action": "create_ticket",
"parameters": {
"title": "Database Incident: {{database}}",
"severity": "{{severity}}",
"type": "{{incident_type}}"
},
"output": "incident_ticket"
},
{
"name": "Notify On-Call Team",
"action": "notify",
"parameters": {
"channels": ["pagerduty", "slack"],
"message": "Incident response runbook started for {{database}}",
"ticket_link": "{{incident_ticket.url}}"
}
},
{
"name": "Capture System State",
"action": "capture_state",
"parallel": true,
"substeps": [
{
"name": "Get Current Metrics",
"action": "get_metrics",
"parameters": {
"database": "{{database}}",
"metrics": ["cpu", "memory", "connections", "query_stats"]
},
"output": "current_metrics"
},
{
"name": "Get Slow Queries",
"action": "get_slow_queries",
"parameters": {
"database": "{{database}}",
"limit": 20
},
"output": "slow_queries"
},
{
"name": "Check Replication",
"action": "check_replication",
"parameters": {
"database": "{{database}}"
},
"output": "replication_status"
}
]
},
{
"name": "Analyze Root Cause",
"action": "analyze_incident",
"parameters": {
"metrics": "{{current_metrics}}",
"slow_queries": "{{slow_queries}}",
"replication": "{{replication_status}}"
},
"output": "analysis"
},
{
"name": "Apply Quick Fixes",
"condition": "{{severity == 'critical' || severity == 'high'}}",
"substeps": [
{
"name": "Kill Long-Running Queries",
"condition": "{{analysis.has_long_queries}}",
"action": "kill_queries",
"parameters": {
"min_duration_seconds": 300,
"exclude_admin": true
}
},
{
"name": "Clear Connection Pool",
"condition": "{{analysis.connection_pool_exhausted}}",
"action": "clear_connections",
"parameters": {
"database": "{{database}}",
"idle_only": true
}
}
]
},
{
"name": "Generate Recommendations",
"action": "generate_recommendations",
"parameters": {
"analysis": "{{analysis}}"
},
"output": "recommendations"
},
{
"name": "Request Approval for Structural Changes",
"condition": "{{recommendations.requires_schema_changes}}",
"action": "request_approval",
"parameters": {
"approvers": ["@dba-lead", "@engineering-director"],
"timeout": "30m",
"details": "{{recommendations}}"
},
"output": "approval"
},
{
"name": "Apply Approved Recommendations",
"condition": "{{approval.approved}}",
"action": "apply_recommendations",
"parameters": {
"recommendations": "{{recommendations}}",
"database": "{{database}}"
}
},
{
"name": "Verify Resolution",
"action": "health_check",
"parameters": {
"database": "{{database}}",
"wait_duration": "5m"
},
"output": "health_status"
},
{
"name": "Update Incident Ticket",
"action": "update_ticket",
"parameters": {
"ticket_id": "{{incident_ticket.id}}",
"status": "resolved",
"resolution_notes": "{{analysis}}\n\n{{recommendations}}"
}
},
{
"name": "Generate Incident Report",
"action": "create_report",
"parameters": {
"template": "incident_postmortem",
"data": {
"incident": "{{incident_ticket}}",
"analysis": "{{analysis}}",
"recommendations": "{{recommendations}}",
"resolution": "{{health_status}}"
},
"format": "pdf"
},
"output": "report"
},
{
"name": "Send Final Notification",
"action": "notify",
"parameters": {
"channels": ["slack", "email"],
"recipients": ["@dba-team", "@ops-team"],
"message": "Incident resolved for {{database}}",
"attachments": ["{{report}}"]
}
}
],
"error_handling": {
"on_error": "notify_and_pause",
"notification_channels": ["pagerduty", "slack"],
"allow_manual_intervention": true,
"max_retries": 3,
"retry_delay": "30s"
},
"rollback": {
"supported": true,
"automatic": false,
"steps": [
{
"name": "Revert Configuration Changes",
"action": "revert_config",
"parameters": {
"snapshot_id": "{{system.execution_start_snapshot}}"
}
},
{
"name": "Restore Previous State",
"action": "restore_state",
"parameters": {
"backup_id": "{{system.pre_execution_backup}}"
}
}
]
},
"notifications": {
"on_start": {
"channels": ["slack"],
"message": "Incident response started for {{database}}"
},
"on_complete": {
"channels": ["slack", "email"],
"include_report": true
},
"on_error": {
"channels": ["pagerduty", "slack"],
"severity": "critical"
}
},
"metadata": {
"estimated_duration": "20-45 minutes",
"risk_level": "medium",
"requires_approval": true,
"testing_status": "validated",
"last_test_date": "2026-02-01",
"documentation_url": "https://wiki.company.com/incident-response"
}
}